NBISweden / IgDiscover-legacy

Analyze antibody repertoires and discover new V genes from high-throughput sequencing reads
https://www.igdiscover.se
MIT License

Error during J-discovery in IgDiscover v0.9 #81

Closed willbradshaw closed 6 years ago

willbradshaw commented 6 years ago

Hi there,

I recently installed IgDiscover (v0.9) with conda and tried to run the test analysis as specified on the website. However, I'm consistently getting the following error:

ERROR: No J genes were discovered in this iteration (file 'iteration-01/new_J.fasta' is empty)! Cannot continue.
Check whether the starting database is of the correct chain type (heavy, light lambda, light kappa). It needs to match the type of sequences you analyze.

This seems to be arising due to igdiscover discoverj not writing any genes in a previous step:

INFO: 1 records
INFO: After filtering by allele ratio and/or cross-mapping ratio, 1 candidates remain
INFO: Wrote 0 genes

I tried re-running the test analysis using IgDiscover v0.8 (installed in a separate conda environment) and this error did not occur (though the run still failed later on for a different reason). The v0.8 and v0.9 behaviours are consistent across my local machine and our cluster; I've attached the log from my local v0.9 run. I get the same error when I try to run the v0.9 pipeline on my own data.

igdiscover-test.log

I'm guessing this is something arising from the changes in J-discovery that were made at the last version; I couldn't find anything in the documentation about how to handle this. Is this a bug, or am I missing something?
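The symptom described above (an empty 'iteration-01/new_J.fasta') can be confirmed directly before digging into the logs. The following is a minimal sketch, not part of IgDiscover: the helper names `count_fasta_records` and `is_empty_fasta` are made up for illustration, and the only assumption is that each FASTA record starts with a `>` header line.

```python
from typing import Iterable


def count_fasta_records(lines: Iterable[str]) -> int:
    """Count FASTA records by counting header lines (those starting with '>')."""
    return sum(1 for line in lines if line.startswith(">"))


def is_empty_fasta(path: str) -> bool:
    """Return True if the FASTA file at `path` contains no records.

    Hypothetical helper for illustration; IgDiscover does not ship it.
    """
    with open(path) as handle:
        return count_fasta_records(handle) == 0
```

Calling `is_empty_fasta("iteration-01/new_J.fasta")` from the analysis directory should return `True` in the failing case reported here, matching the "Wrote 0 genes" log line.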

MartinMatthewC commented 6 years ago

Hi Will,

That sounds like a bug - I will pass this information on to Marcel, who is involved in the coding. Thanks for letting us know. I am not seeing this error with my last series of runs (human IgM libraries, IgDiscover development version 0.9+12.gc7a9d02), so it may be connected with either the particular version you have been using or some combination of the library type and the config file settings.

We recently updated the J gene setting so that it would 'fix' the Js in the first iteration (since we find that all the J alleles are discoverable in the first iteration, so there is no need to rediscover them in subsequent iterations, primarily because we were seeing the error you noticed from the v0.8 version, namely that the Js discovered in the first iteration were then .)

Is it possible for you to send us the igdiscover.yaml config file so that we can check this? Also, does your starting database contain a J.fasta file with the reference J genes expected in your library? And finally, is your library one of IgM, IgG or IgK (or even a mixture of them; it will still work so long as your reference database is also a mix of IGH Vs, Ds and Js, and IGK and IGL Vs and Js)?

Martin

PS, I just saw your next email.

When we work with non-mammalian systems we can still get igdiscover to work for Vs, but Js are certainly more difficult. The program works better if you have at least some of the correct Js within your starting database, and it is also much more efficient if you can get the length of the Vs and Js correct. The length is more important than the exact sequence in the starting database: the program will correct the sequence, but it may give you an incorrect length, since it bases that on the length of the sequences in the reference/starting database.

One tip for working with non-mammalian species is to set the ignore_j option in the igdiscover.yaml file as follows the first time you try it:

# Candidate discovery settings
#
# When discovering new V genes, ignore whether a J gene has been assigned
# and also ignore its %SHM.
# true: yes, ignore the J
# false: do not ignore J assignment, do not ignore its %SHM
#
ignore_j: true

This allows you to get some Vs identified, after which it is easier to identify the Js. At that point, rerun the program with the following setting:

ignore_j: false


From: Will Bradshaw [notifications@github.com] Sent: Thursday, April 05, 2018 3:37 PM To: NBISweden/IgDiscover Cc: Subscribed Subject: [NBISweden/IgDiscover] Error during J-discovery in IgDiscover v0.9 (#81)


marcelm commented 6 years ago

In the previous IgDiscover version 0.8, J discovery was possible, but not done as part of the normal pipeline, so you would have needed to run igdiscover discoverj separately with the appropriate options. In IgDiscover 0.9, it is run as part of igdiscover run and then you get these errors when it doesn’t work. To get rid of the error, set

j_discover:
  ...
  propagate: false

at the bottom of igdiscover.yaml. J discovery will still be done in iteration 1 (as @MartinMatthewC explained), but the found Js will not be used in any iteration.

I noted from your log file that your dataset is quite small: There are just a couple of thousand sequences in the input and after preprocessing, only 1044 sequences end up being run through IgBLAST. It’s quite likely that this also is a reason why it fails.
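To judge whether an input is large enough before starting a run, the number of reads can be counted up front. This is a rough sketch, not IgDiscover code: `count_fastq_reads` and `open_maybe_gzip` are hypothetical helper names, and it assumes the common four-lines-per-record FASTQ layout (no wrapped sequence lines).

```python
import gzip
from typing import Iterable


def count_fastq_reads(lines: Iterable[str]) -> int:
    """Count FASTQ records, assuming exactly four lines per record."""
    n_lines = sum(1 for _ in lines)
    return n_lines // 4


def open_maybe_gzip(path: str):
    """Open a plain or gzip-compressed text file for reading."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")
```

With a hypothetical input file name, usage would look like `with open_maybe_gzip("reads.1.fastq.gz") as handle: print(count_fastq_reads(handle))`. Note that this counts raw reads; the number that reaches IgBLAST after preprocessing will be lower.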

willbradshaw commented 6 years ago

Hi Martin!

The bug I'm reporting here is for the test data set from the website, so the igdiscover.yaml config file is the one provided there; I just copied it into the test directory as specified here: igdiscover.yaml.zip

My own data is an IgM library; I have the IgH locus partially assembled so I have some valid germline Vs and Js already. :)

Thanks! Will

willbradshaw commented 6 years ago

Hi Marcel,

Thanks for the information! I re-ran the test analysis with propagate: false and now it runs without errors, finding 1 new V and 0 new Js. Is this as expected?

The dataset from the log file is the one provided on the igdiscover.se website for testing, not my own data; now I've got it working on the test I will have another look at my own read sets.

Is there a test dataset I can use to confirm that inference of missing Js is working as expected? I'm pretty sure I'm missing some germline Js from my databases at the moment.

marcelm commented 6 years ago

> The dataset from the log file is the one provided on the igdiscover.se website for testing, not my own data

Ah of course, thanks. I will then need to update the instructions for how to run the test dataset.

> finding 1 new V and 0 new Js. Is this as expected?

Yes, I created the test dataset so that IgDiscover would find just a single V. The dataset predates J discovery, which is why no Js are found.

> Is there a test dataset I can use to confirm that inference of missing Js is working as expected? I'm pretty sure I'm missing some germline Js from my databases at the moment.

I would just run it and see what the results are. Since you were able to run the test dataset, I would be quite confident that IgDiscover itself works. If you have a huge dataset, you could test it on only the first 1 million reads or fewer (set limit: 1000000 in the configuration) so you see the results a bit quicker. We’ve done our testing mostly on non-public datasets, which I cannot provide at the moment. I’m actively working on making IgDiscover results available for some datasets that are on SRA, but even that would be only for V discovery at the moment.
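For a quick look outside the pipeline, the same idea (use only the first N reads) can be approximated by truncating the FASTQ file by hand. This is only a sketch of that idea and makes no claim about how the `limit` setting is implemented internally. The names `head_fastq` and `write_head` are hypothetical, and four-line FASTQ records are assumed.

```python
from itertools import islice
from typing import Iterable, Iterator


def head_fastq(lines: Iterable[str], limit: int) -> Iterator[str]:
    """Yield the lines of the first `limit` FASTQ records (four lines each)."""
    return islice(lines, 4 * limit)


def write_head(in_path: str, out_path: str, limit: int) -> None:
    """Copy the first `limit` records of an uncompressed FASTQ file to `out_path`."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(head_fastq(src, limit))
```

For paired-end data, the same `limit` would have to be applied to both read files so that the pairs stay in sync. In most cases the `limit` setting in the configuration is the simpler route.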

willbradshaw commented 6 years ago

Okay, I ran the whole pipeline on one of my complete datasets today (with J-propagation disabled), and apart from the CDR3 detection issue I mentioned in another thread it seems to work fine now. :)

marcelm commented 6 years ago

Ah, just saw that this can be closed, thanks for letting us know!