NBISweden / IgDiscover-legacy

Analyze antibody repertoires and discover new V genes from high-throughput sequencing reads
https://www.igdiscover.se
MIT License
17 stars 10 forks source link

Failure to identify V genes #97

Closed carollia closed 5 years ago

carollia commented 5 years ago

Hello,

I am trying to identify V genes from Laurasiatherian mammals and the analysis keeps hitting an error in iteration 1 when it tries to identify V genes (usually around 60-70% completion). As an initial test I used just 550 sequences, all of which merged. I have tried running IgDiscover numerous times -- using the default .yaml file, playing around with the filter settings and ignoring J (which caused it to fail because no Js were identified); I have also tried using human V, D and J genes as references and also genes from more closely related mammals. Because my data are quite distantly diverged from humans, for example, the inferred V mutation rates are 10-30%; I don't know if this could be causing the issue.

I would appreciate any insights into what I might be doing wrong. It seems like some how there are no V genes remaining after prefiltering so maybe I have to change one of those parameters? I have included a representative new_V_pregermline.log output below as well as a common error that I have been getting. Please let me know if I can provide any other information as I am not sure what is relevant for troubleshooting.

Thank you so much for your help! Hannah

Looking at the new_V_pregermline.log or other V log files it seems like no V genes are remaining (example below):

INFO: 338 unique sequences in whitelist INFO: Marked 0 candidates as 'too_low_dbdiff' INFO: Marked 0 candidates as 'too_low_CDR3s_exact' INFO: Marked 0 candidates as 'too_high_CDR3_shared_ratio' INFO: Marked 0 candidates as 'too_low_Js_exact' INFO: Marked 0 candidates as 'too_low_cluster_size' INFO: Table read from 'iteration-01/candidates.tab' contains 0 candidate V gene sequences. 0 remain after per-entry filtering INFO: Of those, 0 are protected by the whitelist Traceback (most recent call last): File "/home/hkfrank/Applications/miniconda3/envs/igdiscover/bin/igdiscover", line 12, in sys.exit(main()) File "/home/hkfrank/Applications/miniconda3/envs/igdiscover/lib/python3.6/site-packages/igdiscover/main.py", line 108, in main args.func(args) File "/home/hkfrank/Applications/miniconda3/envs/igdiscover/lib/python3.6/site-packages/igdiscover/germlinefilter.py", line 463, in main overall_table.loc[overall_table.is_filtered > 0, 'is_duplicate'] = False File "/home/hkfrank/Applications/miniconda3/envs/igdiscover/lib/python3.6/site-packages/pandas/core/indexing.py", line 189, in setitem self._setitem_with_indexer(indexer, value) File "/home/hkfrank/Applications/miniconda3/envs/igdiscover/lib/python3.6/site-packages/pandas/core/indexing.py", line 356, in _setitem_with_indexer raise ValueError("cannot set a frame with no " ValueError: cannot set a frame with no defined index and a scalar

Another common error I get is: RuleException: CalledProcessError in line 688 of /home/hkfrank/Applications/miniconda3/envs/igdiscover/lib/python3.6/site-packages/igdiscover/Snakefile: Command ' set -euo pipefail;  igdiscover germlinefilter --whitelist=database/V.fasta --unique-CDR3=2 --cluster-size=1 --unique-J=2 --cross-mapping-ratio=0.02 --clonotype-ratio=0.01 --exact-ratio=0.01 --cdr3-shared-ratio=0.8 --unique-D-ratio=0.1 --unique-D-threshold=10 --annotate=iteration-01/annotated_V_germline.tab --fasta=iteration-01/new_V_germline.fasta iteration-01/candidates.tab  2> >(tee iteration-01/new_V_germline.log >&2)  > iteration-01/new_V_germline.tab ' returned non-zero exit status 1.

MartinMatthewC commented 5 years ago

Hi Hannah,

Did you mean you used a database of 550 sequences or you analyzed 550 sequences in total? We generally analyze in the order of hundreds of thousands of sequences from NGS runs, although it can work to some extent using tens of thousands of sequences. It is important that the library contains enough naive sequences as the program uses a clustering algorithm to identify the germline sequences. We usually use IgM/IgD libraries but light chain IgK and IgL will usually work.

For new species it is best that you try to create a starting database that is as close as you can find to at least some of the genes present in that species when you begin. For example by extracting sequences from a reference genome. It doesn't even have to be the same species for it to work in the program but the more different the starting database is from your test species the greater the chance that you will encounter problems. The most common cause of problems is issues related to J alleles as the program uses the number of J alleles as a primary means of identifying independent recombinations and it is the number of independent recombinations that gives confidence in V allele inference. What I suggest you do is set the unique_cdr3s and unique_js to 2 in both pregremline and germline settings in the igdiscover.yaml file and run the program. It will probably stop at the first iteration but it may have identified some J allele sequences that you can then add to your J database that will allow you to get more information the next time you run the program. I hope this is of some help, Martin


From: carollia notifications@github.com Sent: Friday, May 3, 2019 1:00 AM To: NBISweden/IgDiscover Cc: Subscribed Subject: [NBISweden/IgDiscover] Failure to identify V genes (#97)

Hello,

I am trying to identify V genes from Laurasiatherian mammals and the analysis keeps hitting an error in iteration 1 when it tries to identify V genes (usually around 60-70% completion). As an initial test I used just 550 sequences, all of which merged. I have tried running IgDiscover numerous times -- using the default .yaml file, playing around with the filter settings and ignoring J (which caused it to fail because no Js were identified); I have also tried using human V, D and J genes as references and also genes from more closely related mammals. Because my data are quite distantly diverged from humans, for example, the inferred V mutation rates are 10-30%; I don't know if this could be causing the issue.

I would appreciate any insights into what I might be doing wrong. It seems like some how there are no V genes remaining after prefiltering so maybe I have to change one of those parameters? I have included a representative new_V_pregermline.log output below as well as a common error that I have been getting. Please let me know if I can provide any other information as I am not sure what is relevant for troubleshooting.

Thank you so much for your help! Hannah

Looking at the new_V_pregermline.log or other V log files it seems like no V genes are remaining (example below):

INFO: 338 unique sequences in whitelist INFO: Marked 0 candidates as 'too_low_dbdiff' INFO: Marked 0 candidates as 'too_low_CDR3s_exact' INFO: Marked 0 candidates as 'too_high_CDR3_shared_ratio' INFO: Marked 0 candidates as 'too_low_Js_exact' INFO: Marked 0 candidates as 'too_low_cluster_size' INFO: Table read from 'iteration-01/candidates.tab' contains 0 candidate V gene sequences. 0 remain after per-entry filtering INFO: Of those, 0 are protected by the whitelist Traceback (most recent call last): File "/home/hkfrank/Applications/miniconda3/envs/igdiscover/bin/igdiscover", line 12, in sys.exit(main()) File "/home/hkfrank/Applications/miniconda3/envs/igdiscover/lib/python3.6/site-packages/igdiscover/main.py", line 108, in main args.func(args) File "/home/hkfrank/Applications/miniconda3/envs/igdiscover/lib/python3.6/site-packages/igdiscover/germlinefilter.py", line 463, in main overall_table.loc[overall_table.is_filtered > 0, 'is_duplicate'] = False File "/home/hkfrank/Applications/miniconda3/envs/igdiscover/lib/python3.6/site-packages/pandas/core/indexing.py", line 189, in setitem self._setitem_with_indexer(indexer, value) File "/home/hkfrank/Applications/miniconda3/envs/igdiscover/lib/python3.6/site-packages/pandas/core/indexing.py", line 356, in _setitem_with_indexer raise ValueError("cannot set a frame with no " ValueError: cannot set a frame with no defined index and a scalar

Another common error I get is: RuleException: CalledProcessError in line 688 of /home/hkfrank/Applications/miniconda3/envs/igdiscover/lib/python3.6/site-packages/igdiscover/Snakefile: Command ' set -euo pipefail; igdiscover germlinefilter --whitelist=database/V.fasta --unique-CDR3=2 --cluster-size=1 --unique-J=2 --cross-mapping-ratio=0.02 --clonotype-ratio=0.01 --exact-ratio=0.01 --cdr3-shared-ratio=0.8 --unique-D-ratio=0.1 --unique-D-threshold=10 --annotate=iteration-01/annotated_V_germline.tab --fasta=iteration-01/new_V_germline.fasta iteration-01/candidates.tab 2> >(tee iteration-01/new_V_germline.log >&2) > iteration-01/new_V_germline.tab ' returned non-zero exit status 1.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://github.com/NBISweden/IgDiscover/issues/97, or mute the threadhttps://github.com/notifications/unsubscribe-auth/ACCRULOXWAF5XRJEB4YQT5LPTNXANANCNFSM4HKLSG2A.

När du skickar e-post till Karolinska Institutet (KI) innebär detta att KI kommer att behandla dina personuppgifter. Här finns information om hur KI behandlar personuppgifterhttps://ki.se/medarbetare/integritetsskyddspolicy.

Sending email to Karolinska Institutet (KI) will result in KI processing your personal data. You can read more about KI’s processing of personal data herehttps://ki.se/en/staff/data-protection-policy.

carollia commented 5 years ago

Hi Martin,

Thank you for your reply and advice. I was running it with a very low number but now I am running the pipeline with 2,572,294 IgM reads that were able to be merged with PEAR. I am using starting V genes from a related species as well as the results from blasting human and other laurasiatherian V genes to my target genome. I did a similar thing to pull out "J genes" from my target species' genome. I also igblasted 10,000 sequences to human and took a few common, long inferred J sequences found in IgMs from that analysis as starting J genes. For the D genes I used inferred D sequences from the IgBlast analysis I did for the Js. I also changed the unique_cdr3s and unique_js to 2 in both the pregermline and germline settings as you suggested. Nonetheless I keep getting the following error:

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

File "/home/hkfrank/Applications/miniconda3/envs/igdiscover/bin/igdiscover", line 12, in

sys.exit(main())

File "/home/hkfrank/Applications/miniconda3/envs/igdiscover/lib/python3.6/site-packages/igdiscover/main.py", line 108, in main

args.func(args)

File "/home/hkfrank/Applications/miniconda3/envs/igdiscover/lib/python3.6/site-packages/igdiscover/igblast.py", line 375, in main

raw_output=raw_output, use_cache=use_cache):

File "/home/hkfrank/Applications/miniconda3/envs/igdiscover/lib/python3.6/site-packages/igdiscover/igblast.py", line 344, in igblast

for igblast_output, igblast_records in pool.imap(runner, chunks,

chunksize=1):

File "/home/hkfrank/Applications/miniconda3/envs/igdiscover/lib/python3.6/multiprocessing/pool.py", line 761, in next

raise value

KeyError: 'gb|GQ923630.1|'

[Mon May 27 13:14:51 2019]

Error in rule igdiscover_igblast:

jobid: 52

output: iteration-01/assigned.tab.gz, iteration-01/stats/assigned.json

log: iteration-01/igblast.log

RuleException:

CalledProcessError in line 533 of /home/hkfrank/Applications/miniconda3/envs/igdiscover/lib/python3.6/site-packages/igdiscover/Snakefile:

Command ' set -euo pipefail; time igdiscover igblast --sequence-type=Ig --threads=16 --stats=iteration-01/stats/assigned.json iteration-01/database reads/sequences.fasta.gz | pigz 2>&1 > iteration-01/assigned.tab.gz | tee iteration-01/igblast.log >&2 ' returned non-zero exit status 1.

File "/home/hkfrank/Applications/miniconda3/envs/igdiscover/lib/python3.6/site-packages/igdiscover/Snakefile", line 533, in __rule_igdiscover_igblast

File "/home/hkfrank/Applications/miniconda3/envs/igdiscover/lib/python3.6/concurrent/futures/thread.py", line 56, in run

Removing output files of failed job igdiscover_igblast since they might be corrupted:

iteration-01/assigned.tab.gz

Shutting down, this might take some time.

Exiting because a job execution failed. Look above for error message

I tried removing the sequence for which it said there was a key error and rerunning it but it just failed at the same place citing a different sequence. I am running the program with all the default settings of the yaml file except that I have switched the number of iterations to 2, uncommented out pear as the aligner and changed the germline settings for unique_cdr3s and unique_js to 2. My starting databases have 182 V genes, 34 J genes and 90 D genes. Do you have any idea how to fix this issue or what I might be doing wrong? Please let me know if I can provide any additional information.

Thanks so much!

Cheers, Hannah

On Thu, May 2, 2019 at 8:11 PM MartinMatthewC notifications@github.com wrote:

Hi Hannah,

Did you mean you used a database of 550 sequences or you analyzed 550 sequences in total? We generally analyze in the order of hundreds of thousands of sequences from NGS runs, although it can work to some extent using tens of thousands of sequences. It is important that the library contains enough naive sequences as the program uses a clustering algorithm to identify the germline sequences. We usually use IgM/IgD libraries but light chain IgK and IgL will usually work.

For new species it is best that you try to create a starting database that is as close as you can find to at least some of the genes present in that species when you begin. For example by extracting sequences from a reference genome. It doesn't even have to be the same species for it to work in the program but the more different the starting database is from your test species the greater the chance that you will encounter problems. The most common cause of problems is issues related to J alleles as the program uses the number of J alleles as a primary means of identifying independent recombinations and it is the number of independent recombinations that gives confidence in V allele inference. What I suggest you do is set the unique_cdr3s and unique_js to 2 in both pregremline and germline settings in the igdiscover.yaml file and run the program. It will probably stop at the first iteration but it may have identified some J allele sequences that you can then add to your J database that will allow you to get more information the next time you run the program. I hope this is of some help, Martin


From: carollia notifications@github.com Sent: Friday, May 3, 2019 1:00 AM To: NBISweden/IgDiscover Cc: Subscribed Subject: [NBISweden/IgDiscover] Failure to identify V genes (#97)

Hello,

I am trying to identify V genes from Laurasiatherian mammals and the analysis keeps hitting an error in iteration 1 when it tries to identify V genes (usually around 60-70% completion). As an initial test I used just 550 sequences, all of which merged. I have tried running IgDiscover numerous times -- using the default .yaml file, playing around with the filter settings and ignoring J (which caused it to fail because no Js were identified); I have also tried using human V, D and J genes as references and also genes from more closely related mammals. Because my data are quite distantly diverged from humans, for example, the inferred V mutation rates are 10-30%; I don't know if this could be causing the issue.

I would appreciate any insights into what I might be doing wrong. It seems like some how there are no V genes remaining after prefiltering so maybe I have to change one of those parameters? I have included a representative new_V_pregermline.log output below as well as a common error that I have been getting. Please let me know if I can provide any other information as I am not sure what is relevant for troubleshooting.

Thank you so much for your help! Hannah

Looking at the new_V_pregermline.log or other V log files it seems like no V genes are remaining (example below):

INFO: 338 unique sequences in whitelist INFO: Marked 0 candidates as 'too_low_dbdiff' INFO: Marked 0 candidates as 'too_low_CDR3s_exact' INFO: Marked 0 candidates as 'too_high_CDR3_shared_ratio' INFO: Marked 0 candidates as 'too_low_Js_exact' INFO: Marked 0 candidates as 'too_low_cluster_size' INFO: Table read from 'iteration-01/candidates.tab' contains 0 candidate V gene sequences. 0 remain after per-entry filtering INFO: Of those, 0 are protected by the whitelist Traceback (most recent call last): File "/home/hkfrank/Applications/miniconda3/envs/igdiscover/bin/igdiscover", line 12, in sys.exit(main()) File "/home/hkfrank/Applications/miniconda3/envs/igdiscover/lib/python3.6/site-packages/igdiscover/main.py", line 108, in main args.func(args) File "/home/hkfrank/Applications/miniconda3/envs/igdiscover/lib/python3.6/site-packages/igdiscover/germlinefilter.py", line 463, in main overall_table.loc[overall_table.is_filtered > 0, 'is_duplicate'] = False File "/home/hkfrank/Applications/miniconda3/envs/igdiscover/lib/python3.6/site-packages/pandas/core/indexing.py", line 189, in setitem self._setitem_with_indexer(indexer, value) File "/home/hkfrank/Applications/miniconda3/envs/igdiscover/lib/python3.6/site-packages/pandas/core/indexing.py", line 356, in _setitem_with_indexer raise ValueError("cannot set a frame with no " ValueError: cannot set a frame with no defined index and a scalar

Another common error I get is: RuleException: CalledProcessError in line 688 of /home/hkfrank/Applications/miniconda3/envs/igdiscover/lib/python3.6/site-packages/igdiscover/Snakefile: Command ' set -euo pipefail; igdiscover germlinefilter --whitelist=database/V.fasta --unique-CDR3=2 --cluster-size=1 --unique-J=2 --cross-mapping-ratio=0.02 --clonotype-ratio=0.01 --exact-ratio=0.01 --cdr3-shared-ratio=0.8 --unique-D-ratio=0.1 --unique-D-threshold=10 --annotate=iteration-01/annotated_V_germline.tab --fasta=iteration-01/new_V_germline.fasta iteration-01/candidates.tab 2>

(tee iteration-01/new_V_germline.log >&2) > iteration-01/new_V_germline.tab ' returned non-zero exit status 1.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub< https://github.com/NBISweden/IgDiscover/issues/97>, or mute the thread< https://github.com/notifications/unsubscribe-auth/ACCRULOXWAF5XRJEB4YQT5LPTNXANANCNFSM4HKLSG2A

.

När du skickar e-post till Karolinska Institutet (KI) innebär detta att KI kommer att behandla dina personuppgifter. Här finns information om hur KI behandlar personuppgifter< https://ki.se/medarbetare/integritetsskyddspolicy>.

Sending email to Karolinska Institutet (KI) will result in KI processing your personal data. You can read more about KI’s processing of personal data herehttps://ki.se/en/staff/data-protection-policy.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/NBISweden/IgDiscover/issues/97#issuecomment-488909274, or mute the thread https://github.com/notifications/unsubscribe-auth/AJTDD7UPC2ZGCXDNTNWU3SLPTOUNJANCNFSM4HKLSG2A .

-- Hannah Frank Postdoctoral Research Fellow Dept. of Pathology, Stanford University hkfrank@stanford.edu

MartinMatthewC commented 5 years ago

Hi Carollia, It looks like you are approaching this the same way that I would - build up as close a startig database as possible and then try to get IgDiscover to infer the species specific database from your starting library. Unfortunately our main bioinformatician, Mateusz, who is better at guessing what these error messages mean, is away on vacation for the next two weeks so I won't be able to give you a definite answer before he returns. From looking at the error message myself I wonder if it is a CDR3 detection issue. Within IgDiscover there is a feature that stops the IgBLAST assignment if it fails to detect CDR3s in a certain proportion of the library.

This can be for two reasons. First, it can be due to the program failing to detect CDR3s due to some species specific issue - for example if the CDR3s have a specific nucleotide motif in your species that IgDiscover does not utilize then it will fail to detect a CDR3 in many sequences and then the 'stop' feature will be triggered and the analysis will end.

Second, it may be because the library is mixed, with only a low percentage of the merged sequences being Ig sequences of the correct type.

In both situations IgDiscover will fail to identify a CDR3 in enough sequences and will then end the analysis.

You can manually test the proportion of Ig sequences using the online IgBLAST - just take 100 merged sequences in fasta format and put them through the online IgBLAST (just use the human database). This will identify any sequences that look like Ig sequences and will identify a CDR3 for them. What you are looking for is the overall percentage of sequences that it doesn't identify as VH containing sequences.

If, for example it only identifies 30% of the sequences as VH this would probably be the reason why the IgDiscover program failed to work - and there is a simple solution - just grep out the Ig specific merged sequences (for example using a sequence from the constant region downstream of the amplification primer) and use that filtered list as the library to be analyzed. Best regards,

Martin


From: carollia notifications@github.com Sent: Tuesday, May 28, 2019 1:03 AM To: NBISweden/IgDiscover Cc: Martin Corcoran; Comment Subject: Re: [NBISweden/IgDiscover] Failure to identify V genes (#97)

Hi Martin,

Thank you for your reply and advice. I was running it with a very low number but now I am running the pipeline with 2,572,294 IgM reads that were able to be merged with PEAR. I am using starting V genes from a related species as well as the results from blasting human and other laurasiatherian V genes to my target genome. I did a similar thing to pull out "J genes" from my target species' genome. I also igblasted 10,000 sequences to human and took a few common, long inferred J sequences found in IgMs from that analysis as starting J genes. For the D genes I used inferred D sequences from the IgBlast analysis I did for the Js. I also changed the unique_cdr3s and unique_js to 2 in both the pregermline and germline settings as you suggested. Nonetheless I keep getting the following error:

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

File "/home/hkfrank/Applications/miniconda3/envs/igdiscover/bin/igdiscover", line 12, in

sys.exit(main())

File "/home/hkfrank/Applications/miniconda3/envs/igdiscover/lib/python3.6/site-packages/igdiscover/main.py", line 108, in main

args.func(args)

File "/home/hkfrank/Applications/miniconda3/envs/igdiscover/lib/python3.6/site-packages/igdiscover/igblast.py", line 375, in main

raw_output=raw_output, use_cache=use_cache):

File "/home/hkfrank/Applications/miniconda3/envs/igdiscover/lib/python3.6/site-packages/igdiscover/igblast.py", line 344, in igblast

for igblast_output, igblast_records in pool.imap(runner, chunks, chunksize=1):

File "/home/hkfrank/Applications/miniconda3/envs/igdiscover/lib/python3.6/multiprocessing/pool.py", line 761, in next

raise value

KeyError: 'gb|GQ923630.1|'

[Mon May 27 13:14:51 2019]

Error in rule igdiscover_igblast:

jobid: 52

output: iteration-01/assigned.tab.gz, iteration-01/stats/assigned.json

log: iteration-01/igblast.log

RuleException:

CalledProcessError in line 533 of /home/hkfrank/Applications/miniconda3/envs/igdiscover/lib/python3.6/site-packages/igdiscover/Snakefile:

Command ' set -euo pipefail; time igdiscover igblast --sequence-type=Ig --threads=16 --stats=iteration-01/stats/assigned.json iteration-01/database reads/sequences.fasta.gz | pigz 2>&1 > iteration-01/assigned.tab.gz | tee iteration-01/igblast.log >&2 ' returned non-zero exit status 1.

File "/home/hkfrank/Applications/miniconda3/envs/igdiscover/lib/python3.6/site-packages/igdiscover/Snakefile", line 533, in __rule_igdiscover_igblast

File "/home/hkfrank/Applications/miniconda3/envs/igdiscover/lib/python3.6/concurrent/futures/thread.py", line 56, in run

Removing output files of failed job igdiscover_igblast since they might be corrupted:

iteration-01/assigned.tab.gz

Shutting down, this might take some time.

Exiting because a job execution failed. Look above for error message

I tried removing the sequence for which it said there was a key error and rerunning it but it just failed at the same place citing a different sequence. I am running the program with all the default settings of the yaml file except that I have switched the number of iterations to 2, uncommented out pear as the aligner and changed the germline settings for unique_cdr3s and unique_js to 2. My starting databases have 182 V genes, 34 J genes and 90 D genes. Do you have any idea how to fix this issue or what I might be doing wrong? Please let me know if I can provide any additional information.

Thanks so much!

Cheers, Hannah

On Thu, May 2, 2019 at 8:11 PM MartinMatthewC notifications@github.com wrote:

Hi Hannah,

Did you mean you used a database of 550 sequences or you analyzed 550 sequences in total? We generally analyze in the order of hundreds of thousands of sequences from NGS runs, although it can work to some extent using tens of thousands of sequences. It is important that the library contains enough naive sequences as the program uses a clustering algorithm to identify the germline sequences. We usually use IgM/IgD libraries but light chain IgK and IgL will usually work.

For new species it is best that you try to create a starting database that is as close as you can find to at least some of the genes present in that species when you begin. For example by extracting sequences from a reference genome. It doesn't even have to be the same species for it to work in the program but the more different the starting database is from your test species the greater the chance that you will encounter problems. The most common cause of problems is issues related to J alleles as the program uses the number of J alleles as a primary means of identifying independent recombinations and it is the number of independent recombinations that gives confidence in V allele inference. What I suggest you do is set the unique_cdr3s and unique_js to 2 in both pregremline and germline settings in the igdiscover.yaml file and run the program. It will probably stop at the first iteration but it may have identified some J allele sequences that you can then add to your J database that will allow you to get more information the next time you run the program. I hope this is of some help, Martin


From: carollia notifications@github.com Sent: Friday, May 3, 2019 1:00 AM To: NBISweden/IgDiscover Cc: Subscribed Subject: [NBISweden/IgDiscover] Failure to identify V genes (#97)

Hello,

I am trying to identify V genes from Laurasiatherian mammals and the analysis keeps hitting an error in iteration 1 when it tries to identify V genes (usually around 60-70% completion). As an initial test I used just 550 sequences, all of which merged. I have tried running IgDiscover numerous times -- using the default .yaml file, playing around with the filter settings and ignoring J (which caused it to fail because no Js were identified); I have also tried using human V, D and J genes as references and also genes from more closely related mammals. Because my data are quite distantly diverged from humans, for example, the inferred V mutation rates are 10-30%; I don't know if this could be causing the issue.

I would appreciate any insights into what I might be doing wrong. It seems like some how there are no V genes remaining after prefiltering so maybe I have to change one of those parameters? I have included a representative new_V_pregermline.log output below as well as a common error that I have been getting. Please let me know if I can provide any other information as I am not sure what is relevant for troubleshooting.

Thank you so much for your help! Hannah

Looking at the new_V_pregermline.log or other V log files it seems like no V genes are remaining (example below):

INFO: 338 unique sequences in whitelist INFO: Marked 0 candidates as 'too_low_dbdiff' INFO: Marked 0 candidates as 'too_low_CDR3s_exact' INFO: Marked 0 candidates as 'too_high_CDR3_shared_ratio' INFO: Marked 0 candidates as 'too_low_Js_exact' INFO: Marked 0 candidates as 'too_low_cluster_size' INFO: Table read from 'iteration-01/candidates.tab' contains 0 candidate V gene sequences. 0 remain after per-entry filtering INFO: Of those, 0 are protected by the whitelist Traceback (most recent call last): File "/home/hkfrank/Applications/miniconda3/envs/igdiscover/bin/igdiscover", line 12, in sys.exit(main()) File "/home/hkfrank/Applications/miniconda3/envs/igdiscover/lib/python3.6/site-packages/igdiscover/main.py", line 108, in main args.func(args) File "/home/hkfrank/Applications/miniconda3/envs/igdiscover/lib/python3.6/site-packages/igdiscover/germlinefilter.py", line 463, in main overall_table.loc[overall_table.is_filtered > 0, 'is_duplicate'] = False File "/home/hkfrank/Applications/miniconda3/envs/igdiscover/lib/python3.6/site-packages/pandas/core/indexing.py", line 189, in setitem self._setitem_with_indexer(indexer, value) File "/home/hkfrank/Applications/miniconda3/envs/igdiscover/lib/python3.6/site-packages/pandas/core/indexing.py", line 356, in _setitem_with_indexer raise ValueError("cannot set a frame with no " ValueError: cannot set a frame with no defined index and a scalar

Another common error I get is: RuleException: CalledProcessError in line 688 of /home/hkfrank/Applications/miniconda3/envs/igdiscover/lib/python3.6/site-packages/igdiscover/Snakefile: Command ' set -euo pipefail; igdiscover germlinefilter --whitelist=database/V.fasta --unique-CDR3=2 --cluster-size=1 --unique-J=2 --cross-mapping-ratio=0.02 --clonotype-ratio=0.01 --exact-ratio=0.01 --cdr3-shared-ratio=0.8 --unique-D-ratio=0.1 --unique-D-threshold=10 --annotate=iteration-01/annotated_V_germline.tab --fasta=iteration-01/new_V_germline.fasta iteration-01/candidates.tab 2>

(tee iteration-01/new_V_germline.log >&2) > iteration-01/new_V_germline.tab ' returned non-zero exit status 1.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub< https://github.com/NBISweden/IgDiscover/issues/97>, or mute the thread< https://github.com/notifications/unsubscribe-auth/ACCRULOXWAF5XRJEB4YQT5LPTNXANANCNFSM4HKLSG2A

.

När du skickar e-post till Karolinska Institutet (KI) innebär detta att KI kommer att behandla dina personuppgifter. Här finns information om hur KI behandlar personuppgifter< https://ki.se/medarbetare/integritetsskyddspolicy>.

Sending email to Karolinska Institutet (KI) will result in KI processing your personal data. You can read more about KI’s processing of personal data herehttps://ki.se/en/staff/data-protection-policy.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/NBISweden/IgDiscover/issues/97#issuecomment-488909274, or mute the thread https://github.com/notifications/unsubscribe-auth/AJTDD7UPC2ZGCXDNTNWU3SLPTOUNJANCNFSM4HKLSG2A .

-- Hannah Frank Postdoctoral Research Fellow Dept. of Pathology, Stanford University hkfrank@stanford.edu

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://github.com/NBISweden/IgDiscover/issues/97?email_source=notifications&email_token=ACCRULJGRUYOW5GDYJLUWATPXRSFXA5CNFSM4HKLSG2KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWKTMSA#issuecomment-496318024, or mute the threadhttps://github.com/notifications/unsubscribe-auth/ACCRULJARI46PQBHWPD76GTPXRSFXANCNFSM4HKLSG2A.

marcelm commented 5 years ago

Hi Hannah, I’d like to look at this problem, but by default IgDiscover uses all available CPU cores which has the unfortunate side effect that the error message that it prints is not very helpful to me. Can you please run IgDiscover with a single core only using igdiscover run -j 1 and then show us the full error message?

marcelm commented 5 years ago

Ok, no need to send me the file since I was now able to reproduce the problem myself. The problem appears to be that IgDiscover cannot deal with spaces in the FASTA headers.

I’ll fix this, but for now, you can circumvent the problem by modifying the FASTA headers. When you have a header like this:

>gi|261498762|gb|GQ923630.1| Myotis lucifugus isolate My23 immunoglobulin heavy chain variable region (IGVH) gene, partial cds

Just remove the first space and everything following it:

    >gi|261498762|gb|GQ923630.1|

And since the names are quite long anyway, perhaps you want to shorten them even further while you’re at it.

You can use this sed command to automatically trim the headers (for example from V.fasta):

sed -r 's|>([^ ]*) .*|>\1|' V.fasta > new-V.fasta

Double-check the new-V.fasta file that it creates before you use it.

marcelm commented 5 years ago

I’ve fixed the problem now in the development version of IgDiscover. You can either continue to use the workaround described above or you can install the development version.

Perhaps it’s also time for us to make a new release, but it is going to take a couple of days until I have time.

carollia commented 5 years ago

Hi again,

Thank you for your insights and help. In my case it was not actually the relative proportion of Igs or detectable CDR3s (which were very high), nor was it an issue of spaces in the names of the reference fasta file. (I had already removed them.) However, your suggestion to run it with the flag of -j 1 to get more informative errors did help. There were warnings about the FR1 regions of some sequences not being evenly divisible by 3, which I guess interferes with the ability of the program to detect the various regions of the Ig sequence. I went back and made sure that all of my reference Vs were in frame, trimmed to be divisible by 3 (so each aa was translated and there were no extra nucleotides) and there were no stop codons in the reference genes. This seemed to fix the problem as it seems to have run successfully.

Thanks so much for your help!!!

Cheers, Hannah

On Tue, May 28, 2019 at 5:22 AM Marcel Martin notifications@github.com wrote:

I’ve fixed the problem now in the development version of IgDiscover. You can either continue to use the workaround described above or you can install the development version.

Perhaps it’s also time for us to make a new release, but it is going to take a couple of days until I have time.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/NBISweden/IgDiscover/issues/97?email_source=notifications&email_token=AJTDD7W2VJWYVN7246K5IU3PXUPYZA5CNFSM4HKLSG2KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWL6HKI#issuecomment-496493481, or mute the thread https://github.com/notifications/unsubscribe-auth/AJTDD7WWK6KXN4EBDYX43F3PXUPYZANCNFSM4HKLSG2A .

-- Hannah Frank Postdoctoral Research Fellow Dept. of Pathology, Stanford University hkfrank@stanford.edu

marcelm commented 5 years ago

Great to hear! If you have a suggestion of how to improve IgDiscover such that others don’t run into the same problem, please let us know.

If you see more problems, just open a new issue.