NBISweden / IgDiscover-legacy

Analyze antibody repertoires and discover new V genes from high-throughput sequencing reads
https://www.igdiscover.se
MIT License
17 stars 10 forks source link

error in_rule_read_length_histogram #122

Closed Jyoung-0108 closed 10 months ago

Jyoung-0108 commented 1 year ago

Hey,

I've been running IgDiscover (version 0.15.1) with my sequencing data, however, the run fails each time in this rule read_length_histogram. The log printed on Terminal and my configuration file are attached here.

log.txt igdiscover.txt

I also have another question: The structure of my Ig sequences is image and the 10X barcode is 16 nt-long, which the UMI is 10 nt-long. I'm wondering if I should set the 'barcode_length_5prime: ' to 10 or 16? I set it to 10. My sequencing dataset includes 2 sample and i think UMI represents the real UMI here in igdiscover, am i right about it? Thank you a lot!

marcelm commented 1 year ago

According to the log, the error appears to actually be in the igdiscover_igblast rule. If the below doesn’t fix the issue, can you show the contents of the file /lustre/home/acct-medsn/medsn/IMQ1_BCR/.snakemake/log/2023-07-06T155201.795624.snakemake.log mentioned in the very last line of the log?

Regading barcodes/UMIs: What IgDiscover calls "barcode" is actually the UMI, so your thinking is correct. However, IgDiscover expects the UMI to be at the 5' end of the read, so if you set barcode_length_5prime to 10, the first 10 bases of the 10X barcode would be interpreted as the UMI. I think this would mean that all reads from one GEM would be merged into a single one by IgDiscover. This might even explain the above igdiscover_igblast error.

My suggestion is to try setting barcode_length_5prime to 26 to make IgDiscover treat the combined 10X barcode plus UMI as UMI. (Disclaimer: I have not used IgDiscover myself with 10X data.)

Jyoung-0108 commented 1 year ago

According to the log, the error appears to actually be in the igdiscover_igblast rule. If the below doesn’t fix the issue, can you show the contents of the file /lustre/home/acct-medsn/medsn/IMQ1_BCR/.snakemake/log/2023-07-06T155201.795624.snakemake.log mentioned in the very last line of the log?

Regading barcodes/UMIs: What IgDiscover calls "barcode" is actually the UMI, so your thinking is correct. However, IgDiscover expects the UMI to be at the 5' end of the read, so if you set barcode_length_5prime to 10, the first 10 bases of the 10X barcode would be interpreted as the UMI. I think this would mean that all reads from one GEM would be merged into a single one by IgDiscover. This might even explain the above igdiscover_igblast error.

My suggestion is to try setting barcode_length_5prime to 26 to make IgDiscover treat the combined 10X barcode plus UMI as UMI. (Disclaimer: I have not used IgDiscover myself with 10X data.)

Thanks for your reply. And I've tried to corrected the barcode_length_5prime to 26 but it didn't fix my issue. I still got the problem. and i've attached the log file here. 2023-07-07T095124.994995.snakemake.log

Jyoung-0108 commented 1 year ago

hello,i've updated my issue and i'm waiting for your reply. Thanks!

marcelm commented 1 year ago

Can you please run igdiscover igblastwrap --sequence-type=Ig --threads=16 iteration-01/database reads/sequences.fasta.gz > /dev/null and paste the output here? There should be an error message.

Jyoung-0108 commented 1 year ago

Thanks! I ran the commands and got the following output.

INFO: Running IgBLAST on input reads Traceback (most recent call last): File "/lustre/home/acct-medsn/medsn/.conda/envs/igdiscover/bin/igdiscover", line 10, in sys.exit(main()) File "/lustre/home/acct-medsn/medsn/.conda/envs/igdiscover/lib/python3.10/site-packages/igdiscover/main.py", line 92, in main to_run() File "/lustre/home/acct-medsn/medsn/.conda/envs/igdiscover/lib/python3.10/site-packages/igdiscover/main.py", line 90, in to_run = lambda: module.main(args) File "/lustre/home/acct-medsn/medsn/.conda/envs/igdiscover/lib/python3.10/site-packages/igdiscover/cli/igblastwrap.py", line 122, in main "Processed {:10,d} sequences at {:.1f} ms/sequence".format(n, elapsed / n * 1e3) ZeroDivisionError: float division by zero

marcelm commented 1 year ago

Thanks, that is helpful. It appears that no reads remain after preprocessing. That is, IgBLAST works on zero input sequences. There are a couple of preprocessing steps and it is unclear to me at the moment which one filters out all reads. What are the contents of reads/2-pear.log?

Jyoung-0108 commented 1 year ago

I uploaded the log file here. 2-pear.log

marcelm commented 1 year ago

Ok, that step went fine. What are the contents of the files stats/trimmed.json and stats/groups.json?

marcelm commented 1 year ago

... and of reads/4-sequences.fasta.gz.log?

Jyoung-0108 commented 1 year ago

it seemed that all sequences were trimmed. groups.json: { "unique_barcodes": 0, "barcode_singletons": 0, "groups_written": 0, "group_size_1": 0, "group_size_2": 0, "group_size_3plus": 0 }

and trimmed.json: {"trimmed": 21081671}

4-sequences.fasta.gz.log

marcelm commented 1 year ago

The large number in trimmed.json is actually the number of sequences remaining after trimming, so it is good that it is high. ("Trimming" actually refers to primer removal, but since you haven’t configured any primers in your configuration, no trimming is done anyway.)

The problem appears to be in the "grouping" step where IgDiscover mainly groups together sequences that have the same UMI, but it also does some minor filtering. In particular, it filters out sequences that are too short, and it seems that in your case, it considers all sequences to be too short:

INFO: 21081671 sequences in input
INFO: 0 sequences long enough

Your configuration file has this:

minimum_merged_read_length: 300

So it appears that all your sequences end up as being shorter than 300 bp.

What is actually the read length of your 10X reads? Maybe the data isn’t suitable for IgDiscover.