NBISweden / IgDiscover-legacy

Analyze antibody repertoires and discover new V genes from high-throughput sequencing reads
https://www.igdiscover.se
MIT License
17 stars 10 forks source link

IgBLAST mangles sequence identifiers that are accessions #100

Closed carollia closed 5 years ago

carollia commented 5 years ago

Hello,

I reported this issue before and it resolved but it has returned and I can't figure out how to fix it. In a nutshell, it appears that IgBlast won't work possibly because of a key error with the V reference. I have pasted the error message (which resulted from using the -j 1 flag so I could get a more informative error) at the end of this issue.

Background: I am using a dataset of all IgMs that has worked before. I had 2,528,832 sequences, of which 866,289 were long enough and 354,443 dereplicated sequences were written so I think my dataset is sufficiently large. (And as I said it has worked before.) The main thing I changed were the reference genes -- my D and J references have sequences that the program says are duplicated but it just said it would ignore those. For the V gene reference, I made sure there are no spaces in the file, no "." characters in the names and no name is over 35 characters long. There are no "N"s in any of the sequences and all of the sequences are productive (no stop codons), in frame, integer multiples of 3bp (so whole number aa and no excess bps) and all of the framework regions are also integer multiples of 3bp. The sequences are derived from closely related species to the reads I'm trying to run IgDiscover on. Notably the sequence that it's citing as a key error was in the V reference that worked. My V reference file is quite large (710 sequences). Could this be causing an issue?

Thanks for your insights!

Cheers, Hannah

Traceback (most recent call last): File "/home/hkfrank/Applications/miniconda3/envs/igdiscover/bin/igdiscover", line 12, in sys.exit(main()) File "/home/hkfrank/Applications/miniconda3/envs/igdiscover/lib/python3.6/site-packages/igdiscover/main.py", line 108, in main args.func(args) File "/home/hkfrank/Applications/miniconda3/envs/igdiscover/lib/python3.6/site-packages/igdiscover/igblast.py", line 375, in main raw_output=raw_output, use_cache=use_cache): File "/home/hkfrank/Applications/miniconda3/envs/igdiscover/lib/python3.6/site-packages/igdiscover/igblast.py", line 344, in igblast for igblast_output, igblast_records in pool.imap(runner, chunks, chunksize=1): File "/home/hkfrank/Applications/miniconda3/envs/igdiscover/lib/python3.6/site-packages/igdiscover/utils.py", line 159, in imap yield func(i) File "/home/hkfrank/Applications/miniconda3/envs/igdiscover/lib/python3.6/site-packages/igdiscover/igblast.py", line 228, in call records = list(parser) File "/home/hkfrank/Applications/miniconda3/envs/igdiscover/lib/python3.6/site-packages/igdiscover/parse.py", line 631, in iter yield self._parse_record(record_lines, fasta_record) File "/home/hkfrank/Applications/miniconda3/envs/igdiscover/lib/python3.6/site-packages/igdiscover/parse.py", line 726, in _parse_record junction=junction) File "/home/hkfrank/Applications/miniconda3/envs/igdiscover/lib/python3.6/site-packages/igdiscover/parse.py", line 283, in init self.hits['V'] = self._fixed_v_hit() File "/home/hkfrank/Applications/miniconda3/envs/igdiscover/lib/python3.6/site-packages/igdiscover/parse.py", line 395, in _fixed_v_hit reference = self._database.v[hit.subject_id] KeyError: 'gb|GQ923684|'

marcelm commented 5 years ago

Hannah, do you think you could make the data available to me so I can try to reproduce the problem myself? This would be the fastest way to find where the problem is. I would need the reads, the input database, and your igdiscover.yaml file if you have modified it. My e-mail adress is in my GitHub profile. If you cannot share the reads, the database file would be some help as well. If you cannot share that either, one way forward would be that you install the development version of IgDiscover and see whether the error persists.

(For reference: The previous issue was #97.)

To answer your question: The size of the database should not be an issue. Also, N bases shouldn’t be a problem and even the messages about out-of-frame regions should not make the program crash – at least that was the intention. Only dots and spaces are problematic in the IgDiscover version that you use.

marcelm commented 5 years ago

Thanks for sending me the data (I have deleted it now).

As it turns out, IgBLAST (or rather BLAST, on which it is based) attempts to recognize if a sequence identifier is an accession and then modifies it to fit into its naming scheme. So even if the sequence is named “GQ923684” in the database FASTA file, a match against “gb|GQ923684|” is reported by IgBLAST. Because IgDiscover cannot handle this, it then crashes.

To work around this, you will need to modify the GQ accessions in your FASTA file to not look like accessions anymore. I used search and replace to change >GQ into >gb_GQ, which worked.

I’ll try to find a proper solution within IgDiscover so this is handled without needing to change the sequence identifiers.

marcelm commented 5 years ago

This problem is now fixed in the development version of IgDiscover. In the next release, you will no longer manually have to modify sequence ids.