NBISweden / IgDiscover-legacy

Analyze antibody repertoires and discover new V genes from high-throughput sequencing reads
https://www.igdiscover.se
MIT License
17 stars 10 forks source link

Complain if the input reads contain non-nucleotide characters/spaces #114

Closed bridgelan closed 10 months ago

bridgelan commented 3 years ago

It seems that "igdiscover dereplicate" is not processing any sequence. I installed IgDiscover through Conda in a Ubuntu 18.04 LTS. The log printed on Terminal and my configuration file are attached here.

igdiscover.yaml.txt Terminal_IgDiscoverError.txt

The error seems to be here:

`igdiscover dereplicate --json=stats/groups.json --minimum-length=300 reads/4-trimmed.fasta.gz | pigz > reads/sequences.fasta.gz INFO: 0 sequences processed INFO: 0 sequences long enough INFO: 0 dereplicated sequences written INFO: CPU time 00:00:01.3. Maximum memory usage 0.200 GB [Sun Jul 4 21:17:48 2021] Finished job 45. 1 of 44 steps (2%) done

[Sun Jul 4 21:17:48 2021] rule igdiscover_igblast: input: reads/sequences.fasta.gz, iteration-01/database/V.fasta, iteration-01/database/D.fasta, iteration-01/database/J.fasta output: iteration-01/assigned.tab.gz, iteration-01/stats/assigned.json jobid: 52 wildcards: dir=iteration-01 threads: 8

time igdiscover igblast --sequence-type=Ig --rename 'testV3D0M'_ --threads=8 --stats=iteration-01/stats/assigned.json iteration-01/database reads/sequences.fasta.gz | pigz > iteration-01/assigned.tab.gz INFO: Running IgBLAST on database sequences to find CDR/FR region locations INFO: Running IgBLAST on input reads Traceback (most recent call last): File "/home/bridgelan/miniconda3/envs/igdiscover/bin/igdiscover", line 10, in sys.exit(main()) File "/home/bridgelan/miniconda3/envs/igdiscover/lib/python3.8/site-packages/igdiscover/main.py", line 95, in main to_run() File "/home/bridgelan/miniconda3/envs/igdiscover/lib/python3.8/site-packages/igdiscover/main.py", line 93, in to_run = lambda: subcommand(args) File "/home/bridgelan/miniconda3/envs/igdiscover/lib/python3.8/site-packages/igdiscover/cli/igblast.py", line 414, in main logger.info('Processed {:10,d} sequences at {:.1f} ms/sequence'.format(n, elapsed / n * 1E3)) ZeroDivisionError: float division by zero

real 0m7,959s user 0m7,799s sys 0m0,575s [Sun Jul 4 21:17:56 2021] Error in rule igdiscoverigblast: jobid: 52 output: iteration-01/assigned.tab.gz, iteration-01/stats/assigned.json shell: time igdiscover igblast --sequence-type=Ig --rename 'testV3D0M' --threads=8 --stats=iteration-01/stats/assigned.json iteration-01/database reads/sequences.fasta.gz | pigz > iteration-01/assigned.tab.gz (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job igdiscover_igblast since they might be corrupted: iteration-01/assigned.tab.gz Shutting down, this might take some time. Exiting because a job execution failed. Look above for error message Complete log: /home/bridgelan/Documentos/testV3D0M/.snakemake/log/2021-07-04T211746.710417.snakemake.log Total CPU time: 0h 0.17m ERROR:

Before this error, I downloaded the most recent human database from IMGT, and I removed from the .fasta files all the alleles not present in the "human.ndm.imgt" file. I tried to run the test set with this database, and it worked.

The structure of my Ig sequences is: forward barcode - IGHV primer (beginning of V-REGION) - VDJC - IGHC primer (beginning of CH1 exon) - reverse barcode. I assembled the sequences and trimmed the barcodes and the primer regions before the IgDiscover run. As a precaution, I also inserted the primer sequences into the configuration file. Then, the message about trimming problems appeared, and I created the ./stats/trimming-successful file. Then, this error appeared.

I've analyzed this file with IgBlast before, and it seems non-corrupted: it detected 90.000 productive sequences in it. Is this too small for IgDiscover? And is this the error cause?

Thanks in advance.

marcelm commented 3 years ago

Thanks for providing a full log file; that makes it easier to find out what is going on!

The problem appears to be the primer trimming step in IgDiscover. IgDiscover uses Cutadapt to remove them and it also uses the --discard-untrimmed option which makes it throw away all reads that do not contain a primer. Since you have already removed primer sequences, all reads are thrown away. You can see that in the log file:

Total reads processed:                 281,290
Reads with adapters:                         0 (0.0%)
Reads written (passing filters):             0 (0.0%)

Creating the stats/trimming-succesful file by hand unfortunately cannot solve the problem because IgDiscover will then just try to continue with an empty file.

IgDiscover should work correctly if you just give it the unprocessed reads. It will then merge paired-end reads and remove primers. If you want to continue to do these steps yourself, you need to not provide any primers. That is, give an empty list of primers in for the forward_primers: and reverse_primers: settings in the configuration.

marcelm commented 3 years ago

I've analyzed this file with IgBlast before, and it seems non-corrupted: it detected 90.000 productive sequences in it. Is this too small for IgDiscover?

It’s a bit low compared to what we had available when developing IgDiscover (roughly one million input reads, I don’t remember how many of those were productive), but this means only that you may not be able to discover some rarely expressed germline sequences. You should definitely get some results.

bridgelan commented 3 years ago

Thanks for the really fast reply! Just to confirm if I understood it well:

And the empty list should have some empty line, like

forward_primers:
- 
reverse_primers:
- 

or is completely empty, like

forward_primers:
reverse_primers:

?

bridgelan commented 3 years ago

I ran it again, with the "completely empty" option, and it seems the cutadapt problem was solved:

igdiscover dereplicate --json=stats/groups.json --minimum-length=300 reads/4-trimmed.fasta.gz | pigz > reads/sequences.fasta.gz
INFO: 281290 sequences processed
INFO: 157215 sequences long enough
INFO: 98092 dereplicated sequences written
INFO: CPU time 00:00:03.1. Maximum memory usage 0.323 GB

But now, a new error appeared, haha.

time igdiscover igblast --sequence-type=Ig --rename 'testV3D0M'_ --threads=8 --stats=iteration-01/stats/assigned.json iteration-01/database reads/sequences.fasta.gz | pigz > iteration-01/assigned.tab.gz
INFO: Running IgBLAST on database sequences to find CDR/FR region locations
INFO: Running IgBLAST on input reads
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/bridgelan/miniconda3/envs/igdiscover/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/bridgelan/miniconda3/envs/igdiscover/lib/python3.8/site-packages/igdiscover/cli/igblast.py", line 226, in __call__
    records = list(parser)
  File "/home/bridgelan/miniconda3/envs/igdiscover/lib/python3.8/site-packages/igdiscover/parse.py", line 706, in __iter__
    yield self._parse_record(record_lines, fasta_record)
  File "/home/bridgelan/miniconda3/envs/igdiscover/lib/python3.8/site-packages/igdiscover/parse.py", line 786, in _parse_record
    assert qsequence == full_sequence[hit.query_start:hit.query_start+len(qsequence)]
AssertionError
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/bridgelan/miniconda3/envs/igdiscover/bin/igdiscover", line 10, in <module>
    sys.exit(main())
  File "/home/bridgelan/miniconda3/envs/igdiscover/lib/python3.8/site-packages/igdiscover/__main__.py", line 95, in main
    to_run()
  File "/home/bridgelan/miniconda3/envs/igdiscover/lib/python3.8/site-packages/igdiscover/__main__.py", line 93, in <lambda>
    to_run = lambda: subcommand(args)
  File "/home/bridgelan/miniconda3/envs/igdiscover/lib/python3.8/site-packages/igdiscover/cli/igblast.py", line 392, in main
    for record in igblast(database, sequences, sequence_type=args.sequence_type,
  File "/home/bridgelan/miniconda3/envs/igdiscover/lib/python3.8/site-packages/igdiscover/cli/igblast.py", line 359, in igblast
    for igblast_output, igblast_records in pool.imap(runner, chunks, chunksize=1):
  File "/home/bridgelan/miniconda3/envs/igdiscover/lib/python3.8/multiprocessing/pool.py", line 868, in next
    raise value
AssertionError
Error memory mapping:/tmp/tmp18lnhlkp/J.nsq openedFilesCount=30 threadID=2
Error memory mapping:/tmp/tmp18lnhlkp/D.nhr openedFilesCount=29 threadID=2
WORKER: T2 BATCH # 36 CEXCEPTION: Attempt to access NULL pointer.
WORKER: T2 BATCH # 36 CEXCEPTION: Attempt to access NULL pointer.
Error memory mapping:/tmp/tmp18lnhlkp/V.nhr openedFilesCount=33 threadID=1
WORKER: T1 BATCH # 36 CEXCEPTION: Cannot memory map /tmp/tmp18lnhlkp/V.nhr. Number of files opened: 33
Error memory mapping:/tmp/tmp18lnhlkp/V.nhr openedFilesCount=33 threadID=1
WORKER: T1 BATCH # 36 CEXCEPTION: Cannot memory map /tmp/tmp18lnhlkp/V.nhr. Number of files opened: 33
Error memory mapping:/tmp/tmp18lnhlkp/V.nhr openedFilesCount=33 threadID=1
WORKER: T1 BATCH # 36 CEXCEPTION: Cannot memory map /tmp/tmp18lnhlkp/V.nhr. Number of files opened: 33

real    0m48,464s
user    1m27,708s
sys 0m0,974s
[Mon Jul  5 10:16:17 2021]
Error in rule igdiscover_igblast:
    jobid: 52
    output: iteration-01/assigned.tab.gz, iteration-01/stats/assigned.json
    shell:
        time igdiscover igblast --sequence-type=Ig --rename 'testV3D0M'_ --threads=8 --stats=iteration-01/stats/assigned.json iteration-01/database reads/sequences.fasta.gz | pigz > iteration-01/assigned.tab.gz
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job igdiscover_igblast since they might be corrupted:
iteration-01/assigned.tab.gz
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/bridgelan/Documentos/testV3D0M/.snakemake/log/2021-07-05T101517.688811.snakemake.log
Total CPU time: 0h 2.00m
ERROR: 

The problem seems to be coming from genes not annotated in "human.ndm.imgt", correct? Or does it have to do with the database I'm using? Should I open a new issue for this?

marcelm commented 3 years ago

I’m not actually at work at the moment, so I don’t have time for fully debugging this right now, perhaps tonight. In any case, you can ignore the second part of the stacktrace (the part where you got the WORKER: ... messages), that is just an error that IgBLAST throws because it got interrupted due to the actual error. IgDiscover runs IgBLAST with a custom database, so whether things are annotated in human.ndm.imgt or not does not matter, so that shouldn’t be the problem.

The problem is here:

assert qsequence == full_sequence[hit.query_start:hit.query_start+len(qsequence)]

and I will need to look into the code to understand why that assertion is there.

If possible, it would be helpful if I had access to the contents of the iteration-01/database/ folder and your reads/sequences.fasta.gz. You can send the files to me by e-mail (my address is on my GitHub profile), and I promise not to share the data and to delete the files as soon as possible.

You don’t need to open a new issue at the moment.

bridgelan commented 3 years ago

I sent you the Google Drive to the folder containing the data you requested.

I decided to sent you the full folder, in the case it's necessary to have them.

I'm really thankful for your help, Marcel!

At the same time, I'll annotate this sequences with IgBlast to confirm the data integrity again.

marcelm commented 3 years ago

Thanks a lot for the files, this has made it a lot easier to find out what is going on.

The problem is that your input reads contain spaces within the sequence. IgBLAST ignores them, but IgDiscover does not. As a sanity check, IgDiscover tests whether the query sequence that it sends to IgBLAST is the same as the one reported back from IgBLAST, and that fails because of the extra space character.

May I ask which tool generated that file? If that is a commonly used tool, it may be worth supporting spaces in reads in IgDiscover (but it’s unlikely I have the time to implement this anytime soon).

For now, I suggest you either run the tool that you used in such a way that it does not insert the spaces, or that you manually remove the spaces from the files that you already have. Here’s a sed command for this (that only works if each FASTA record occupies exactly two lines):

zcat reads.fasta.gz | sed '2~2s/ //' | pigz > reads-nospace.fasta.gz
bridgelan commented 3 years ago

The tool is not commonly used, it's a specific piece of a VDJ pipeline code that I'm using to correct the sequences through barcodes, coming from this article.

I will reexamine the input data and the pipeline code to understand what's inserting these spaces. I don't think it's worth your time to support this feature in IgDiscover, haha: I'm afraid that this correction script is generating unwanted modifications in the sequences. It might be interesting to raise a specific Python error in this case, but I don't know if it's a priority, space characters are really unexpected in the middle of a FASTA sequence line (until now, haha).

Marcel, I'm sincerely thankful for your time to help me, man. Keep up the good work!

marcelm commented 3 years ago

You’re welcome! I will leave this issue open for now until I have made a decision of whether IgDiscover should complain about spaces in the input sequences.

marcelm commented 10 months ago

This repository is outdated and is going to be archived. Please see the new repository at https://gitlab.com/gkhlab/igdiscover22/ or the homepage at https://www.igdiscover.se/ for the most recent and maintained IgDiscover version.