OLC-Bioinformatics / ConFindr

Intra-species bacterial contamination detection
https://olc-bioinformatics.github.io/ConFindr/
MIT License
22 stars 8 forks source link

KeyError #52

Closed jonathho closed 1 year ago

jonathho commented 1 year ago

Hello,

I am able to successfully run confindr v0.8.1 with the test dataset, but when running with my own data I receive the following:

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.10.2/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.10.2/lib/python3.10/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/project/6048996/virtual_env/confindr/lib/python3.10/site-packages/confindr_src/methods.py", line 842, in read_contig
    filtered_reads, qualities = characterise_read(column=column,
  File "/project/6048996/virtual_env/confindr/lib/python3.10/site-packages/confindr_src/methods.py", line 364, in characterise_read
    quality = fastq_records[read_name].letter_annotations["phred_quality"][read.query_position]
KeyError: 'HISEQ102_308:1:1110:15625:64703/1/2'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/project/6048996/virtual_env/confindr/bin/confindr", line 8, in <module>
    sys.exit(main())
  File "/project/6048996/virtual_env/confindr/lib/python3.10/site-packages/confindr_src/confindr.py", line 243, in main
    confindr(args)
  File "/project/6048996/virtual_env/confindr/lib/python3.10/site-packages/confindr_src/confindr.py", line 87, in confindr
    find_contamination(pair=fastq,
  File "/project/6048996/virtual_env/confindr/lib/python3.10/site-packages/confindr_src/methods.py", line 1504, in find_contamination
    for multibase_dict, report_write in p.starmap(read_contig,
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.10.2/lib/python3.10/multiprocessing/pool.py", line 372, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.10.2/lib/python3.10/multiprocessing/pool.py", line 771, in get
    raise self._value
KeyError: 'HISEQ102_308:1:1110:15625:64703/1/2'

Any thoughts?

pcrxn commented 1 year ago

Hi @jonathho,

For debugging, could you please include:

If you haven't already, could you also please re-run ConFindr on your samples with --verbosity debug and include the error?

If you're also able to attach some minimal FASTQ data of ConFindr failing on your samples, this could also be helpful.

Thanks,

Liam

jonathho commented 1 year ago

Hi @pcrxn,

Thanks for getting back so quick.

I installed ConFindr in a virtual environment with pip along with downloading and adding the dependencies to my PATH. The command I used to run was:

confindr --threads 8 \
    --rmlst \
    -d /path/to/confindr_db \
    -i /path/to/fastqDir \
    -o /path/to/outputDir

Running with the debug verbosity gives:

Traceback (most recent call last):
  File "/project/6048996/virtual_env/confindr/bin/confindr", line 8, in <module>
    sys.exit(main())
  File "/project/6048996/virtual_env/confindr/lib/python3.10/site-packages/confindr_src/confindr.py", line 243, in main
    confindr(args)
  File "/project/6048996/virtual_env/confindr/lib/python3.10/site-packages/confindr_src/confindr.py", line 87, in confindr
    find_contamination(pair=fastq,
  File "/project/6048996/virtual_env/confindr/lib/python3.10/site-packages/confindr_src/methods.py", line 1489, in find_contamination
    multibase_dict, report_write = read_contig(
  File "/project/6048996/virtual_env/confindr/lib/python3.10/site-packages/confindr_src/methods.py", line 842, in read_contig
    filtered_reads, qualities = characterise_read(column=column,
  File "/project/6048996/virtual_env/confindr/lib/python3.10/site-packages/confindr_src/methods.py", line 364, in characterise_read
    quality = fastq_records[read_name].letter_annotations["phred_quality"][read.query_position]
KeyError: 'HISEQ102_308:1:1110:15625:64703/1/2'

Unfortunately I don't think I can provide a useful fastq sample as I get a different error when trying down-sized versions of my data. Please let me know if there's anything else that could be helpful to you though!

Cheers!

pcrxn commented 1 year ago

Ok, thanks @jonathho! Which version of Biopython is in your ConFindr virtual environment?

You can obtain this with the following Python code after activating your environment:

import Bio
print(Bio.__version__)
jonathho commented 1 year ago

It is on biopython==1.81.

It may be worth noting that the error is reproducible when running other similar samples, I get the same key error, just holding a different value. So instead of KeyError: 'HISEQ102_308:1:1110:15625:64703/1/2' I will get KeyError: 'HISEQ102_308:1:1115:27580:61116/1/2'

Thank you!

pcrxn commented 1 year ago

Hi @jonathho,

I just tested ConFindr v0.8.1 when installed within a Python virtual environment, and I've confirmed that it's working with both the test data and some other genomes that I had lying around. I suspect that the issue may be related to the formatting of the headers in your FASTQ files, but I can't be sure without seeing them, unfortunately.

Could you perhaps paste the FASTQ records (forward and reverse) which correspond to the sequence raised by the KeyError (HISEQ102_308:1:1110:15625:64703/1/2)? I'd just like to see how the complete header looks.

Thanks!

jonathho commented 1 year ago

Sure thing.

Cheers!

pcrxn commented 1 year ago

Hey @jonathho, the issue is due to the index sequences being present in the headers of your FASTQ files (1:N:0:ATTCGAGG+ACAAGCTC for R1 and 2:N:0:ATTCGAGG+ACAAGCTC for R2)!

We can add a future fix to accommodate FASTQ files that have this format, but for now, you'll have to remove the index sequences from the headers manually. You should be able to do this using sed. I would recommend making a backup of your files, and then use this to edit the headers in-place:

sed -i 's/\(@HISEQ.*\) 1:N:0:ATTCGAGG+ACAAGCTC/\1/g' *R1.fastq
sed -i 's/\(@HISEQ.*\) 2:N:0:ATTCGAGG+ACAAGCTC/\1/g' *R2.fastq

You could optionally skip the -i flag in the commands above to print the results to STDOUT first, and then add -i again once the headers look good to you.

jonathho commented 1 year ago

Thank you so much @pcrxn!

I still got the KeyError, but after removing the /1 or /2 from the metadata line as well, ConFindr has been working for me across multiple samples I've tested. This does make my sequence identifier lines exactly the same between my read pairs, eg. @HISEQ102_308:1:1110:15625:64703, but otherwise this is great.

Cheers!

pcrxn commented 1 year ago

Thank you so much @pcrxn!

I still got the KeyError, but after removing the /1 or /2 from the metadata line as well, ConFindr has been working for me across multiple samples I've tested. This does make my sequence identifier lines exactly the same between my read pairs, eg. @HISEQ102_308:1:1110:15625:64703, but otherwise this is great.

Cheers!

You're welcome @jonathho! Happy ConFinding 👍