Closed jonathho closed 1 year ago
Hi @jonathho,
For debugging, could you please include:
If you haven't already, could you also please re-run ConFindr on your samples with --verbosity debug
and include the error?
If you're also able to attach some minimal FASTQ data of ConFindr failing on your samples, this could also be helpful.
Thanks,
Liam
Hi @pcrxn,
Thanks for getting back so quick.
I installed ConFindr in a virtual environment with pip
along with downloading and adding the dependencies to my PATH.
The command I used to run was:
confindr --threads 8 \
--rmlst \
-d /path/to/confindr_db \
-i /path/to/fastqDir \
-o /path/to/outputDir
Running with the debug verbosity gives:
Traceback (most recent call last):
File "/project/6048996/virtual_env/confindr/bin/confindr", line 8, in <module>
sys.exit(main())
File "/project/6048996/virtual_env/confindr/lib/python3.10/site-packages/confindr_src/confindr.py", line 243, in main
confindr(args)
File "/project/6048996/virtual_env/confindr/lib/python3.10/site-packages/confindr_src/confindr.py", line 87, in confindr
find_contamination(pair=fastq,
File "/project/6048996/virtual_env/confindr/lib/python3.10/site-packages/confindr_src/methods.py", line 1489, in find_contamination
multibase_dict, report_write = read_contig(
File "/project/6048996/virtual_env/confindr/lib/python3.10/site-packages/confindr_src/methods.py", line 842, in read_contig
filtered_reads, qualities = characterise_read(column=column,
File "/project/6048996/virtual_env/confindr/lib/python3.10/site-packages/confindr_src/methods.py", line 364, in characterise_read
quality = fastq_records[read_name].letter_annotations["phred_quality"][read.query_position]
KeyError: 'HISEQ102_308:1:1110:15625:64703/1/2'
Unfortunately I don't think I can provide a useful fastq sample as I get a different error when trying down-sized versions of my data. Please let me know if there's anything else that could be helpful to you though!
Cheers!
Ok, thanks @jonathho! Which version of Biopython is in your ConFindr virtual environment?
You can obtain this with the following Python code after activating your environment:
import Bio
print(Bio.__version__)
It is on biopython==1.81
.
It may be worth noting that the error is reproducible when running other similar samples, I get the same key error, just holding a different value. So instead of KeyError: 'HISEQ102_308:1:1110:15625:64703/1/2'
I will get KeyError: 'HISEQ102_308:1:1115:27580:61116/1/2'
Thank you!
Hi @jonathho,
I just tested ConFindr v0.8.1 when installed within a Python virtual environment, and I've confirmed that it's working with both the test data and some other genomes that I had lying around. I suspect that the issue may be related to the formatting of the headers in your FASTQ files, but I can't be sure without seeing them, unfortunately.
Could you perhaps paste the FASTQ records (forward and reverse) which correspond to the sequence raised by the KeyError (HISEQ102_308:1:1110:15625:64703/1/2)? I'd just like to see how the complete header looks.
Thanks!
Sure thing.
@HISEQ102_308:1:1110:15625:64703/1 1:N:0:ATTCGAGG+ACAAGCTC
@HISEQ102_308:1:1110:15625:64703/2 2:N:0:ATTCGAGG+ACAAGCTC
Cheers!
Hey @jonathho, the issue is due to the index sequences being present in the headers of your FASTQ files (1:N:0:ATTCGAGG+ACAAGCTC
for R1 and 2:N:0:ATTCGAGG+ACAAGCTC
for R2)!
We can add a future fix to accommodate FASTQ files that have this format, but for now, you'll have to remove the index sequences from the headers manually. You should be able to do this using sed
. I would recommend making a backup of your files, and then use this to edit the headers in-place:
sed -i 's/\(@HISEQ.*\) 1:N:0:ATTCGAGG+ACAAGCTC/\1/g' *R1.fastq
sed -i 's/\(@HISEQ.*\) 2:N:0:ATTCGAGG+ACAAGCTC/\1/g' *R2.fastq
You could optionally skip the -i
flag in the commands above to print the results to STDOUT first, and then add -i
again once the headers look good to you.
Thank you so much @pcrxn!
I still got the KeyError, but after removing the /1
or /2
from the metadata line as well, ConFindr has been working for me across multiple samples I've tested. This does make my sequence identifier lines exactly the same between my read pairs, eg. @HISEQ102_308:1:1110:15625:64703
, but otherwise this is great.
Cheers!
Thank you so much @pcrxn!
I still got the KeyError, but after removing the
/1
or/2
from the metadata line as well, ConFindr has been working for me across multiple samples I've tested. This does make my sequence identifier lines exactly the same between my read pairs, eg.@HISEQ102_308:1:1110:15625:64703
, but otherwise this is great.Cheers!
You're welcome @jonathho! Happy ConFinding 👍
Hello,
I am able to successfully run confindr v0.8.1 with the test dataset, but when running with my own data I receive the following:
Any thoughts?