OLC-Bioinformatics / ConFindr

Intra-species bacterial contamination detection
https://olc-bioinformatics.github.io/ConFindr/
MIT License

Wrong phred score base with low quality Nanopore data #39

Closed duceppemo closed 1 year ago

duceppemo commented 1 year ago

I'm trying out ConFindr with Nanopore data right now and get an error from bbduk_bait.

Traceback (most recent call last):
  File "/home/bioinfo/miniconda3/envs/confindr/lib/python3.7/site-packages/confindr_src/confindr.py", line 1067, in confindr
    fasta=args.fasta)
  File "/home/bioinfo/miniconda3/envs/confindr/lib/python3.7/site-packages/confindr_src/confindr.py", line 647, in find_contamination
    returncmd=True, threads=threads)
  File "/home/bioinfo/miniconda3/envs/confindr/lib/python3.7/site-packages/confindr_src/wrappers/bbtools.py", line 258, in bbduk_bait
    out, err = run_subprocess(cmd)
  File "/home/bioinfo/miniconda3/envs/confindr/lib/python3.7/site-packages/confindr_src/wrappers/bbtools.py", line 16, in run_subprocess
    raise subprocess.CalledProcessError(x.returncode, cmd=command)
subprocess.CalledProcessError: Command 'bbduk.sh in=/home/bioinfo/analyses/bacillus_gta_cfia_run1/filtered/BA-12-InfantFormula_filtered.fastq.gz outm=/home/bioinfo/analyses/bacillus_gta_cfia_run1/qc/confindr/BA-12-InfantFormula_filtered/trimmed.fastq.gz ref=/media/36tb/db/rMLST_2022-08-10/Bacillus_db.fasta threads=48' returned non-zero exit status 1.

Running the bbduk command directly to get a more explicit error message gave:

Input is being processed as unpaired
Started output streams: 0.039 seconds.
Changed from ASCII-33 to ASCII-64 on input Z: 90 -> 59

The ASCII quality encoding offset (64) is not set correctly, or the reads are corrupt; quality value below -5.
Please re-run with the flag 'qin=33', 'ignorebadquality', or '-da'.
Problematic read number 0:
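
As a rough way to confirm what the message above reports, the quality lines of a few reads can be checked directly. Under an offset of 64 with a minimum score of -5, the lowest legal quality character has ASCII code 59 (`;`), so any character below that cannot come from a 64-based file. The sketch below assumes the quality strings have already been pulled out of the FASTQ file; the function names are hypothetical and are not part of ConFindr or BBTools.

```python
def quality_code_range(quality_lines):
    """Return (lowest, highest) ASCII code seen across the quality strings."""
    codes = [ord(char) for line in quality_lines for char in line.strip()]
    return min(codes), max(codes)


def rules_out_offset_64(quality_lines):
    """True if any quality character is too low to be valid with offset 64.

    With offset 64 and a minimum score of -5, the lowest valid code is 59.
    """
    lowest, _ = quality_code_range(quality_lines)
    return lowest < 59


# '!' has ASCII code 33, which is only valid with a 33-based offset.
print(rules_out_offset_64(["!''*((((***+", "5555ZZZZ"]))
```

If this returns True for the reads in question, passing `qin=33` explicitly, as the bbduk message suggests, is the appropriate setting.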

The error occurs because the quality scores are low and bbduk automatically sets the Phred score offset to 64 instead of 33. Adding an option to set the Phred encoding might resolve the issue. Alternatively, you could simply hard-code qin=33 in all the bbduk commands, since most usable sequencing data uses a 33-based Phred score.
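
A minimal sketch of the second suggestion, assuming the command is assembled as a string the way the failing command in the traceback appears to be. The function name and its parameters are hypothetical and do not reflect ConFindr's actual `bbduk_bait` signature; only the `in=`, `outm=`, `ref=`, `threads=` and `qin=` arguments are taken from the command and error message above.

```python
import shlex


def build_bbduk_bait_cmd(reads, baited_out, reference, threads=1, qin=33):
    """Assemble a bbduk.sh baiting command with an explicit quality offset.

    Hypothetical helper for illustration. Passing qin=None omits the flag and
    leaves bbduk to auto-detect the encoding, which is what failed above.
    """
    parts = [
        "bbduk.sh",
        f"in={reads}",
        f"outm={baited_out}",
        f"ref={reference}",
        f"threads={threads}",
    ]
    if qin is not None:
        parts.append(f"qin={qin}")
    # Quote each argument so paths containing spaces survive the shell.
    return " ".join(shlex.quote(part) for part in parts)


print(build_bbduk_bait_cmd("reads.fastq.gz", "trimmed.fastq.gz", "db.fasta", threads=4))
```

Keeping `qin` as a parameter with a default of 33 covers both suggestions at once: the default behaves like the hard-coded value, while a caller with 64-based data could still override it.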

pcrxn commented 1 year ago

Hi @duceppemo, thanks for pointing this out. We'll implement one of those two options in the upcoming version of ConFindr (probably the hard-coding).

pcrxn commented 1 year ago

Resolved by ac3b976.