OLC-Bioinformatics / ConFindr

Intra-species bacterial contamination detection
https://olc-bioinformatics.github.io/ConFindr/
MIT License

ConFindr :: problem with bbtools #30

EricDeveaud closed this issue 1 year ago

EricDeveaud commented 2 years ago

Hello, while running the example test set with bbtools version 38.01 and the --rmlst option on (dunno if this makes sense or not), I got the following error message:

Tested with bbtools versions bbmap/37.78 and bbmap/38.91.

2021-12-08 15:19:05  Encountered error when attempting to run ConFindr on sample example. Skipping... 
  2021-12-08 15:19:05  Error encounted was:
Traceback (most recent call last):
  File "/opt/gensoft/exe/ConFindr/0.7.4/venv/lib/python3.8/site-packages/confindr-0.7.4-py3.8.egg/confindr_src/confindr.py", line 1051, in confindr
    find_contamination(pair=fastq,
  File "/opt/gensoft/exe/ConFindr/0.7.4/venv/lib/python3.8/site-packages/confindr-0.7.4-py3.8.egg/confindr_src/confindr.py", line 623, in find_contamination
    out, err, cmd = bbtools.bbduk_bait(reference=sample_database,
  File "/opt/gensoft/exe/ConFindr/0.7.4/venv/lib/python3.8/site-packages/confindr-0.7.4-py3.8.egg/confindr_src/wrappers/bbtools.py", line 258, in bbduk_bait
    out, err = run_subprocess(cmd)
  File "/opt/gensoft/exe/ConFindr/0.7.4/venv/lib/python3.8/site-packages/confindr-0.7.4-py3.8.egg/confindr_src/wrappers/bbtools.py", line 16, in run_subprocess
    raise subprocess.CalledProcessError(x.returncode, cmd=command)
subprocess.CalledProcessError: Command 'bbduk.sh in=example-data/example_R1.fastq.gz in2=example-data/example_R2.fastq.gz outm=example-out/example/rmlst_R1.fastq.gz outm2=example-out/example/rmlst_R2.fastq.gz ref=/opt/gensoft/exe/ConFindr/0.7.4/share/Escherichia_db.fasta threads=56' returned non-zero exit status 1.

that was due to

rpm_maker:ConFindr/0.7.4 > bbduk.sh in=example-data/example_R1.fastq.gz in2=example-data/example_R2.fastq.gz outm=example-out/example/rmlst_R1.fastq.gz outm2=example-out/example/rmlst_R2.fastq.gz ref=/opt/gensoft/exe/ConFindr/0.7.4/share/Escherichia_db.fasta threads=56
java -ea -Xmx52354m -Xms52354m -cp /opt/gensoft/exe/bbmap/38.91/libexec/current/ jgi.BBDuk in=example-data/example_R1.fastq.gz in2=example-data/example_R2.fastq.gz outm=example-out/example/rmlst_R1.fastq.gz outm2=example-out/example/rmlst_R2.fastq.gz ref=/opt/gensoft/exe/ConFindr/0.7.4/share/Escherichia_db.fasta threads=56
Executing jgi.BBDuk [in=example-data/example_R1.fastq.gz, in2=example-data/example_R2.fastq.gz, outm=example-out/example/rmlst_R1.fastq.gz, outm2=example-out/example/rmlst_R2.fastq.gz, ref=/opt/gensoft/exe/ConFindr/0.7.4/share/Escherichia_db.fasta, threads=56]
Version 38.91

Set threads to 56
0.038 seconds.
Initial:
Memory: max=52610m, total=52610m, free=47943m, used=4667m

java.lang.Exception: 
An input file appears to be misformatted:
The character with ASCII code 39 appeared where a base was expected: '''
Sequence #0
Sequence ID: 'BACT000001_10671'

regards

Eric

adamkoziol commented 2 years ago

Are you able to take a look at /opt/gensoft/exe/ConFindr/0.7.4/share/Escherichia_db.fasta to confirm that the file hasn't been corrupted (or something else is wrong with it)? BACT000001_10671 is the first entry in the file, and should be 1674 bp long.
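A quick way to run that check without opening the file by hand is sketched below. It reads only the first record and reports its ID, its length, and any characters that are not nucleotide codes. The function name `first_record_report` and the `VALID_BASES` set are illustrative assumptions, not part of ConFindr.

```python
from pathlib import Path

# Characters accepted as bases: IUPAC nucleotide codes plus gap, either case.
# (Assumption for this sketch; adjust if your database uses other symbols.)
VALID_BASES = set("ACGTUNRYKMSWBDHVacgtunrykmswbdhv-")


def first_record_report(fasta_path):
    """Return (record_id, length, invalid_chars) for the first FASTA record."""
    record_id = None
    length = 0
    invalid = set()
    with Path(fasta_path).open() as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if record_id is not None:
                    break  # second header reached: first record is complete
                parts = line[1:].split()
                record_id = parts[0] if parts else ""
            elif record_id is not None:
                # Length counts every character on the sequence lines,
                # including any invalid ones collected below.
                length += len(line)
                invalid.update(ch for ch in line if ch not in VALID_BASES)
    return record_id, length, invalid
```

For an intact Escherichia_db.fasta, the expectation stated above is an ID of BACT000001_10671, a length of 1674, and an empty set of invalid characters.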

adamkoziol commented 2 years ago

Also, just to confirm, you do have the credentials required to download the rMLST databases as mentioned in the docs?

EricDeveaud commented 2 years ago

hummmm

rpm_maker:ConFindr/ConFindr-0.7.4 > head -n 3 /opt/gensoft/exe/ConFindr/0.7.4/share/Escherichia_db.fasta
>BACT000001_10671
b'ATGACTGAATCTTTTGCTCAACTCTTTGAAGAGTCCTTAAAAGAAATCGAAACCCGCC
CGGGTTCTATCGTTCGTGGCGTTGTTGTTGCTATCGACAAAGATGTAGTACTGGTTGACG

where does b' come from ?

EricDeveaud commented 2 years ago

yes RMLST already downloaded and available in CONFINDR_DB

rpm_maker:ConFindr/ConFindr-0.7.4 > ls $CONFINDR_DB
Escherichia_db.fasta                   Listeria_db_cgderived.fasta
Escherichia_db_cgderived.fasta         Salmonella_db_cgderived.fasta
Escherichia_db_cgderived.fasta.fai     download_date.txt
Escherichia_db_cgderived_kma.comp.b    gene_allele.txt
Escherichia_db_cgderived_kma.length.b  profiles.txt
Escherichia_db_cgderived_kma.name      rMLST_combined.fasta
Escherichia_db_cgderived_kma.seq.b     refseq.msh

adamkoziol commented 2 years ago

Bytes. I wonder if the encoding has changed by default in one of the downloading or formatting libraries used to create those files.
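For reference, this is how a literal b' can end up in text output in Python 3. Calling str() on a bytes object returns its repr rather than decoding it. The snippet is a generic illustration only; which call in the database setup code produces the bytes is not established at this point in the thread.

```python
sequence_bytes = b"ATGACTGAATCT"

# str() on a bytes object returns its repr, including the b'...' wrapper.
as_repr = str(sequence_bytes)

# Decoding yields the plain text that belongs in a FASTA file.
as_text = sequence_bytes.decode("ascii")

print(as_repr)  # b'ATGACTGAATCT'
print(as_text)  # ATGACTGAATCT
```

Written to a file and wrapped at a fixed width, the first form gives a sequence that starts with b' and ends with ', which matches the head output shown earlier.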

adamkoziol commented 2 years ago

Can you check rMLST_combined.fasta to see if it is also in bytes?

EricDeveaud commented 2 years ago

Yes, bytes in Python, but they should not appear in FASTA files. I'll try to check who's guilty. We can already skip the download step as a suspect, since the original files are OK.

Yes, rMLST_combined.fasta is also in bytes repr.

EricDeveaud commented 2 years ago

Which versions of Python and Biopython are you using? Here: Python 3.8.1 and biopython-1.79.

EricDeveaud commented 2 years ago

trying right now with biopython-1.78

adamkoziol commented 2 years ago

Yes.... now I know why this sounded familiar. I believe that issue #27 is related.

EricDeveaud commented 2 years ago

Yes, I have already patched line 209 of database_setup.py.

EricDeveaud commented 2 years ago

with biopython-1.78 no trouble
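A small guard along these lines could flag the risky combination before a database is built. It is a sketch under stated assumptions: the thread only establishes that 1.79 produced the b' output and 1.78 did not, so treating every version from 1.79 upward as suspect is a guess. The helper names are hypothetical, and `importlib.metadata` requires Python 3.8 or newer.

```python
from importlib.metadata import PackageNotFoundError, version


def version_tuple(text):
    """Turn '1.79' into (1, 79); stop at the first non-numeric component."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def biopython_may_corrupt_db(installed, first_bad="1.79"):
    """True if `installed` is at or above the version reported to fail.

    Assumption: versions newer than 1.79 behave like 1.79.
    """
    return version_tuple(installed) >= version_tuple(first_bad)


try:
    installed = version("biopython")
except PackageNotFoundError:
    installed = None

if installed is None:
    print("Biopython is not installed in this environment")
elif biopython_may_corrupt_db(installed):
    print(f"Biopython {installed}: databases built by ConFindr 0.7.4 may contain b'...' text")
else:
    print(f"Biopython {installed}: older than 1.79, the version reported to fail here")
```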

head -n 3 /opt/gensoft/exe/ConFindr/0.7.4/share/rMLST_combined.fasta 
>BACT000001_1
ATGGAAAATTTTGCTCAGCTGTTGGAAGAAAGCTTTACCCTGCAAGAAATGAACCCGGGT
GAGGTGATTACCGCTGAAGTAGTGGCAATCGACCAAAACTTCGTTACCGTAAACGCAGGT

waiting for Escherichia_db.fasta to be generated

Escherichia_db.fasta OK too

confindr.py -i example-data -o example-out --rmlst also ran successfully

Cryphonectria commented 2 years ago

Hi, I'm getting an error message similar to Eric's when running:

db="path_to_confindr_db"
confindr.py -i confindr_test -o out_test -d $db --rmlst -t 10 -Xmx 4g

Error message:

Traceback (most recent call last):
  File "/home/schlae0003/st0001/mambaforge/envs/confindr/lib/python3.7/site-packages/confindr_src/confindr.py", line 1067, in confindr
    fasta=args.fasta)
  File "/home/schlae0003/st0001/mambaforge/envs/confindr/lib/python3.7/site-packages/confindr_src/confindr.py", line 638, in find_contamination
    returncmd=True)
  File "/home/schlae0003/st0001/mambaforge/envs/confindr/lib/python3.7/site-packages/confindr_src/wrappers/bbtools.py", line 258, in bbduk_bait
    out, err = run_subprocess(cmd)
  File "/home/schlae0003/st0001/mambaforge/envs/confindr/lib/python3.7/site-packages/confindr_src/wrappers/bbtools.py", line 16, in run_subprocess
    raise subprocess.CalledProcessError(x.returncode, cmd=command)
subprocess.CalledProcessError: Command 'bbduk.sh in=confindr_test/F207_R1.fastq.gz in2=confindr_test/F207_R2.fastq.gz outm=out_test/F207/rmlst_R1.fastq.gz outm2=out_test/F207/rmlst_R2.fastq.gz ref=/home/schlae0003/GROUP/taxanomy_databases/confindr_db/rMLST_combined.fasta threads=10 Xmx=4g' returned non-zero exit status 1. 

However the confindr_log.txt says:

Command used: mash screen /home/schlae0003/GROUP/taxanomy_databases/confindr_db/refseq.msh confindr_test/F207_R1.fastq.gz confindr_test/F207_R2.fastq.gz  -p 10  -w  -i 0.85 | sort -gr > out_test/F207/screen.tab

STDERR: Loading /home/schlae0003/GROUP/taxanomy_databases/confindr_db/refseq.msh...
   1023303 distinct hashes.
Streaming from 2 inputs...
   Estimated distinct k-mers in mixture: 59041145
Summing shared...
Reallocating to winners...
Computing coverage medians...
Writing output...

refseq.msh seems to be a binary file in my case... I'm using confindr 0.7.4, mash 2.3 and BBMap 38.45

My confindr_db looks like this

download_date.txt 
gene_allele.txt 
profiles.txt 
rMLST_combined.fasta
Escherichia_db_cgderived.fasta 
Listeria_db_cgderived.fasta 
refseq.msh 
Salmonella_db_cgderived.fasta

Thanks, Lea

adamkoziol commented 2 years ago

I believe refseq.msh should be a binary file.

What versions of BioPython and Python are you using?

Can you run the BBduk command separately to see if there's any useful output? bbduk.sh in=confindr_test/F207_R1.fastq.gz in2=confindr_test/F207_R2.fastq.gz outm=out_test/F207/rmlst_R1.fastq.gz outm2=out_test/F207/rmlst_R2.fastq.gz ref=/home/schlae0003/GROUP/taxanomy_databases/confindr_db/rMLST_combined.fasta threads=10 Xmx=4g
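If running it from Python is more convenient, the sketch below runs the same command and returns its output instead of raising, so whatever BBDuk prints is visible. `run_and_capture` is a hypothetical helper written for this example; it is not ConFindr's `run_subprocess`.

```python
import shutil
import subprocess

# Command copied from the traceback in this thread.
BBDUK_CMD = (
    "bbduk.sh in=confindr_test/F207_R1.fastq.gz in2=confindr_test/F207_R2.fastq.gz "
    "outm=out_test/F207/rmlst_R1.fastq.gz outm2=out_test/F207/rmlst_R2.fastq.gz "
    "ref=/home/schlae0003/GROUP/taxanomy_databases/confindr_db/rMLST_combined.fasta "
    "threads=10 Xmx=4g"
)


def run_and_capture(cmd):
    """Run a shell command; return (returncode, stdout, stderr) without raising."""
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr


# Only attempt the real command when bbduk.sh is actually on PATH.
if __name__ == "__main__" and shutil.which("bbduk.sh"):
    code, out, err = run_and_capture(BBDUK_CMD)
    print(f"exit status: {code}")
    print(out)
    print(err)
```

Printing both streams avoids having to know which one the tool writes its diagnostics to.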

Thanks, A

Cryphonectria commented 2 years ago

I'm using Python=3.7 and BioPython=1.78

When running the BBduk command separately, I realised that there was a memory issue. Normally this would give me a core dump or "out of memory" message on the cluster, but for some reason it didn't... I was able to fix the problem by going overboard with memory: threads=2 Xmx=64g (I'm working with bacteria, fastq.gz files around 500M). Now confindr also runs without errors.

Thanks!!!

pcrxn commented 1 year ago

Fixed by 19d0d1d in v0.8.1.