OLC-Bioinformatics / ConFindr

Intra-species bacterial contamination detection
https://olc-bioinformatics.github.io/ConFindr/
MIT License

Error when running ConFindr on unusual organisms #42

Closed: cjdgug closed this issue 1 year ago

cjdgug commented 1 year ago

Hi, I find that ConFindr with --rmlst works very nicely for me most of the time; however, when I try to use it on reads from more uncommon species, it encounters an error and skips that sample. I have noticed this for Neisseria cinerea, Corynebacterium pseudodiphtheriticum, Listeria grayi, and Klebsiella variicola, all genera for which ConFindr normally identifies samples to the genus level very well.

I am using ConFindr v0.7.4, Biopython 1.78, and Python 3.7.

The error encountered was:

Traceback (most recent call last):
  File "/home/rhh/miniconda3/envs/py37/lib/python3.7/site-packages/confindr_src/confindr.py", line 1067, in confindr
    cross_details=args.cross_details)
  File "/home/rhh/miniconda3/envs/py37/lib/python3.7/site-packages/confindr_src/confindr.py", line 629, in find_contamination
    returncmd=True)
  File "/home/rhh/miniconda3/envs/py37/lib/python3.7/site-packages/confindr_src/wrappers/bbtools.py", line 258, in bbduk_bait
    out, err = run_subprocess(cmd)
  File "/home/rhh/miniconda3/envs/py37/lib/python3.7/site-packages/confindr_src/wrappers/bbtools.py", line 16, in run_subprocess
    raise subprocess.CalledProcessError(x.returncode, cmd=command)
subprocess.CalledProcessError: Command 'bbduk.sh in=220722_fastq/QC11_S11_L001_R1_001.fastq.gz in2=220722_fastq/QC11_S11_L001_R2_001.fastq.gz outm=220722_confindr2/QC11_S11_L001/rmlst_R1.fastq.gz outm2=220722_confindr2/QC11_S11_L001/rmlst_R2.fastq.gz ref=/home/rhh/.confindr_db/rMLST_combined.fasta threads=4' returned non-zero exit status 1.

Any advice?

pcrxn commented 1 year ago

Hi @cjdgug, could you please run that command shown below and paste the output here?

bbduk.sh in=220722_fastq/QC11_S11_L001_R1_001.fastq.gz in2=220722_fastq/QC11_S11_L001_R2_001.fastq.gz outm=220722_confindr2/QC11_S11_L001/rmlst_R1.fastq.gz outm2=220722_confindr2/QC11_S11_L001/rmlst_R2.fastq.gz ref=/home/rhh/.confindr_db/rMLST_combined.fasta threads=4
cjdgug commented 1 year ago

Sure:

BBDuk version 37.62
Set threads to 4
Initial: Memory: max=866m, free=842m, used=24m

java.lang.OutOfMemoryError: Java heap space
    at kmer.AbstractKmerTable.allocLong1D(AbstractKmerTable.java:150)
    at kmer.HashArray1D.resize(HashArray1D.java:149)
    at kmer.HashArray.setIfNotPresent(HashArray.java:190)
    at jgi.BBDukF$LoadThread.addToMap(BBDukF.java:1924)
    at jgi.BBDukF$LoadThread.addToMap(BBDukF.java:1828)
    at jgi.BBDukF$LoadThread.run(BBDukF.java:1755)

This program ran out of memory. Try increasing the -Xmx flag and setting prealloc.

When I run the same command with -Xmx8g I get a different error:

BBDuk version 37.62
Set threads to 4
Initial: Memory: max=8232m, free=8060m, used=172m

/home/rhh/miniconda3/envs/py37/bin/bbduk.sh: line 324: 3410 Killed    java -Djava.library.path=/home/rhh/miniconda3/envs/py37/bbtools/lib/jni/ -ea -Xmx8g -Xms8g -cp /home/rhh/miniconda3/envs/py37/bbtools/lib/current/ jgi.BBDukF in=220722_fastq/QC11_S11_L001_R1_001.fastq.gz in2=220722_fastq/QC11_S11_L001_R2_001.fastq.gz outm=220722_confindr2/QC11_S11_L001/rmlst_R1.fastq.gz outm2=220722_confindr2/QC11_S11_L001/rmlst_R2.fastq.gz ref=/home/rhh/.confindr_db/rMLST_combined.fasta threads=4 -Xmx8g

I was also experiencing the same error on a bigger machine when it was running ConFindr v0.7.0, but I have now updated that one to v0.7.4 too. The bigger machine does not encounter the error, but it does still list the genus as "NA" in the report. It also gives me the rMLST and contamination output files, which this machine does not produce when it hits the error.

So I guess a workaround is not to use it on this machine, which has hardly any memory.

cjdgug commented 1 year ago

I still had this error on one of my samples on the larger machine, but this was a rubbish sample with hardly any reads. When I ran the bbduk.sh command that caused the error on its own, I got:

BBDuk version 37.62
Set threads to 24
Initial: Memory: max=25918m, free=25377m, used=541m

Added 846536239 kmers; time: 98.320 seconds.
Memory: max=24797m, free=10311m, used=14486m

Input is being processed as paired
Started output streams: 0.010 seconds.
Processing time: 0.009 seconds.

Input:          16 reads            2040 bases.
Contaminants:   2 reads (12.50%)    301 bases (14.75%)
Total Removed:  2 reads (12.50%)    301 bases (14.75%)
Result:         14 reads (87.50%)   1739 bases (85.25%)

Time:   98.350 seconds.
Reads Processed: 16      0.00k reads/sec
Bases Processed: 2040    0.00m bases/sec

pcrxn commented 1 year ago

Regarding the out-of-memory error on the smaller machine: it looks like Java is being killed because -Xmx8g requests 8 GB of memory, which is the same as the total memory available on that machine. Because of other processes running in the background, the program cannot actually get the full 8 GB and is killed. You could try running on that machine with -Xmx6g, for example, which might be enough memory for bbduk.sh to run without errors.
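As a sketch only, that could look something like this re-run of the bait command from the traceback with a lower heap request (the 6 GB value is just an example and may need tuning for your machine):

bbduk.sh -Xmx6g in=220722_fastq/QC11_S11_L001_R1_001.fastq.gz in2=220722_fastq/QC11_S11_L001_R2_001.fastq.gz outm=220722_confindr2/QC11_S11_L001/rmlst_R1.fastq.gz outm2=220722_confindr2/QC11_S11_L001/rmlst_R2.fastq.gz ref=/home/rhh/.confindr_db/rMLST_combined.fasta threads=4

If that still gets killed, lowering -Xmx further or adding prealloc=t (as BBDuk's own out-of-memory message suggests) might help.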

Regarding the rubbish sample: it looks like that bbduk.sh command completed successfully when you ran it on its own. When you run ConFindr itself on the machine with larger memory, what error does ConFindr give you?

cjdgug commented 1 year ago

Thanks! For the rubbish sample, the error is the same as the out-of-memory one (the same traceback through lines 1067, 629, 258, and 16, with bbduk returning a non-zero exit), except that when I run the offending command on its own it seems to work fine, so it is odd that ConFindr errors out rather than just giving an NA output. Obviously I shouldn't even need to run it for this sample, so this is no big deal.

For the others, it is still strange that the report will not list the genus-level ID, despite giving me a contamination status, 20K+ bases examined, and an rMLST profile output that I can use on the website to get a species-level ID. Again, this is no big deal, because I can work around it.

pcrxn commented 1 year ago

Hi @cjdgug, would you be able to attach a link to one or more of the problematic FASTQ file pairs?

cjdgug commented 1 year ago

This is not my own data, but I tried some reads from the SRA. This is a Listeria grayi run that has the same issue, i.e. it gives NA in the report for the genus ID but does identify to the genus and species level with rMLST: https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=SRR11343989&display=metadata

pcrxn commented 1 year ago

Hi @cjdgug, I'm sorry for the slow response!

I downloaded this genome and processed the FASTQ files using ConFindr v0.7.4. The genus is reported as ND because mash screen detects only 88/1000 shared-hashes between the L. grayi genome and L. monocytogenes, while ConFindr requires a minimum of 150 shared-hashes by default (this cutoff can be adjusted with the -m/--min_matching_hashes flag). If you run ConFindr with the -k flag, the intermediate files produced by the pipeline are kept, and you can inspect the 'screen.tab' file to see the number of shared-hashes being detected.
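As a rough example of how that could look (the directory names reuse the ones from your command above, the -m value of 80 is purely illustrative, and the executable may be named confindr or confindr.py depending on the install):

confindr.py -i 220722_fastq -o 220722_confindr2 --rmlst -k -m 80

With -k, the intermediate files should be kept under each sample's output folder, and the shared-hashes column of screen.tab (standard mash screen output) shows the counts being compared against the cutoff.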

We changed ConFindr to report contamination even when ND is reported for the genus, so that users can decide for themselves, with appropriate caution, whether to treat the reported SNVs as contamination. Hopefully this clarifies things!

pcrxn commented 1 year ago

Closing as a solution was provided.