MrOlm / drep

Rapid comparison and dereplication of genomes

Python error at the dereplicate step #29

Closed Rafael12692 closed 6 years ago

Rafael12692 commented 6 years ago

Hi. First of all, thanks for developing such a useful program! I'm trying to run the dereplicate step but the checkM step is failing:

$ dRep dereplicate outdR/ -g Final/*.fasta


..:: dRep dereplicate Step 1. Filter ::..

Will filter the genome list

Calculating genome info of genomes
100.00% of genomes passed length filtering
Running prodigal
Past prodigal runs found- will not re-run
Running checkM
!!! checkM failed !!!
If using pyenv, make sure both python2 and python3 are available (for example: pyenv global 3.5.1 2.7.9)

However, I have already set the pyenv global parameter:

$ pyenv global 3.5.1 2.7.9

Everything else looks fine:

$ dRep bonus testDir --check_dependencies
Loading work directory
Checking dependencies
mash.................................... all good (location = /usr/local/bin/mash)
nucmer.................................. all good (location = /usr/bin/nucmer)
checkm.................................. all good (location = /usr/local/bin/checkm)
ANIcalculator........................... all good (location = /usr/bin/ANIcalculator_v1/ANIcalculator)
prodigal................................ all good (location = /home/linuxbrew/.linuxbrew/bin/prodigal)
centrifuge.............................. all good (location = /usr/local/bin/centrifuge)

Any idea what might be causing this problem? I don't have much experience in bioinformatics, so this is giving me quite a headache.

MrOlm commented 6 years ago

Hello,

This is very interesting... I've never encountered this before, where checkM says it's working with --check_dependencies and then fails when running the program.

A couple of things to try. First, check the log file (log/logger.log inside the work directory). That file contains the exact command that dRep tried to run with checkM. Try running that command on your own.

Second, next time you run the program, run it with the -d parameter. This will produce the actual output that checkM gave when failing, which will be useful for troubleshooting.

Finally, what does the help show if you just run checkM -h?
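For anyone following the first suggestion above: with debugging on, the dRep log records each checkM call as a Python list on a line containing "Running CheckM with command:", so the command can be pulled out and turned into something that can be pasted into a shell. A minimal sketch, assuming that log format; extract_checkm_commands is a hypothetical helper, not part of dRep:

```python
import ast
import shlex

MARKER = "Running CheckM with command: "


def extract_checkm_commands(log_text):
    """Return one shell-ready string per CheckM call recorded in the log text."""
    commands = []
    for line in log_text.splitlines():
        _, found, tail = line.partition(MARKER)
        if not found:
            continue
        # The command is logged as a Python list literal; keep only the list.
        end = tail.find("]")
        if end == -1:
            continue
        argv = ast.literal_eval(tail[: end + 1])
        # Quote each argument so paths with spaces survive copy and paste.
        commands.append(" ".join(shlex.quote(arg) for arg in argv))
    return commands


if __name__ == "__main__":
    # Made-up sample lines in the assumed format, not a real log.
    sample = (
        "04-12 20:52 INFO Running checkM\n"
        "04-12 20:52 DEBUG Running CheckM with command: "
        "['/usr/local/bin/checkm', 'qa', '/tmp/out/lineage.ms', '/tmp/out/', '-t', '3']\n"
    )
    for command in extract_checkm_commands(sample):
        print(command)
```

Running the printed command by hand shows checkM's own error output directly, instead of dRep's generic failure message.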

Best, -Matt

Rafael12692 commented 6 years ago

Hello Matt,

Thanks for the answer and for helping me. I'm sending the outputs from your suggestions. I have a feeling that the solution to my problem might be in item 2.2:

1) This is what is written in the log file (I removed the names of the FASTA files to make it shorter):

04-12 20:52 DEBUG !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
04-12 20:52 DEBUG Logger started up at /home/linuxbrew/.linuxbrew/bin/outdR/log/logger.log
04-12 20:52 DEBUG Command to run dRep was: /home/rvpopin/.pyenv/versions/3.5.1/bin/dRep dereplicate outdR/ --debug -p 3 -g Final/*.fasta
04-12 20:52 DEBUG dRep version 2.0.5 was run
04-12 20:52 DEBUG !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
04-12 20:52 DEBUG Namespace(MASH_sketch=1000, N50_weight=0.5, P_ani=0.9, S_algorithm='ANImf', S_ani=0.99, SkipMash=False, SkipSecondary=False, cent_index=None, checkM_method='lineage_wf', clusterAlg='average', completeness=75, completeness_weight=1, contamination=25, contamination_weight=5, cov_thresh=0.1, coverage_method='larger', debug=True, genomeInfo=None, genomes=['*.fasta'], length=50000, n_PRESET='normal', noQualityFiltering=False, operation='dereplicate', overwrite=False, percent='50', processors=3, run_tax=False, size_weight=0, strain_heterogeneity_weight=1, tax_method='percent', warn_aln=0.25, warn_dist=0.25, warn_sim=0.98, work_directory='outdR/')
04-12 20:52 DEBUG Starting the dereplicate operation
04-12 20:52 INFO *** ..:: dRep dereplicate Step 1. Filter ::..
04-12 20:52 DEBUG Loading work directory in filter
04-12 20:52 DEBUG Located: /home/linuxbrew/.linuxbrew/bin/outdR Datatables: [] Cluster files: [] Arguments: []
04-12 20:52 DEBUG Validating filter arguments
04-12 20:52 INFO Will filter the genome list
04-12 20:52 INFO Calculating genome info of genomes
04-12 20:52 DEBUG Filtering genomes by size
04-12 20:52 INFO 100.00% of genomes passed length filtering
04-12 20:52 DEBUG Running CheckM
04-12 20:52 INFO Running prodigal
04-12 20:52 INFO Past prodigal runs found- will not re-run
04-12 20:52 INFO Running checkM
04-12 20:52 DEBUG Running CheckM with command: ['/usr/local/bin/checkm', 'lineage_wf', '/home/linuxbrew/.linuxbrew/bin/outdR/data/prodigal/', '/home/linuxbrew/.linuxbrew/bin/outdR/data/checkM/checkM_outdir/', '-f', '/home/linuxbrew/.linuxbrew/bin/outdR/data/checkM/checkM_outdir//results.tsv', '--tab_table', '-t', '3', '--pplacer_threads', '3', '-g', '-x', 'faa']
04-12 20:52 DEBUG Running CheckM with command: ['/usr/local/bin/checkm', 'qa', '/home/linuxbrew/.linuxbrew/bin/outdR/data/checkM/checkM_outdir/lineage.ms', '/home/linuxbrew/.linuxbrew/bin/outdR/data/checkM/checkM_outdir/', '-f', '/home/linuxbrew/.linuxbrew/bin/outdR/data/checkM/checkM_outdir/Chdb.tsv', '-t', '3', '--tab_table', '-o', '2']
04-12 20:52 ERROR !!! checkM failed !!!
If using pyenv, make sure both python2 and python3 are available (for example: pyenv global 3.5.1 2.7.9)

2) Six new output files were produced when I ran the program with the -d parameter:

2.1) 2018-04-12_20.52.27.650992.CMD:

/usr/local/bin/checkm lineage_wf /home/linuxbrew/.linuxbrew/bin/outdR/data/prodigal/ /home/linuxbrew/.linuxbrew/bin/outdR/data/checkM/checkM_outdir/ -f /home/linuxbrew/.linuxbrew/bin/outdR/data/checkM/checkM_outdir//results.tsv --tab_table -t 3 --pplacer_threads 3 -g -x faa

2.2) 2018-04-12_20.52.27.650992.STDERR:


[CheckM - tree] Placing bins in reference genome tree.


Identifying marker genes in 38 bins with 3 threads:
Process Process-2:ssing 0 of 38 (0.00%) bins.
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python2.7/dist-packages/checkm/markerGeneFinder.py", line 122, in __processBin
    hmmModelFile = markerSetParser.createHmmModelFile(binId, markerFile)
  File "/usr/local/lib/python2.7/dist-packages/checkm/markerSets.py", line 330, in createHmmModelFile
    markerFileType = self.markerFileType(markerFile)
  File "/usr/local/lib/python2.7/dist-packages/checkm/markerSets.py", line 430, in markerFileType
    with open(markerFile, 'r') as f:
IOError: [Errno 2] No such file or directory: u'/usr/local/lib/python2.7/dist-packages/checkm/hmms/phylo.hmm'
Process Process-3:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python2.7/dist-packages/checkm/markerGeneFinder.py", line 122, in __processBin
    hmmModelFile = markerSetParser.createHmmModelFile(binId, markerFile)
  File "/usr/local/lib/python2.7/dist-packages/checkm/markerSets.py", line 330, in createHmmModelFile
    markerFileType = self.markerFileType(markerFile)
  File "/usr/local/lib/python2.7/dist-packages/checkm/markerSets.py", line 430, in markerFileType
    with open(markerFile, 'r') as f:
IOError: [Errno 2] No such file or directory: u'/usr/local/lib/python2.7/dist-packages/checkm/hmms/phylo.hmm'
Process Process-4:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python2.7/dist-packages/checkm/markerGeneFinder.py", line 122, in __processBin
    hmmModelFile = markerSetParser.createHmmModelFile(binId, markerFile)
  File "/usr/local/lib/python2.7/dist-packages/checkm/markerSets.py", line 330, in createHmmModelFile
    markerFileType = self.markerFileType(markerFile)
  File "/usr/local/lib/python2.7/dist-packages/checkm/markerSets.py", line 430, in markerFileType
    with open(markerFile, 'r') as f:
IOError: [Errno 2] No such file or directory: u'/usr/local/lib/python2.7/dist-packages/checkm/hmms/phylo.hmm'

Saving HMM info to file.

Calculating genome statistics for 38 bins with 3 threads: Finished processing 38 of 38 (100.00%) bins.

Extracting marker genes to align.
[Error] Models must be parsed before identifying HMM hits.
Traceback (most recent call last):
  File "/usr/local/bin/checkm", line 709, in <module>
    checkmParser.parseOptions(args)
  File "/usr/local/lib/python2.7/dist-packages/checkm/main.py", line 1253, in parseOptions
    self.tree(options)
  File "/usr/local/lib/python2.7/dist-packages/checkm/main.py", line 156, in tree
    os.path.join(options.out_folder, 'storage', 'tree')
  File "/usr/local/lib/python2.7/dist-packages/checkm/hmmerAligner.py", line 104, in makeAlignmentToPhyloMarkers
    resultsParser.parseBinHits(outDir, hmmTableFile, False, bIgnoreThresholds, evalueThreshold, lengthThreshold)
  File "/usr/local/lib/python2.7/dist-packages/checkm/main.py", line 1213, in parseOptions
    if options.bVerbose:
AttributeError: 'Namespace' object has no attribute 'bVerbose'

2.3) 2018-04-12_20.52.27.650992.STDOUT:

Unexpected error: <type 'exceptions.AttributeError'>

2.4) 2018-04-12_20.52.30.154535.CMD:

/usr/local/bin/checkm qa /home/linuxbrew/.linuxbrew/bin/outdR/data/checkM/checkM_outdir/lineage.ms /home/linuxbrew/.linuxbrew/bin/outdR/data/checkM/checkM_outdir/ -f /home/linuxbrew/.linuxbrew/bin/outdR/data/checkM/checkM_outdir/Chdb.tsv -t 3 --tab_table -o 2

2.5) 2018-04-12_20.52.30.154535.STDERR:


[CheckM - qa] Tabulating genome statistics.


Calculating AAI between multi-copy marker genes.

Reading HMM info from file.
Traceback (most recent call last):
  File "/usr/local/bin/checkm", line 709, in <module>
    checkmParser.parseOptions(args)
  File "/usr/local/lib/python2.7/dist-packages/checkm/main.py", line 1243, in parseOptions
    self.qa(options)
  File "/usr/local/lib/python2.7/dist-packages/checkm/main.py", line 396, in qa
    binIdToModels = markerSetParser.loadBinModels(hmmModelInfoFile)
  File "/usr/local/lib/python2.7/dist-packages/checkm/markerSets.py", line 537, in loadBinModels
    with gzip.open(filename, 'rb') as f:
  File "/usr/lib/python2.7/gzip.py", line 34, in open
    return GzipFile(filename, mode, compresslevel)
  File "/usr/lib/python2.7/gzip.py", line 94, in __init__
    fileobj = self.myfileobj = __builtin__.open(filename, mode or 'rb')
IOError: [Errno 2] No such file or directory: '/home/linuxbrew/.linuxbrew/bin/outdR/data/checkM/checkM_outdir/storage/checkm_hmm_info.pkl.gz'

2.6) 2018-04-12_20.52.30.154535.STDOUT:

Unexpected error: <type 'exceptions.IOError'>

3) checkM -h shows the following:

...::: CheckM v1.0.11 :::...

Lineage-specific marker set:
  tree -> Place bins in the reference genome tree
  tree_qa -> Assess phylogenetic markers found in each bin
  lineage_set -> Infer lineage-specific marker sets for each bin

Taxonomic-specific marker set:
  taxon_list -> List available taxonomic-specific marker sets
  taxon_set -> Generate taxonomic-specific marker set

Apply marker set to genome bins:
  analyze -> Identify marker genes in bins
  qa -> Assess bins for contamination and completeness

Common workflows (combines above commands):
  lineage_wf -> Runs tree, lineage_set, analyze, qa
  taxonomy_wf -> Runs taxon_set, analyze, qa

Bin QA plots:
  bin_qa_plot -> Bar plot of bin completeness, contamination, and strain heterogeneity

Reference distribution plots:
  gc_plot -> Create GC histogram and delta-GC plot
  coding_plot -> Create coding density (CD) histogram and delta-CD plot
  tetra_plot -> Create tetranucleotide distance (TD) histogram and delta-TD plot
  dist_plot -> Create image with GC, CD, and TD distribution plots together

General plots:
  nx_plot -> Create Nx-plots
  len_plot -> Cumulative sequence length plot
  len_hist -> Sequence length histogram
  marker_plot -> Plot position of marker genes on sequences
  par_plot -> Parallel coordinate plot of GC and coverage
  gc_bias_plot -> Plot bin coverage as a function of GC

Sequence subspace plots:
  cov_pca -> PCA plot of coverage profiles
  tetra_pca -> PCA plot of tetranucleotide signatures

Bin exploration and modification:
  unique -> Ensure no sequences are assigned to multiple bins
  merge -> Identify bins with complementary sets of marker genes
  bin_compare -> Compare two sets of bins (e.g., from alternative binning methods)
  bin_union -> [Experimental] Merge multiple binning efforts into a single bin set
  modify -> [Experimental] Modify sequences in a bin
  outliers -> [Experimental] Identify outlier in bins relative to reference distributions

Utility functions:
  unbinned -> Identify unbinned sequences
  coverage -> Calculate coverage of sequences
  tetra -> Calculate tetranucleotide signature of sequences
  profile -> Calculate percentage of reads mapped to each bin
  join_tables -> Join tab-separated value tables containing bin information
  ssu_finder -> Identify SSU (16S/18S) rRNAs in sequences

Use 'checkm data setRoot' to specify the location of CheckM database files.

Usage: checkm <command> -h for command specific help
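Both STDERR files in item 2 end with a file that could not be opened, so a quick scan for "No such file or directory" messages is one way to see what checkM is missing. A rough sketch; missing_files is a made-up helper name, and the directory holding the *.STDERR files is passed in by hand because its location is not shown in this thread:

```python
import re
import sys
from pathlib import Path

# Matches Python 2 style messages such as:
#   IOError: [Errno 2] No such file or directory: u'/path/to/file'
MISSING_FILE = re.compile(r"No such file or directory: u?'([^']+)'")


def missing_files(stderr_text):
    """Return the distinct missing paths named in the text, in first-seen order."""
    seen = []
    for path in MISSING_FILE.findall(stderr_text):
        if path not in seen:
            seen.append(path)
    return seen


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python this_script.py /path/to/directory/with/STDERR/files
    for stderr_file in sorted(Path(sys.argv[1]).glob("*.STDERR")):
        for path in missing_files(stderr_file.read_text(errors="replace")):
            print(f"{stderr_file.name}: missing {path}")
```

On the output pasted above, the first missing path would be the phylo.hmm file, which points at checkM's data files and not at the genomes or at dRep itself.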

MrOlm commented 6 years ago

Hello,

Thanks for the detailed debug information. I think you're right that 2.2 lays out the problem, which seems to be related to checkM being out of date.

I found someone else with a similar problem here: https://github.com/Ecogenomics/CheckM/issues/72

There are two things you could try. First, try the setRoot command: checkm data setRoot /some/location/to/store/files/. This will download the files checkM needs to run, and tell checkM where they're downloaded.

If that doesn't work, I would try updating checkM. The way to do this (according to https://github.com/Ecogenomics/CheckM/wiki/Installation#how-to-install-checkm ) is:

> sudo pip install checkm-genome --upgrade --no-deps
> sudo checkm data update
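Before or after running setRoot, it may help to confirm that the chosen data folder really contains the HMM file named in the traceback (hmms/phylo.hmm). A small sketch under one assumption: that the file sits at the same relative path under the data root as it does under the package directory in the error message. missing_data_files and EXPECTED_FILES are hypothetical names, not part of checkM:

```python
import tempfile
from pathlib import Path

# Assumption: relative path taken from the traceback, not from CheckM's docs.
EXPECTED_FILES = ("hmms/phylo.hmm",)


def missing_data_files(data_root, expected=EXPECTED_FILES):
    """Return the expected relative paths that are not files under data_root."""
    root = Path(data_root)
    return [rel for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    # Demonstrate on a throwaway directory rather than a real CheckM install.
    with tempfile.TemporaryDirectory() as tmp:
        print(missing_data_files(tmp))  # the HMM file is not there yet
        hmm_dir = Path(tmp) / "hmms"
        hmm_dir.mkdir()
        (hmm_dir / "phylo.hmm").touch()
        print(missing_data_files(tmp))  # nothing should be reported now
```

An empty list only means the listed files exist; it does not prove that checkM has been told to look in that folder, which is what setRoot is for.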

Best, -Matt

Rafael12692 commented 6 years ago

Hello Matt,

I tried the setRoot command and it solved the problem. I had downloaded the checkM data but forgot to tell the program where it was. Thank you for helping!

Best regards