AnantharamanLab / vRhyme

Binning Virus Genomes from Metagenomes
GNU General Public License v3.0

Scikit-learn version issue: `ValueError: node array from the pickle has an incompatible dtype` #30

Closed erfanshekarriz closed 5 months ago

erfanshekarriz commented 5 months ago

Hello there.

I was pretty stoked to use vRhyme for my viral binning protocol, but unfortunately I haven't been able to run the program without errors.

Initially I wasn't able to supply my own sorted BAM files produced by minimap2 -x sr (which is 3-4 times faster and also more accurate than bowtie2; I would recommend adding it as a mapping option). It gave me the same error as Issue #26 https://github.com/AnantharamanLab/vRhyme/issues/26 and did not produce the coverage table. I then gave up on that and tried the internal bowtie2 aligner instead, but ran into a different error.

This is the command I ran:

python workflow/software/vRhyme/vRhyme/vRhyme -i combined.viralcontigs.fa -r DRR093002_R1.fastq.gz DRR093002_R2.fastq.gz DRR093003_R1.fastq.gz DRR093003_R2.fastq.gz DRR093004_R1.fastq.gz DRR093004_R2.fastq.gz -l 2000 -t 32 -o deepsea/test_res/binning/viral/tmp/vrhyme/hydrothermal-vent-BMS --verbose

This time I checked the log_vRhyme_paired_reads.tsv file and the pairings are correct. I also checked that the vRhyme_coverage_values.tsv file is not empty.

Despite that, I get the following log and error:

Date:     2024-01-18 (y-m-d)
Start:    14:32:05   (h:m:s)
Program:  vRhyme v1.1.0

Time (min) |  Log                                                   
--------------------------------------------------------------------
0.0           Initializing and validating vRhyme parameters
0.01          Paired end read file(s) identified. Running bowtie2 on 3 set of paired files
5.13          Extracting coverage information from BAM files
5.82          Coverage extraction complete. Generating coverage table
5.82          Performing pairwise coverage comparisons
5.86          Running Prodigal on filtered sequences
5.95          Generating codon usage features
5.95          Generating nucleotide features
5.99          Performing pairwise distance calculations
6.0           Performing machine learning classification
workflow/software/vRhyme/vRhyme/vRhyme:16: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
  import pkg_resources
Traceback (most recent call last):
  File "workingdir/vRhyme/vRhyme/vRhyme", line 960, in <module>
    net_data = machine_stuff.machine_stuff(distances, presets, model_method, pairs_machine, cohen_machine, iterations, cohen_check)
  File "workingdir/vRhyme/vRhyme/scripts/machine_stuff.py", line 73, in machine_stuff
    model_ET = pickle.load(read_model_ET)
  File "sklearn/tree/_tree.pyx", line 728, in sklearn.tree._tree.Tree.__setstate__
  File "sklearn/tree/_tree.pyx", line 1434, in sklearn.tree._tree._check_node_ndarray
ValueError: node array from the pickle has an incompatible dtype:
- expected: {'names': ['left_child', 'right_child', 'feature', 'threshold', 'impurity', 'n_node_samples', 'weighted_n_node_samples', 'missing_go_to_left'], 'formats': ['<i8', '<i8', '<i8', '<f8', '<f8', '<i8', '<f8', 'u1'], 'offsets': [0, 8, 16, 24, 32, 40, 48, 56], 'itemsize': 64}
- got     : [('left_child', '<i8'), ('right_child', '<i8'), ('feature', '<i8'), ('threshold', '<f8'), ('impurity', '<f8'), ('n_node_samples', '<i8'), ('weighted_n_node_samples', '<f8')]

Any idea how we can resolve this? I read some blog posts saying it's related to the version of scikit-learn. If that is the case, can you pin the scikit-learn version in the conda installation command? That way we are guaranteed to fully reproduce your outcomes.
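For anyone reading the traceback above, here is a minimal sketch that compares the two field lists it reports. The field names are copied from the "expected" and "got" lines; the helper name `missing_fields` is hypothetical and not part of vRhyme or scikit-learn. It only shows which field the installed scikit-learn expects that the pickled model lacks, which is consistent with the model having been saved by a different scikit-learn release than the one installed.

```python
# Field names copied from the "expected" line of the traceback
# (the layout the installed scikit-learn wants).
expected_fields = [
    "left_child", "right_child", "feature", "threshold", "impurity",
    "n_node_samples", "weighted_n_node_samples", "missing_go_to_left",
]

# Field names copied from the "got" line (the layout stored in the pickle).
pickled_fields = [
    "left_child", "right_child", "feature", "threshold", "impurity",
    "n_node_samples", "weighted_n_node_samples",
]


def missing_fields(expected, got):
    """Return the fields present in `expected` but absent from `got`, in order."""
    got_set = set(got)
    return [name for name in expected if name not in got_set]


print(missing_fields(expected_fields, pickled_fields))
# prints ['missing_go_to_left']
```

The single extra field means the pickled tree nodes cannot be mapped onto the structure the installed library uses, so the load fails before any classification is attempted.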

If you need my raw sequence files I'm happy to somehow send them to you. I can also send you the bam files generated from minimap2.

Best,

Erfan

erfanshekarriz commented 5 months ago

I've resolved this issue by enforcing the scikit-learn version:

mamba create -c bioconda -n vRhyme python=3 networkx pandas numpy numba scikit-learn==1.2.2 pysam samtools mash mummer mmseqs2 prodigal bowtie2 bwa

Please help me update the installation instructions in the README.md file. I would also strongly recommend noting the versions of all the software above for long-term stability and reproducibility.
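Until the installation instructions pin a version, a small check can be run in the environment before starting vRhyme. The following is a sketch under stated assumptions, not part of vRhyme: the function names `installed_version` and `check_pin` are hypothetical, and the only version known from this thread to load the model is 1.2.2. Whether other releases also work is not established here, so the check uses an exact match.

```python
from importlib import metadata

# Version reported in this thread to load the bundled model without error.
PINNED_SKLEARN = "1.2.2"


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pin(installed, pinned):
    """Return a warning message if `installed` differs from `pinned`, else None."""
    if installed is None:
        return "scikit-learn is not installed"
    if installed != pinned:
        return (
            f"scikit-learn {installed} found, but the model is only known "
            f"to load with {pinned}"
        )
    return None


# Hypothetical usage: run inside the vRhyme environment before the main program.
problem = check_pin(installed_version("scikit-learn"), PINNED_SKLEARN)
if problem:
    print("WARNING:", problem)
```

An exact match is deliberately strict. It may warn about releases that would in fact load the model, but it avoids assuming compatibility that this thread does not confirm.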

Best,

Erfan