cortes-ciriano-lab / savana

Somatic structural variant caller for long-read data
Apache License 2.0

ValueError: node array from the pickle has an incompatible dtype #27

Open waltergallegog opened 1 year ago

waltergallegog commented 1 year ago

Hello,

Using the latest version from bioconda, I'm getting the following error:

  File "/mnt/trcanmed/wgallego/pipesomatic/work/conda/env-8bb8117cce1d50927eeef0fd0720738e/lib/python3.10/site-packages/savana/classify.py", line 238, in classify_by_model
      loaded_model = pickle.load(open(args.model, "rb"))
    File "sklearn/tree/_tree.pyx", line 714, in sklearn.tree._tree.Tree.__setstate__
    File "sklearn/tree/_tree.pyx", line 1418, in sklearn.tree._tree._check_node_ndarray
  ValueError: node array from the pickle has an incompatible dtype:
  - expected: {'names': ['left_child', 'right_child', 'feature', 'threshold', 'impurity', 'n_node_samples', 'weighted_n_node_samples', 'missing_go_to_left'], 'formats': ['<i8', '<i8', '<i8', '<f8', '<f8', '<i8', '<f8', 'u1'], 'offsets': [0, 8, 16, 24, 32, 40, 48, 56], 'itemsize': 64}
  - got     : [('left_child', '<i8'), ('right_child', '<i8'), ('feature', '<i8'), ('threshold', '<f8'), ('impurity', '<f8'), ('n_node_samples', '<i8'), ('weighted_n_node_samples', '<f8')]

From https://discuss.streamlit.io/t/valueerror-node-array-from-the-pickle-has-an-incompatible-dtype/46682/4 it seems the error could be due to incompatible scikit-learn versions.

The following log seems to agree with this idea:

  /mnt/trcanmed/wgallego/pipesomatic/work/conda/env-8bb8117cce1d50927eeef0fd0720738e/lib/python3.10/site-packages/sklearn/base.py:347: InconsistentVersionWarning: Trying to unpickle estimator DecisionTreeClassifier from version 1.2.2 when using version 1.3.0. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
  https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
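The mismatch reported in this warning (model pickled with 1.2.2, loaded with 1.3.0) could in principle be detected before unpickling, so the user sees a readable message instead of the dtype error. The following is only a minimal sketch, not savana's code: `MODEL_SKLEARN_VERSION`, `major_minor`, and `check_sklearn_version` are hypothetical names, and comparing only major.minor is a heuristic assumption, since scikit-learn does not promise pickle compatibility across versions.

```python
import re
from importlib import metadata
from typing import Optional, Tuple

# Assumption: the version the bundled model was pickled with,
# taken from the InconsistentVersionWarning quoted above.
MODEL_SKLEARN_VERSION = "1.2.2"


def major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '1.3.0'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def check_sklearn_version(pickled_with: str = MODEL_SKLEARN_VERSION,
                          installed: Optional[str] = None) -> None:
    """Raise a readable error if the installed minor version differs."""
    if installed is None:
        # Look up the installed distribution without importing sklearn.
        installed = metadata.version("scikit-learn")
    want = major_minor(pickled_with)
    if major_minor(installed) != want:
        raise RuntimeError(
            f"model was pickled with scikit-learn {pickled_with}, "
            f"but {installed} is installed; "
            f"try a {want[0]}.{want[1]}.x release"
        )
```

Calling `check_sklearn_version()` just before the `pickle.load` call would turn the failure in this report into a single-line message naming both versions.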

I installed savana using mamba and the bioconda channel:

savana                    1.0.3              pyhdfd78af_0    bioconda
scikit-learn              1.3.0           py310hf7d194e_0    conda-forge

The current requirement in the bioconda package recipe for savana is:

depends scikit-learn:    >=1.2.2

Perhaps the recipe needs to be updated, or the ONT model re-pickled with the new version of scikit-learn. Thanks, Walter.

Here is the full log:

  Version 1.0.3 - beta
  Source: /mnt/trcanmed/wgallego/pipesomatic/work/conda/env-8bb8117cce1d50927eeef0fd0720738e/lib/python3.10/site-packages/savana/savana.py

  Running as sample ERR2752452.merged.sorted.aligned
  Using genome.fa.fai as reference fasta index
  Using multiprocessing with 20 threads

 Submitting 172 "get_potential_breakpoints" tasks to 20 worker threads
  Identified potential breakpoints        6092.765 seconds
  Clustered potential breakpoints         252.014 seconds
  Called consensus breakpoints            208.91 seconds
  Length after: 150775
  Total breakpoints: 150775 (27337 insertions)
  Using 685 as binsize, there are 429 redistributed intervals
  Max binsize 1177, min binsize 1
  Setting maxtasksperchild to 8
  /mnt/trcanmed/wgallego/pipesomatic/work/conda/env-8bb8117cce1d50927eeef0fd0720738e/lib/python3.10/site-packages/savana/train.py:37: FutureWarning: Returning a DataFrame from Series.apply when the supplied function returns a Series is deprecated and will be removed in a future version.
    data_matrix[['TUMOUR_DP_0', 'TUMOUR_DP_1']] = data_matrix['TUMOUR_DP'].apply(pd.Series)
  /mnt/trcanmed/wgallego/pipesomatic/work/conda/env-8bb8117cce1d50927eeef0fd0720738e/lib/python3.10/site-packages/savana/train.py:38: FutureWarning: Returning a DataFrame from Series.apply when the supplied function returns a Series is deprecated and will be removed in a future version.
    data_matrix[['NORMAL_DP_0', 'NORMAL_DP_1']] = data_matrix['NORMAL_DP'].apply(pd.Series)
  /mnt/trcanmed/wgallego/pipesomatic/work/conda/env-8bb8117cce1d50927eeef0fd0720738e/lib/python3.10/site-packages/sklearn/base.py:347: InconsistentVersionWarning: Trying to unpickle estimator DecisionTreeClassifier from version 1.2.2 when using version 1.3.0. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
  https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
    warnings.warn(
  Traceback (most recent call last):
    File "/mnt/trcanmed/wgallego/pipesomatic/work/conda/env-8bb8117cce1d50927eeef0fd0720738e/bin/savana", line 10, in <module>
  Added local depth to breakpoints        5023.865 seconds
  Output consensus breakpoints            118.051 seconds
  Total time to call raw variants         11762.333 seconds

  Using ONT somatic only model to classify variants
  First time using model - will untar /mnt/trcanmed/wgallego/pipesomatic/work/conda/env-8bb8117cce1d50927eeef0fd0720738e/lib/python3.10/site-packages/savana/models/ont-somatic.tar.gz
  Loaded raw breakpoints                  87.857 seconds
      sys.exit(main())
    File "/mnt/trcanmed/wgallego/pipesomatic/work/conda/env-8bb8117cce1d50927eeef0fd0720738e/lib/python3.10/site-packages/savana/savana.py", line 303, in main
      args.func(args)
    File "/mnt/trcanmed/wgallego/pipesomatic/work/conda/env-8bb8117cce1d50927eeef0fd0720738e/lib/python3.10/site-packages/savana/savana.py", line 174, in savana_main
      savana_classify(args)
    File "/mnt/trcanmed/wgallego/pipesomatic/work/conda/env-8bb8117cce1d50927eeef0fd0720738e/lib/python3.10/site-packages/savana/savana.py", line 117, in savana_classify
      classify.classify_by_model(args, checkpoints, time_str)
    File "/mnt/trcanmed/wgallego/pipesomatic/work/conda/env-8bb8117cce1d50927eeef0fd0720738e/lib/python3.10/site-packages/savana/classify.py", line 238, in classify_by_model
      loaded_model = pickle.load(open(args.model, "rb"))
    File "sklearn/tree/_tree.pyx", line 714, in sklearn.tree._tree.Tree.__setstate__
    File "sklearn/tree/_tree.pyx", line 1418, in sklearn.tree._tree._check_node_ndarray
  ValueError: node array from the pickle has an incompatible dtype:
  - expected: {'names': ['left_child', 'right_child', 'feature', 'threshold', 'impurity', 'n_node_samples', 'weighted_n_node_samples', 'missing_go_to_left'], 'formats': ['<i8', '<i8', '<i8', '<f8', '<f8', '<i8', '<f8', 'u1'], 'offsets': [0, 8, 16, 24, 32, 40, 48, 56], 'itemsize': 64}
  - got     : [('left_child', '<i8'), ('right_child', '<i8'), ('feature', '<i8'), ('threshold', '<f8'), ('impurity', '<f8'), ('n_node_samples', '<i8'), ('weighted_n_node_samples', '<f8')]

MartinezRuiz-Carlos commented 1 year ago

Hi all, exact same issue here

waltergallegog commented 1 year ago

@MartinezRuiz-Carlos if it helps, I was able to run savana after downgrading the scikit-learn version to 1.2.2 in my conda env.

MartinezRuiz-Carlos commented 1 year ago

Ah nice. I am trying to run it from a manual install, and it seems to be doing all right so far, but I will give that a go if I run into the same issue. Thanks!

helrick commented 1 year ago

Hi there, thanks @waltergallegog for the detailed description of this issue. I've opened a pull request on the bioconda-recipes repo that should pin scikit-learn to 1.2.x and prevent conda/mamba from installing 1.3.0 or greater. I'll update here once it's been merged and I've tested that it works correctly.
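To illustrate why the original `>=1.2.2` requirement let 1.3.0 through while a 1.2.x pin would not, here is a small sketch of the two constraints. The exact pin used in the pull request is not shown in this thread; the upper bound `1.3` below is an assumption, and `parse` and `satisfies` are hypothetical helpers, not conda's actual solver logic.

```python
def parse(version):
    """Turn '1.2.2' into (1, 2, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def satisfies(version, lower, upper=None):
    """True if lower <= version, and version < upper when upper is given."""
    parsed = parse(version)
    if parsed < parse(lower):
        return False
    if upper is not None and parsed >= parse(upper):
        return False
    return True


for candidate in ("1.2.2", "1.3.0"):
    unpinned = satisfies(candidate, "1.2.2")        # original: >=1.2.2
    pinned = satisfies(candidate, "1.2.2", "1.3")   # assumed: >=1.2.2,<1.3
    print(candidate, unpinned, pinned)
# prints:
# 1.2.2 True True
# 1.3.0 True False
```

With only a lower bound, the solver is free to pick the newest release, which is how 1.3.0 ended up in the environment above.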