cov-lineages / pangoLEARN

Store of the trained model for pangolin to access.
GNU General Public License v3.0

Error in rule pangolearn after 04/09/2022 update #84

Closed egenomics closed 2 years ago

egenomics commented 2 years ago

Hi, we have been using pangolin (through conda) for a while now. With the last pangoLEARN update our pipeline broke. We are using:

pangolin: 3.1.20
pangolearn: 2022-04-09
constellations: v0.1.7
scorpio: 0.3.16
pango-designation used by pangoLEARN/Usher: v1.3
pango-designation aliases: 1.6

We get the following error:

All dependencies satisfied.
The query file is:/datos/MiSeq/MICRO/COVID/analysis/2022_04_19_R2247/consensus/consensus.R2247.fna
Running sequence QC
Number of sequences detected: 48
Total passing QC: 44

Data files found:
Trained model: /root/miniconda3/envs/pangolin_test/lib/python3.8/site-packages/pangoLEARN/data/decisionTree_v1.joblib
Header file: /root/miniconda3/envs/pangolin_test/lib/python3.8/site-packages/pangoLEARN/data/decisionTreeHeaders_v1.joblib
Designated hash: /root/miniconda3/envs/pangolin_test/lib/python3.8/site-packages/pangoLEARN/data/lineages.hash.csv

Job stats:
job                     count    min threads    max threads
add_failed_seqs             1              1              1
align_to_reference          1              1              1
all                         1              1              1
generate_report             1              1              1
get_constellations          1              1              1
hash_sequence_assign        1              1              1
pangolearn                  1              1              1
scorpio                     1              1              1
total                       8              1              1

loading model 04/19/2022, 14:24:50
/root/miniconda3/envs/pangolin_test/lib/python3.8/site-packages/sklearn/base.py:329: UserWarning: Trying to unpickle estimator DecisionTreeClassifier from version 1.0.1 when using version 0.23.1. This might lead to breaking code or invalid results. Use at your own risk.
  warnings.warn(
processing block of 44 sequences 04/19/2022, 14:24:51
[Tue Apr 19 14:24:52 2022]
Error in rule pangolearn:
    jobid: 0
    output: /tmp/tmpz2w2ggj4/lineage_report.pass_qc.csv

RuleException:
AttributeError in line 112 of /root/miniconda3/envs/pangolin_test/lib/python3.8/site-packages/pangolin/scripts/pangolearn.smk:
'DecisionTreeClassifier' object has no attribute 'n_features_'
  File "/root/miniconda3/envs/pangolin_test/lib/python3.8/site-packages/pangolin/scripts/pangolearn.smk", line 112, in __rule_pangolearn
  File "/root/miniconda3/envs/pangolin_test/lib/python3.8/site-packages/pangolin/pangolearn/pangolearn.py", line 170, in assign_lineage
  File "/root/miniconda3/envs/pangolin_test/lib/python3.8/site-packages/sklearn/tree/_classes.py", line 922, in predict_proba
  File "/root/miniconda3/envs/pangolin_test/lib/python3.8/site-packages/sklearn/tree/_classes.py", line 395, in _validate_X_predict
  File "/root/miniconda3/envs/pangolin_test/lib/python3.8/concurrent/futures/thread.py", line 57, in run
Exiting because a job execution failed. Look above for error message
Exiting because a job execution failed. Look above for error message

corneliusroemer commented 2 years ago

This looks like a duplicate of: https://github.com/cov-lineages/pangolin/issues/427

I've encountered this error before, too. Try to reinstall your environment.

The UserWarning in your log hints at a possible reason: you may not be using the scikit-learn version that pangolin expects. Setting up a fresh environment with pangolin should fix this.

If reinstalling doesn't solve the problem, let me know and share the exact packages and versions installed in your environment.
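
As a rough way to read that warning, the sketch below pulls the two version numbers out of the message text and flags when they fall on different sides of a major release. The helper names are made up for illustration, and comparing major version numbers is only a heuristic assumed here; scikit-learn does not promise that pickled models load across any two versions.

```python
import re

# Matches the scikit-learn message quoted in this thread, e.g.
# "Trying to unpickle estimator X from version 1.0.1 when using version 0.23.1."
_WARNING_RE = re.compile(
    r"Trying to unpickle estimator (?P<estimator>\w+) "
    r"from version (?P<trained>\d+(?:\.\w+)*) "
    r"when using version (?P<installed>\d+(?:\.\w+)*)"
)


def parse_unpickle_warning(message):
    """Return (estimator, trained_version, installed_version), or None if no match."""
    match = _WARNING_RE.search(message)
    if match is None:
        return None
    return match.group("estimator"), match.group("trained"), match.group("installed")


def major_version(version):
    """Leading numeric component of a version string such as '0.23.1'."""
    return int(version.split(".")[0])


def crosses_major_release(trained, installed):
    """Heuristic (an assumption of this sketch): different major numbers are suspicious."""
    return major_version(trained) != major_version(installed)


message = (
    "UserWarning: Trying to unpickle estimator DecisionTreeClassifier "
    "from version 1.0.1 when using version 0.23.1. This might lead to "
    "breaking code or invalid results. Use at your own risk."
)
estimator, trained, installed = parse_unpickle_warning(message)
print(estimator, trained, installed, crosses_major_release(trained, installed))
```

For the log in this issue, the model was dumped with 1.0.1 and loaded with 0.23.1, so the check reports a mismatch.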

wm75 commented 2 years ago

@aineniamh @corneliusroemer I think this issue deserves reopening.

This error arises because the most recent pangoLEARN models have apparently been built with a more recent version of scikit-learn. The bioconda recipe for pangolin 3.1.20 pins its scikit-learn dependency to 0.23.1 (https://github.com/bioconda/bioconda-recipes/blob/a574d43146db09006d462746aa1d8716c77404b4/recipes/pangolin/meta.yaml#L25), and due to internal changes in scikit-learn, models dumped with versions > 1.0 will not work with that older version (compare https://scikit-learn.org/stable/modules/model_persistence.html#security-maintainability-limitations). Conversely, when you load a model that was dumped with a pre-1.0 version of scikit-learn using a version > 1.0, you will see a warning like this one:

UserWarning: Trying to unpickle estimator DecisionTreeClassifier from version 0.24.2 when using version 1.0.2. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/modules/model_persistence.html#security-maintainability-limitations

and though I have no idea whether the model would really be compromised, that doesn't sound encouraging.
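
One way to avoid running with a possibly mismatched model is to treat that warning as a hard failure at load time. The following is a minimal sketch, not something pangolin does: `load_strict` and `ModelVersionMismatch` are hypothetical names, and the loader is passed in as a plain callable (in practice it might be `joblib.load`) so the example stays self-contained.

```python
import warnings

# Text that identifies the scikit-learn version-mismatch warning quoted above.
MISMATCH_MARKER = "Trying to unpickle estimator"


class ModelVersionMismatch(RuntimeError):
    """Hypothetical error type used only in this sketch."""


def load_strict(loader, path):
    """Call loader(path); raise if the unpickle version warning was emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, even repeated ones
        model = loader(path)
    for w in caught:
        if MISMATCH_MARKER in str(w.message):
            raise ModelVersionMismatch(str(w.message))
    # Re-emit unrelated warnings so they are not silently swallowed.
    for w in caught:
        warnings.warn_explicit(w.message, w.category, w.filename, w.lineno)
    return model


def fake_loader(path):
    """Stand-in for a real loader; emits the same kind of warning as the log."""
    warnings.warn(
        "Trying to unpickle estimator DecisionTreeClassifier "
        "from version 0.24.2 when using version 1.0.2.",
        UserWarning,
    )
    return "model"


try:
    load_strict(fake_loader, "decisionTree_v1.joblib")
except ModelVersionMismatch as err:
    print("refusing to use model:", err)
```

Failing early like this would surface the problem at "loading model" rather than later inside `predict_proba`.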

Since dumped scikit-learn models are generally not guaranteed to be reloadable with different versions, I think the bioconda approach of pinning a given pangolin release to a specific version of scikit-learn is the right thing to do, but it requires that:

For 3.1.20 I'm not sure what should be done now. The fact is that models since 2022-04-09 won't work with fresh conda installs of pangolin 3.1.20, but there's no simple fix I can see. The question is whether you'd want to switch back to building future models with scikit-learn 0.24 again, as you did previously?

More importantly, however, the same logic holds for pangolin v4 and its pangoLEARN part of pangolin-data, too. Again, it would be good to have the scikit-learn version clearly stated, and most importantly not changing unnecessarily.
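
To illustrate what "clearly stated" could look like, here is a sketch that records the training-time scikit-learn version in a small JSON file next to the model and checks it before loading. The `.meta.json` sidecar file, its key name, and both function names are assumptions made for this example; neither pangoLEARN nor pangolin-data is known to ship such a file.

```python
import json
from pathlib import Path


def _meta_path(model_path):
    """Sidecar location assumed by this sketch: '<model file>.meta.json'."""
    return Path(str(model_path) + ".meta.json")


def write_model_metadata(model_path, sklearn_version):
    """Record the scikit-learn version that was used to train the model."""
    path = _meta_path(model_path)
    path.write_text(json.dumps({"scikit_learn_version": sklearn_version}))
    return path


def check_model_metadata(model_path, installed_version):
    """Raise if the installed version differs from the recorded training version."""
    recorded = json.loads(_meta_path(model_path).read_text())["scikit_learn_version"]
    if recorded != installed_version:
        raise RuntimeError(
            f"model was trained with scikit-learn {recorded}, "
            f"but {installed_version} is installed"
        )
    return recorded
```

At training time one would pass `sklearn.__version__` to `write_model_metadata`, and at load time pass the installed `sklearn.__version__` to `check_model_metadata`. An exact match is the strictest possible policy; a recipe that pins scikit-learn to the recorded version would satisfy it by construction.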

wm75 commented 2 years ago

@egenomics a solution to fix your issue (without updating to pangolin 4) is to:

This will enable you to run recent models of pangoLEARN with your pangolin. However, you'll see the UserWarning above when trying to run with older models.

aineniamh commented 2 years ago

I'd like to just give a warning that when we released pangolin 4.0, I intended to maintain pangoLEARN for a couple of months before phasing it out. This was just to give a buffer zone of time for people to update to pangolin 4.0. It's been about 5 weeks, so bear in mind that this repository won't be maintained much longer!

I think this is a good point about scikit-learn versions though, as this is relevant to the random forest model too (you don't see the warnings in 4.0, but the same issue exists: people's local version of scikit-learn may differ from what we've trained on). We can specify a particular version of scikit-learn if this might be an issue, but I've never noticed the version of scikit-learn affecting the inference from the model.

corneliusroemer commented 2 years ago

Thanks @wm75 for investigating and giving such a detailed description of what's behind the error here and in https://github.com/cov-lineages/pangolin/issues/427

The happy path is to use an up-to-date pangolin with up-to-date pangoLEARN models.

If for reproducibility one needs to use an old pangoLEARN model, one should use the corresponding pangolin version that was around at the time the model was trained.

@aineniamh Do I understand you correctly that pangoLEARN as a whole will be phased out?

aineniamh commented 2 years ago

Yeah, as it's no longer needed in pangolin 4.0, I'll archive the repo at some point in the not too distant future.