bacpop / PopPUNK

PopPUNK 👨‍🎤 (POPulation Partitioning Using Nucleotide Kmers)
https://www.bacpop.org/poppunk
Apache License 2.0

HDBSCAN models require consistent version of sklearn to open #213

Closed victoriapascal closed 2 years ago

victoriapascal commented 2 years ago

Currently trying to install poppunk version 2.3.0.

I used to have a conda installation with poppunk 2.3.0 and poppunk_sketch 1.7.4, and I created my own poppunk database so that I could run poppunk_assign, poppunk --fit-model and poppunk_visualise. I now need to re-create a conda environment that includes poppunk, but when I install the same version, poppunk_sketch does not seem to be included, and I believe it is needed to run poppunk_assign. I attach a file with the exact error I get when running the commands mentioned above (see poppunk_sketch_error.txt). I have also tried installing other versions in case that solved the problem, but I then get an error that is shown in the other file (see poppunk_2.4_error_message.txt; I'm not sure it is related, but just in case). I also tried installing via pip and copying the poppunk_sketch executable, but nothing seems to work. I would really appreciate some help here.

Thanks a lot for the help in advance!

Victoria

poppunk_2.4_error_message.txt poppunk_sketch_error.txt

johnlees commented 2 years ago

I'm not immediately sure what's going wrong here unfortunately. It seems like the version string from pp-sketchlib isn't as expected.

Can you run poppunk_sketch in your installation? If you start a python session and run import pp_sketchlib what happens?

Copying out some relevant parts for reference:

/bin/sh: 1: poppunk_sketch: not found
Traceback (most recent call last):
  File "/opt/conda/bin/poppunk_assign", line 10, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.9/site-packages/PopPUNK/assign.py", line 389, in main
    dbFuncs = setupDBFuncs(args, args.min_kmer_count, qc_dict)
  File "/opt/conda/lib/python3.9/site-packages/PopPUNK/utils.py", line 58, in setupDBFuncs
    version = checkSketchlibVersion()
  File "/opt/conda/lib/python3.9/site-packages/PopPUNK/sketchlib.py", line 49, in checkSketchlibVersion
    version = line.rstrip().decode().split(" ")[1]
IndexError: list index out of range

Looks like both sketchlib 2.0.0 and poppunk 2.4.0 are installed
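
The IndexError in the traceback above is consistent with the version command printing nothing to stdout: splitting an empty string on a space gives a one-element list, so index 1 does not exist. Here is a minimal sketch of that failure mode and a more defensive parse. Note that parse_version_line is a hypothetical helper, not PopPUNK's code, and the "name version" output shape is an assumption inferred from the parsing expression in the traceback.

```python
from typing import Optional


def parse_version_line(line: bytes) -> Optional[str]:
    """Return the second space-separated field of a version line, or None.

    Hypothetical helper: assumes output shaped like "<name> <version>".
    """
    fields = line.rstrip().decode().split(" ")
    # "".split(" ") == [""], so an empty line has no field at index 1.
    if len(fields) < 2 or not fields[1]:
        return None
    return fields[1]


# Empty stdout, as when the shell reports "poppunk_sketch: not found".
print(parse_version_line(b""))
# An assumed well-formed line.
print(parse_version_line(b"poppunk_sketch 1.7.4\n"))
```

Returning None instead of raising lets the caller report that the executable was missing, rather than surfacing an unrelated-looking IndexError.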

johnlees commented 2 years ago

Note: a similar error appears in #210

victoriapascal commented 2 years ago

Thanks for the quick answer. poppunk_sketch does not seem to be installed. In /conda/bin/ there are various executables (poppunk, poppunk_assign, poppunk_prune....) and *.py files (poppunk_add_weights.py, poppunk_batch_mst.py...), but no poppunk_sketch. I am able to import pp_sketchlib without problems though.
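
The two checks discussed here (is the executable on PATH, and is the Python module importable) can be run together. This is a rough sketch using only the standard library; describe_install is a hypothetical name, and it only reports what it finds without touching PopPUNK itself.

```python
import importlib.util
import shutil


def describe_install(executable: str, module: str) -> dict:
    """Report whether a CLI executable is on PATH and a module is importable."""
    return {
        # shutil.which returns the full path, or None if not on PATH.
        "executable_path": shutil.which(executable),
        # find_spec returns None for a missing top-level module.
        "module_found": importlib.util.find_spec(module) is not None,
    }


print(describe_install("poppunk_sketch", "pp_sketchlib"))
```

A result with executable_path set to None but module_found set to True would match the situation described in this comment.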

johnlees commented 2 years ago

And conda list in that environment shows pp-sketchlib >=2.0.0?

victoriapascal commented 2 years ago

Yes, that's the case (see list of packages attached).

conda_list.txt

johnlees commented 2 years ago

Could you try making a fresh environment with conda create -n pp_retry poppunk==2.4.0 pp-sketchlib==2.0.0 and see if you have any luck there?

victoriapascal commented 2 years ago

A new installation still gives me an error (see attached file). In this env, poppunk is installed but poppunk_sketch still is not.

pp_fresh_install.txt

johnlees commented 2 years ago

Ah, apologies, I now see the problem. From pp-sketchlib 2.0.0 poppunk_sketch was renamed to sketchlib. If you instead install pp-sketchlib v1.7.4 it should work ok.

From v2.5.0 this will be updated and fixed so they work together.
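
As an illustration of the rename only, the executable name can be chosen from the pp-sketchlib version string. This is a sketch under the assumption that the major version alone decides the name, as the 2.0.0 cutoff above suggests; sketch_executable_name is a hypothetical helper and is not necessarily how v2.5.0 handles it.

```python
def sketch_executable_name(sketchlib_version: str) -> str:
    """Pick the CLI name for a pp-sketchlib version (hypothetical helper).

    Assumes the rename to "sketchlib" applies from major version 2 onwards.
    """
    major = int(sketchlib_version.split(".")[0])
    return "sketchlib" if major >= 2 else "poppunk_sketch"


print(sketch_executable_name("1.7.4"))
print(sketch_executable_name("2.0.0"))
```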

The only thing I don't understand is why you aren't getting the version from the library file. Could you try running, in a python session:

import pp_sketchlib
pp_sketchlib.version
dir(pp_sketchlib)

victoriapascal commented 2 years ago

Indeed, installing pp-sketchlib v1.7.4 makes poppunk_sketch available, but I still get an error when I run poppunk_assign (see attached file). I also attach another file showing what I get from running the import and the other commands in my python session. How do you run poppunk_assign in poppunk version 2.4 then?

pp_assign_error.txt python_import.txt

johnlees commented 2 years ago

That's a different error now, which appears to be caused by scikit-learn changing their API. Can you try downgrading scikit-learn to v0.24? I'll need to put in a fix for this in future versions.
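
Before downgrading, it can help to confirm which scikit-learn version the environment actually has. Here is a small sketch using the standard library's importlib.metadata; both helper names are hypothetical, and the 1.0 boundary is taken from the summary at the end of this thread.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def predates_1_0(version: str) -> bool:
    """True if the major version is 0 (e.g. "0.24.2")."""
    return int(version.split(".")[0]) < 1


found = installed_version("scikit-learn")
if found is None:
    print("scikit-learn is not installed in this environment")
else:
    print(found, "predates 1.0:", predates_1_0(found))
```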

johnlees commented 2 years ago

Would you be able to attach the fit.pkl file you are using here?

victoriapascal commented 2 years ago

Thanks! Downgrading to v0.24 solved the issue indeed. I attach here the pkl I'm using for this run. vanAB_dataset_updated.dists.pkl.zip

johnlees commented 2 years ago

Ok, glad to hear this sorted the issue!

I think the pickle above is the sample labels/dists pickle. Do you also have a _fit.pkl you could share so I can look into the error?

victoriapascal commented 2 years ago

Do you mean this one? vanAB_dataset_updated_fit.pkl.zip

johnlees commented 2 years ago

Ok, that's great, thank you. I can now replicate the error.

johnlees commented 2 years ago

Just to state the problem and resolution here: sklearn changed its API between v0.24 and v1.0, so loading an HDBSCAN model created with a different version won't work. Loading an older HDBSCAN fit with a newer sklearn installation therefore throws the error reported above:

ModuleNotFoundError: No module named 'sklearn.neighbors._dist_metrics'

Most distributed models don't use this mode, so I don't foresee this being a big problem. I will add a note to the documentation that using such a model requires downgrading sklearn, or that you may generally want to run refine model to get a simpler and faster model in the first place. We could write a script to convert HDBSCAN models to the new version of the API, but this would require some digging into the pickle, and it's not immediately clear to me from the sklearn docs what they changed and how to update the dist_metrics part. I will do this only if it keeps cropping up and downgrading sklearn is no longer viable.
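
Until such a note or converter exists, a loader can at least turn the ModuleNotFoundError into a clearer hint. This is a minimal sketch, not PopPUNK's loading code: load_model is a hypothetical helper that takes the raw pickle bytes and re-raises with a message when the missing module belongs to the named package.

```python
import pickle


def load_model(raw: bytes, package: str = "sklearn"):
    """Unpickle raw bytes; explain missing-module errors from `package`.

    Hypothetical helper. Only unpickle files from a trusted source.
    """
    try:
        return pickle.loads(raw)
    except ModuleNotFoundError as err:
        # err.name is the module that could not be imported.
        top_level = (err.name or "").split(".")[0]
        if top_level == package:
            raise RuntimeError(
                f"Model references {err.name!r}, which this {package} "
                "installation does not provide; the model may have been "
                "fitted with a different version."
            ) from err
        raise


# A pickle with no package-specific objects loads unchanged.
print(load_model(pickle.dumps({"k": 2})))
```

In practice the bytes would come from reading the _fit.pkl file in binary mode. The original exception stays attached as the cause, so the full traceback is still available.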