bcgsc / NanoSim

Nanopore sequence read simulator

Simulator script crashes when loading KDF pickle file #81

Closed jeizenga closed 3 years ago

jeizenga commented 4 years ago

I ran the read_analysis.py script followed by the simulator.py script in transcriptome mode. However, the simulator script crashes when trying to load the file suffixed with _unaligned_length.pkl. Here's the crash report:

2020-05-14 19:34:09: Read in reference
2020-05-14 19:34:17: Read in reference genome and create .fai index file
2020-05-14 19:34:17: Read in expression profile
2020-05-14 19:34:17: Read in IR markov model
2020-05-14 19:34:17: Read in GFF3 annotation file
2020-05-14 19:39:08: Read error profile
2020-05-14 19:39:08: Read KDF of unaligned reads
Traceback (most recent call last):
  File "../sim/NanoSim/src/simulator.py", line 1513, in <module>
    main()
  File "../sim/NanoSim/src/simulator.py", line 1503, in main
    read_profile(ref_g, ref_t, number, model_prefix, perfect, args.mode, strandness, exp, model_ir, "linear")
  File "../sim/NanoSim/src/simulator.py", line 414, in read_profile
    kde_unaligned = joblib.load(model_prefix + "_unaligned_length.pkl")
  File "/public/groups/vg/jeizenga/rna/whole_genome_eval/long_reads/sim/nanosimenv/lib/python2.7/site-packages/joblib/numpy_pickle.py", line 605, in load   
    obj = _unpickle(fobj, filename, mmap_mode)
  File "/public/groups/vg/jeizenga/rna/whole_genome_eval/long_reads/sim/nanosimenv/lib/python2.7/site-packages/joblib/numpy_pickle.py", line 529, in _unpickle
    obj = unpickler.load()
  File "/public/home/anovak/.local/lib64/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
KeyError: '\x00'

Thanks!
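For readers unfamiliar with this kind of traceback: the unpickler reads the stream one opcode byte at a time and looks each byte up in a dispatch table, so a byte that is not a valid opcode at that position aborts the load. The following is a minimal sketch using only the standard-library pickle module, not joblib or NanoSim's own files; the helper name try_unpickle and the sample dictionary are made up for illustration. It shows the class of failure only and does not establish why the NanoSim model file contained that byte.

```python
import pickle


def try_unpickle(raw_bytes):
    """Return (True, obj) on success or (False, error text) on failure."""
    try:
        return True, pickle.loads(raw_bytes)
    except Exception as exc:
        # The pure-Python pickle.py in the traceback above raised KeyError;
        # Python 3's unpickler reports the same situation as UnpicklingError.
        return False, "{}: {}".format(type(exc).__name__, exc)


# A valid stream written with an old protocol loads without trouble.
good = pickle.dumps({"bandwidth": 0.5}, protocol=2)
print(try_unpickle(good))

# The same stream with a stray NUL byte in front cannot be decoded.
print(try_unpickle(b"\x00" + good))
```

A file written by one serializer version and read by another can fail in the same way, because the reader may interpret header or payload bytes as opcodes.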

cheny19 commented 4 years ago

We have never seen this error before. Could you try running it with Python 3 and see whether it works?

fairliereese commented 4 years ago

I had a similar issue. The problem seems to be that joblib is imported from sklearn.externals in read_analysis.py but imported on its own in simulator.py. I encourage the authors to use only one of these libraries. This causes a mismatch between the version of the software used to pickle (save, in read_analysis.py) these files and the version used to unpickle (load, in simulator.py) them. I ended up replacing the "import joblib" line in simulator.py with "from sklearn.externals import joblib" and was able to get past that error, at least.
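The idea of importing joblib the same way in both scripts could be sketched as a small shared helper. This is an illustration, not NanoSim's actual code: the names first_importable and JOBLIB_CANDIDATES are hypothetical, and the preference order is an assumption chosen to match the workaround described above. As far as I know, sklearn.externals.joblib was deprecated in scikit-learn 0.21 and removed in 0.23, so the fallback to the standalone package matters on newer installs.

```python
import importlib


def first_importable(candidates):
    """Return (name, module) for the first module name that imports cleanly."""
    for name in candidates:
        try:
            return name, importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError(
        "none of these modules could be imported: " + ", ".join(candidates)
    )


# Hypothetical preference order: the copy bundled with older scikit-learn
# first, then the standalone joblib package.
JOBLIB_CANDIDATES = ["sklearn.externals.joblib", "joblib"]

# Demonstration with standard-library names so the sketch runs anywhere.
name, module = first_importable(["not_a_real_module_xyz", "json"])
print(name)

# In both read_analysis.py and simulator.py one would then write:
#   _, joblib = first_importable(JOBLIB_CANDIDATES)
# so that models are saved and loaded through the same module.
```

Sharing one helper keeps the choice of joblib in a single place, but it only helps for models trained in the same environment; files saved earlier with a different joblib may still need the matching version to load.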


cheny19 commented 4 years ago

Hi @fairliereese ,

Thanks for pointing out the reason. I'm working on it now and will make a new release soon.

Chen

cheny19 commented 4 years ago

Hi @fairliereese and @jeizenga ,

I tried your strategy and used "from sklearn.externals import joblib", but it actually failed in my Python environment. I'm not sure how you managed to get it to work. If you don't mind, please post your Python and sklearn versions here so I can look into this further. However, since we saved the model files with previous versions of joblib, I'd keep the code as it is to avoid confusion. If we have time in the future to re-train all the models, we may unify the saving and loading in a later release.

To help users who may run into a similar problem, I'll keep this issue open so it is easier to find.

Thanks, Chen
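One way to collect the version details requested above is a short script run inside the same environment that runs simulator.py. This is a sketch only: the function name environment_report and the default package list are assumptions, and it relies on each package exposing a __version__ attribute, which is common but not guaranteed.

```python
import importlib
import platform


def environment_report(packages=("sklearn", "joblib", "numpy")):
    """Map 'python' and each package name to a version string."""
    report = {"python": platform.python_version()}
    for name in packages:
        try:
            module = importlib.import_module(name)
            report[name] = getattr(module, "__version__", "unknown")
        except Exception:
            # Broad on purpose: a missing or broken install should be
            # reported, not crash the diagnostic.
            report[name] = "not installed"
    return report


for name, version in sorted(environment_report().items()):
    print("{}: {}".format(name, version))
```

Posting this output from both the environment that produced the .pkl files and the one that loads them makes a version mismatch easy to spot.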

Jingquan-Li commented 4 years ago

Hi @cheny19, I had the same problem as they described when I ran the simulator.py script in genome mode. I replaced the "import joblib" line in simulator.py with "from sklearn.externals import joblib" and then it worked. Here is my Python version: Python 2.7.15 | packaged by conda-forge | (default, Jul 2 2019, 00:39:44). The sklearn version is '0.20.3'.

cheny19 commented 4 years ago

Thank you so much @Leejquan, I guess that's the problem. I was using Python 3.7 in my test. We will make the imports consistent in the future, when we have time to re-train all the models or provide new ones.

fairliereese commented 4 years ago

I am running Python 3.6 and sklearn version 0.20.0 so I wouldn't attribute it solely to the Python version. I definitely recommend rerunning the models and making the import statements consistent across files for your next release though!

cheny19 commented 3 years ago

Hi @fairliereese, we have finally finished all the coding and testing needed to change the way model files are imported. We have also re-trained all the models, so we hope this problem is resolved in the NanoSim v3.0.0 pre-release. Please give it a shot and let me know how it works for you. Thanks for waiting so long.

MaxenceQueyrel commented 3 years ago

Hi @cheny19, I have tested the new version of NanoSim using CAMISIM and I think there is still one thing to correct. To run the script successfully I had to follow the recommendation of @fairliereese: replacing "import joblib" with "from sklearn.externals import joblib" worked for me.

Here is my environment: Python 3.6.10, scikit-learn 0.20.0, joblib 0.16.0.

Best, Maxence.

cheny19 commented 3 years ago

Hi @MaxenceQueyrel ,

Thanks for your interest! We have resolved this issue and released a new version of NanoSim. In the new release, we unified all the joblib imports, which resolved the problem. CAMISIM uses old NanoSim pre-trained models, which were saved with a different version of joblib; that may be why you had the problem. The new version of NanoSim can model and simulate metagenomic reads, and it comes with pre-trained metagenome models and new metagenome-specific features. You may want to give it a shot too.

Thanks, Chen

MaxenceQueyrel commented 3 years ago

Hi @cheny19, Thank you for your answer. Ok, I will try NanoSim with the newest version and let you know if I have an issue.