bcgsc / NanoSim

Nanopore sequence read simulator

Whether NanoSim could generate Nanopore reads by simulation #87

Open Jingquan-Li opened 4 years ago

Jingquan-Li commented 4 years ago

Hi @cheny19, I noticed that NanoSim is described as a Nanopore sequence read simulator. I am looking for software that can generate Nanopore reads from a given genome or FASTA file. When I looked through the NanoSim scripts, I could not find a script that does this. I would really appreciate your help!

Thanks.

cheny19 commented 4 years ago

Yes, you just need to run simulator.py to simulate ONT reads. You can find the help info in the README.md file.

Jingquan-Li commented 4 years ago

> Yes, you just need to run simulator.py to simulate ONT reads. You can find the help info in the README.md file.

If I just run simulator.py to simulate ONT reads without running step one, I encounter this error:

simulator.py genome -dna_type linear -rg 1M_12501.fa -c ssc_1M -max 90000 -min 20000 -n 1000 -t 6

Traceback (most recent call last):
  File "/home/huangtao/LJQ/conda/envs/metawrap-env/bin/simulator.py", line 1513, in <module>
    main()
  File "/home/huangtao/LJQ/conda/envs/metawrap-env/bin/simulator.py", line 1422, in main
    read_profile(ref_g, None, number, model_prefix, perfect, args.mode, strandness, None, False, dna_type)
  File "/home/huangtao/LJQ/conda/envs/metawrap-env/bin/simulator.py", line 270, in read_profile
    with open(model_prefix + "_strandness_rate", 'r') as strand_profile:
IOError: [Errno 2] No such file or directory: 'ssc_1M_strandness_rate'

I also noticed that the strandness_rate file is only produced by running read_analysis.py, so I am confused.
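The traceback shows simulator.py opening model_prefix + "_strandness_rate", so the value given to -c is a path prefix that gets a suffix appended, not a directory. A minimal sketch for checking a prefix before starting a run follows. The helper name `missing_model_files` is hypothetical, and the suffix list holds only the two file names that appear in tracebacks in this thread, so it is an assumption and not the full set of files a NanoSim model contains.

```python
import os

# Only the suffixes visible in this thread's tracebacks. A real model has
# more files, so this is an illustrative assumption, not a full manifest.
KNOWN_SUFFIXES = ("_strandness_rate", "_unaligned_length.pkl")


def missing_model_files(model_prefix, suffixes=KNOWN_SUFFIXES):
    """Return every path model_prefix + suffix that is not a file on disk."""
    candidates = [model_prefix + suffix for suffix in suffixes]
    return [path for path in candidates if not os.path.isfile(path)]


# Example: check the prefix used in the failing command above.
for path in missing_model_files("ssc_1M"):
    print("missing:", path)
```

An empty result only means the listed files exist; it does not prove the model is complete or compatible with the installed NanoSim version.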

cheny19 commented 4 years ago

Right, if you want to use your own model, you have to run step1 first. However, if you don't want to train your own model, you can direct -c to our pre-trained model (provided in the package), and run simulator.py. You just need to untar the pre-trained model, and specify the directory and prefix to -c option.

Jingquan-Li commented 4 years ago

> Right, if you want to use your own model, you have to run step1 first. However, if you don't want to train your own model, you can direct -c to our pre-trained model (provided in the package), and run simulator.py. You just need to untar the pre-trained model, and specify the directory and prefix to -c option.

I downloaded the human_NA12878_DNA_FAB49712_albacore.tar.gz you provided and then ran tar -xvzf human_NA12878_DNA_FAB49712_albacore.tar.gz, but this error occurred:

gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now

cheny19 commented 4 years ago

It happened to me yesterday as well. You'll need to clone the whole repo, or click into the pretrained model folder on GitHub and then click the model you want to use to download it. It seems GitHub has some sort of issue where the file is broken if you right-click to download it directly.
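Whatever the cause of the broken download, "not in gzip format" means the file does not begin with the two gzip magic bytes. A minimal sketch, using only the Python standard library, checks this before extracting. The function names `looks_like_gzip` and `extract_model` are hypothetical and are not part of NanoSim.

```python
import tarfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def looks_like_gzip(path):
    """Return True if the file starts with the gzip magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def extract_model(archive_path, dest="."):
    """Extract a .tar.gz model archive, failing early on a bad download."""
    if not looks_like_gzip(archive_path):
        raise ValueError("%s is not gzip data; download it again" % archive_path)
    # Only extract archives from a source you trust.
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(dest)
```

If `looks_like_gzip` returns False for the downloaded file, re-download it using one of the routes described above instead of retrying tar.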

Jingquan-Li commented 4 years ago

Thanks for your patient guidance! I have downloaded your trained model, but I encountered errors when I ran:

./NanoSim2.6.0/simulator.py genome -dna_type linear -rg 1M_12501.fa -c human_NA12878_DNA_FAB49712_albacore/training -max 90000 -min 20000 -n 1000

Traceback (most recent call last):
  File "./NanoSim2.6.0/simulator.py", line 1702, in <module>
    main()
  File "./NanoSim2.6.0/simulator.py", line 1599, in main
    read_profile(ref_g, None, number, model_prefix, perfect, args.mode, strandness, None, False, dna_type, None)
  File "./NanoSim2.6.0/simulator.py", line 411, in read_profile
    kde_unaligned = joblib.load(model_prefix + "_unaligned_length.pkl")
  File "/home/huangtao/LJQ/conda/envs/metawrap-env/lib/python2.7/site-packages/joblib/numpy_pickle.py", line 605, in load
    obj = _unpickle(fobj, filename, mmap_mode)
  File "/home/huangtao/LJQ/conda/envs/metawrap-env/lib/python2.7/site-packages/joblib/numpy_pickle.py", line 529, in _unpickle
    obj = unpickler.load()
  File "/home/huangtao/LJQ/conda/envs/metawrap-env/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "/home/huangtao/LJQ/conda/envs/metawrap-env/lib/python2.7/pickle.py", line 892, in load_proto
    raise ValueError, "unsupported pickle protocol: %d" % proto
ValueError: unsupported pickle protocol: 3

It seems that the (model_prefix + "_unaligned_length.pkl") file was generated by Python 3, but I am loading it with Python 2.7.
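Python 2.7 can read pickle protocols 0 to 2 only, while protocol 3 was introduced in Python 3, which fits the diagnosis above. A small sketch for confirming the protocol of a model file without unpickling it follows. The function name `pickle_protocol` is hypothetical, and the sketch assumes the .pkl file is an uncompressed pickle stream; a compressed joblib file would not start with the PROTO opcode and would return None here.

```python
def pickle_protocol(path):
    """Return the protocol number declared by a pickle file, or None.

    Pickles written with protocol 2 or higher start with the PROTO opcode
    (byte 0x80) followed by one byte holding the protocol number.
    Protocols 0 and 1 have no such header, so None is returned for them.
    """
    with open(path, "rb") as handle:
        head = handle.read(2)
    if len(head) == 2 and head[:1] == b"\x80":
        return ord(head[1:2])  # works on both Python 2 and Python 3
    return None
```

Calling `pickle_protocol("human_NA12878_DNA_FAB49712_albacore/training_unaligned_length.pkl")` and getting a value above 2 would indicate that the file cannot be loaded by Python 2.7.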

cheny19 commented 4 years ago

This issue has been reported by other users in #81. Could you try from sklearn.externals import joblib instead of import joblib and see if it still occurs?

cheny19 commented 3 years ago

Hi @Leejquan, we have an update on this issue. We have finally finished all the coding and testing needed to change the way model files are imported. We have also re-trained all the models, so we hope this problem is resolved in the NanoSim v3.0.0 pre-release. Please give it a shot and let me know how it works for you. Thanks for waiting so long.

RagnarGrootKoerkamp commented 2 years ago

> You just need to untar the pre-trained model, and specify the directory and prefix to -c option.

Could you update the readme example to include this information? Just trying human as written currently doesn't seem to work.

(Ideally, a path to somewhere inside the conda installation would be best, assuming they are already part of this.)

SaberHQ commented 2 years ago

> > You just need to untar the pre-trained model, and specify the directory and prefix to -c option.
>
> Could you update the readme example to include this information? Just trying human as written currently doesn't seem to work.
>
> (Ideally, a path to somewhere inside the conda installation would be best, assuming they are already part of this.)

Please note that the -c option in the simulation stage specifies the location and prefix of the error profiles generated in the characterization step (default = training). The name human that you mentioned from the README file is a symbolic name referring to the models trained on human data.

-c MODEL_PREFIX, --model_prefix MODEL_PREFIX

For more information on parameters for each mode in training and simulation stage, you may run: read_analysis.py -h or simulator.py -h. There are five modes in read_analysis.py and three modes in simulator.py.

I will take a note to update the README file to make it clear.

zhanghaoyu9931 commented 2 years ago

Hello, I want to simulate some ONT reads from bacterial and viral genomes. However, I noticed that your latest pre-trained models were trained on human datasets, which may have different sequence patterns from bacterial ones. Which pre-trained model should I use to get acceptable simulation results on my dataset?

SaberHQ commented 2 years ago

Hey @zhanghaoyu9931, I would highly recommend that you train your own model and use the trained profiles to simulate reads.

The README file is very informative and will guide you through running the training pipeline. It's fast and does not require high computing power. Please refer to the following code for more information:

https://github.com/bcgsc/NanoSim/blob/master/src/read_analysis.py