bcgsc / NanoSim

Nanopore sequence read simulator

simulator.py taking very long to run and RAM usage above 768 GB #76

Closed andrese52 closed 4 years ago

andrese52 commented 4 years ago

I did the characterization with E. coli and an SRA run from NCBI, then used the generated profile in simulator.py. Everything works well when neither -med nor -sd is set. However, when I ask for a median of 8000 and an sd of 200, the simulation stalls and takes very long. After a few hours, it uses all available RAM and the job is killed by our HPC scheduler.

See below the code being used:

simulator.py genome -n 2700 -med 8000 -sd 200 -r test-10kb.fasta -o genome-10kb -c nanosim_profile_new/ecoli --seed 974839895 -t 32

Any advice is greatly appreciated.

cheny19 commented 4 years ago

Hi Andres,

The problem is with -sd. It is the standard deviation of the log-normal distribution on the log scale, not the standard deviation of the read lengths themselves, so you will need to convert it according to the log-normal formulas on the wiki. If your sd is too large, it will generate some extremely long or short sequences, which are then discarded because they are longer than the genome or shorter than the minimum threshold.

Let me know if you have further questions.

Chen
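
The conversion mentioned above can be sketched as follows. This is a minimal sketch, assuming that -med is the median of a log-normal length distribution and that -sd is its sigma in natural-log units; the function names are hypothetical and not part of NanoSim. It uses the standard log-normal variance formula, Var = median^2 * exp(sigma^2) * (exp(sigma^2) - 1).

```python
import math

def sigma_from_median_sd(median, sd):
    """Log-scale sigma of a log-normal with the given median and linear-scale sd."""
    ratio_sq = (sd / median) ** 2
    # With x = exp(sigma^2), Var = median^2 * x * (x - 1),
    # so x solves x^2 - x - ratio_sq = 0 (take the positive root).
    x = (1.0 + math.sqrt(1.0 + 4.0 * ratio_sq)) / 2.0
    return math.sqrt(math.log(x))

def sd_from_median_sigma(median, sigma):
    """Inverse: linear-scale sd implied by a median and a log-scale sigma."""
    x = math.exp(sigma ** 2)
    return median * math.sqrt(x * (x - 1.0))

# Target from the question: median 8000 with a spread of 200 bases.
sigma = sigma_from_median_sd(8000, 200)
print(f"log-scale sigma = {sigma:.4f}")
```

Under these assumptions, a median of 8000 with a linear-scale sd of 200 corresponds to a log-scale sigma of roughly 0.025, several orders of magnitude smaller than the 200 passed on the command line.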

andrese52 commented 4 years ago

Hi Chen, Yes, could you please provide a working example for such cases? The default examples in the README.md do not include -med or -sd.

Say we want a median of 8000, what -sd would you suggest when having a genome size of 10kb to be simulated?

Thank you Andres

cheny19 commented 4 years ago

Sorry for the late reply. The standard deviation is independent of genome size, and it purely depends on how much you want the reads to spread. I'd suggest -sd to be 1.05 or 1.1 to start with.
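
To illustrate how the spread interacts with the discard behavior described earlier in the thread, the sketch below estimates what fraction of log-normal draws fall between a minimum length and the reference length. It assumes -sd is the sigma in natural-log units; the minimum length of 50 is a hypothetical placeholder rather than NanoSim's actual threshold, and the function names are illustrative.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def fraction_in_range(median, sigma, min_len, max_len):
    """Probability that a log-normal draw lies in [min_len, max_len]."""
    mu = math.log(median)  # the log of the median is the log-scale mean
    upper = normal_cdf((math.log(max_len) - mu) / sigma)
    lower = normal_cdf((math.log(min_len) - mu) / sigma)
    return upper - lower

# Median 8000 on a 10 kb reference, for a few candidate sigma values.
for sigma in (0.025, 1.05, 200.0):
    frac = fraction_in_range(8000, sigma, min_len=50, max_len=10_000)
    print(f"sigma={sigma}: {frac:.3f} of draws fall within [50, 10000]")
```

If out-of-range draws are discarded and redrawn, a small fraction here would mean many wasted draws per accepted read, which is consistent with the long run times reported above.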

HLHsieh commented 1 year ago

Hi @cheny19,

I ran into a similar issue. Compared with the default settings, simulator.py takes very long to run with -med 20000 -sd 4. I am trying to simulate reads with median=20kb and std=10kb. I would appreciate it if you could advise.

Many thanks, Hsin

kmnip commented 1 year ago

@HLHsieh Can you please report your exact command?

HLHsieh commented 1 year ago

@kmnip

I executed the following

~/bin/NanoSim/src/simulator.py genome -rg ~/mock_genome/D4Z4_p1.fasta -c ~/bin/NanoSim/pre-trained_models/human_NA12878_DNA_FAB49712_guppy/training -t 20 -n 2000000 -o D4Z4_p1_NanoSim_100x -med 20000 -sd 4 --seed 100 -b guppy

My goal is to simulate reads with distribution of median=20kb and std=10kb.

I also tried to execute that command with the default value of median and std, and it went smoothly.

~/bin/NanoSim/src/simulator.py genome -rg ~/mock_genome/D4Z4_p1.fasta -c ~/bin/NanoSim/pre-trained_models/human_NA12878_DNA_FAB49712_guppy/training -t 20 -n 2000000 -o D4Z4_p1_NanoSim_100x --seed 100 -b guppy

Please advise. Thanks!
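
As a rough check on what -sd 4 implies, the sketch below computes the central 95% interval of a log-normal distribution from its median and log-scale sigma. It assumes, following the earlier replies in this thread, that -sd is the sigma in natural-log units; `lognormal_interval` is a hypothetical helper, not a NanoSim function.

```python
import math
from statistics import NormalDist

def lognormal_interval(median, sigma, coverage=0.95):
    """Central interval of a log-normal given its median and log-scale sigma."""
    # z-score that leaves (1 - coverage) / 2 in each tail
    z = NormalDist().inv_cdf(0.5 + coverage / 2.0)
    return median * math.exp(-z * sigma), median * math.exp(z * sigma)

lo, hi = lognormal_interval(20_000, 4)
print(f"-sd 4: central 95% of lengths span {lo:.1f} to {hi:.3g} bases")
```

Under this reading, a sigma of 4 spreads read lengths from single digits to tens of millions of bases. For comparison, the log-normal variance formula gives a log-scale sigma of roughly 0.43 for a median of 20 kb and a linear-scale std of 10 kb.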

HLHsieh commented 2 months ago

Hi @kmnip,

I would like to follow up on this issue. Any suggestions would be appreciated.

PS. My version is 3.1.0.

Best, Hsin

kmnip commented 2 months ago

@HLHsieh Let's continue in your other thread: https://github.com/bcgsc/NanoSim/issues/210