Closed: andrese52 closed this issue 4 years ago
Hi Andres,
The problem is with -sd. It is the sd of the log-normal distribution (i.e. on the log scale), not the sd of the read lengths themselves, so you will need to convert it according to the Wikipedia article on the log-normal distribution. If your sd is too large, it will generate some extremely long or short sequences, which are then discarded because they are longer than the genome size or shorter than the minimum threshold.
Let me know if you have further questions.
Chen
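For anyone landing here, the conversion Chen refers to can be sketched as below. This is only a sketch, under the assumption that -med is the median of a log-normal (so median = exp(mu)) and that -sd is its sigma on the natural-log scale; please check that against the README for your NanoSim version. The function names are made up for illustration and are not part of NanoSim.

```python
import math


def lognormal_sigma(median, sd):
    """Log-scale sigma of a log-normal with the given median and linear-scale sd.

    Uses Var = median^2 * x * (x - 1) with x = exp(sigma^2), solved for x.
    Hypothetical helper, not a NanoSim function.
    """
    if median <= 0 or sd < 0:
        raise ValueError("median must be > 0 and sd must be >= 0")
    ratio_sq = (sd / median) ** 2
    # Positive root of x^2 - x - (sd/median)^2 = 0
    x = (1.0 + math.sqrt(1.0 + 4.0 * ratio_sq)) / 2.0
    return math.sqrt(math.log(x))


def lognormal_sd(median, sigma):
    """Linear-scale sd implied by a median and a log-scale sigma (inverse of the above)."""
    x = math.exp(sigma ** 2)
    return median * math.sqrt(x * (x - 1.0))


if __name__ == "__main__":
    # Median 8000 with a linear-scale sd of 200
    print(lognormal_sigma(8000, 200))
    # Median 20000 with a linear-scale sd of 10000
    print(lognormal_sigma(20000, 10000))
    # What a log-scale sigma of 4 implies on the linear scale
    print(lognormal_sd(20000, 4))
```

Under these assumptions, a median of 8000 with a linear-scale sd of 200 corresponds to a log-scale sigma of roughly 0.025, and a median of 20000 with an sd of 10000 to roughly 0.43. Going the other way, a log-scale sigma of 4 at a median of 20000 implies a linear-scale sd on the order of 1e11 bases, which would explain why so many draws get discarded.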
Hi Chen,
Yes, could you please provide a working example for such cases? The default examples in the README.md do not include -med or -sd.
Say we want a median of 8000: what -sd would you suggest when simulating a genome of size 10 kb?
Thank you Andres
Sorry for the late reply. The standard deviation is independent of genome size, and it purely depends on how much you want the reads to spread. I'd suggest -sd to be 1.05 or 1.1 to start with.
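To get a feel for how much a given -sd spreads the reads, here is a rough sketch. It assumes the lengths follow a log-normal whose median is -med and whose natural-log-scale sigma is -sd, which is an assumption to verify against your NanoSim version, and it says nothing about how NanoSim itself handles the draws. The helper names are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def central_interval(median, sigma, z=1.96):
    """Range holding the central ~95% of log-normal lengths (for z = 1.96)."""
    return median * math.exp(-z * sigma), median * math.exp(z * sigma)


def fraction_too_long(median, sigma, max_len):
    """Fraction of log-normal draws that exceed max_len (e.g. the genome size)."""
    z = (math.log(max_len) - math.log(median)) / sigma
    return 1.0 - normal_cdf(z)


if __name__ == "__main__":
    median, genome_len = 8000, 10000
    for sigma in (0.025, 1.05, 1.1, 200):
        low, high = central_interval(median, sigma)
        too_long = fraction_too_long(median, sigma, genome_len)
        print(f"sigma={sigma}: central 95% = {low:.3g}..{high:.3g}, "
              f"fraction > {genome_len} = {too_long:.3f}")
```

With a median of 8000, a very large sigma such as 200 puts close to half of the draws above a 10 kb reference and most of the rest near zero length, while a small sigma keeps almost every draw close to the median.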
Hi @cheny19,
I have a similar issue. Compared to the default settings, simulator.py takes very long to run with -med 20000 -sd 4. I am trying to simulate reads with median=20kb and std=10kb. I would appreciate it if you could advise.
Many thanks, Hsin
@HLHsieh Can you please report your exact command?
@kmnip
I executed the following
~/bin/NanoSim/src/simulator.py genome -rg ~/mock_genome/D4Z4_p1.fasta -c ~/bin/NanoSim/pre-trained_models/human_NA12878_DNA_FAB49712_guppy/training -t 20 -n 2000000 -o D4Z4_p1_NanoSim_100x -med 20000 -sd 4 --seed 100 -b guppy
My goal is to simulate reads with a length distribution of median=20kb and std=10kb.
I also tried to execute that command with the default value of median and std, and it went smoothly.
~/bin/NanoSim/src/simulator.py genome -rg ~/mock_genome/D4Z4_p1.fasta -c ~/bin/NanoSim/pre-trained_models/human_NA12878_DNA_FAB49712_guppy/training -t 20 -n 2000000 -o D4Z4_p1_NanoSim_100x --seed 100 -b guppy
Please advise. Thanks!
Hi @kmnip,
I would like to follow up on this issue. Any suggestions would be appreciated.
PS. My version is 3.1.0.
Best, Hsin
@HLHsieh Let's continue in your other thread: https://github.com/bcgsc/NanoSim/issues/210
I did the characterization with E. coli and an SRA run from NCBI. Then, I used that generated profile in simulator.py. Everything works well if no -med and -sd are used. However, when I want a median of 8000 and sd of 200, the simulation gets stuck and takes very long. After a few hours, it uses all RAM and the job is killed by our HPC scheduler.
See below the code being used:
Any advice is greatly appreciated.