dportik / dadi_pipeline

An accessible and flexible tool for fitting demographic models with dadi using custom or published models, conducting goodness-of-fit tests, and plotting.
GNU Lesser General Public License v3.0

Question about the simulations #10

Closed: spflanagan closed this issue 3 years ago

spflanagan commented 5 years ago

Hi Daniel,

First, thank you for this excellent pipeline! I'm currently running simulations for a 2D model and have two questions:

  1. I've noticed that there is a lot of variability in the analysis time for the simulations, ranging anywhere from 1.5 hours to over 24 hours. Is this expected behavior for the program?
  2. If I need to stop the simulations partway through (e.g., after 30 simulation runs out of 100), is it possible to start them back up and continue the numbering where it left off? Or do I need to move the initial ~30 runs to another directory and start the simulations again, but only run 70 simulations? Or is it not recommended to interrupt the simulations at all?

All the best, Sarah

dportik commented 5 years ago

Hi Sarah, I'm glad you have found it useful for your work. I have some answers to your questions below:

  1. Is this the time for one replicate, or for the full 100 replicates? In general, the simulations should be a little faster than the empirical fitting. However, this is relative, and more complex models will always take longer to fit. I would first compare the time to completion for the simulations against the empirical optimization routine; if the simulations take much longer, there may be something else going on.

  2. You could manually change the starting number in the Optimize_Function_GOF.py script if you want to pick up right where you left off. For your specific example after running 30, leave the number of sims at 100 and just change the starting point in line 359:

for i in range(1,(sims+1)):

to

for i in range(31,(sims+1)):

and the analysis will pick up at 31 and continue to 100. However, it is likely that an additional header line will be inserted into the output summary file (Simulation_Results.txt), and you will need to remove it before using the R plotting script.
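Removing the extra header line by hand is easy for one file, but it could also be scripted. The following is a minimal sketch, not part of the pipeline: it assumes the first line of Simulation_Results.txt is the header and that any inserted header is an exact copy of it. The function names `drop_repeated_headers` and `clean_results_file` are hypothetical.

```python
from pathlib import Path


def drop_repeated_headers(lines):
    """Return lines with any later copy of the first (header) line removed.

    Assumption: the first line is the header and repeats are exact copies.
    """
    if not lines:
        return []
    header = lines[0]
    return [header] + [line for line in lines[1:] if line != header]


def clean_results_file(in_path, out_path):
    """Write a copy of in_path to out_path without repeated header lines."""
    lines = Path(in_path).read_text().splitlines()
    cleaned = drop_repeated_headers(lines)
    # Write to a separate file so the original summary is left untouched.
    Path(out_path).write_text("".join(line + "\n" for line in cleaned))
```

For example, `clean_results_file("Simulation_Results.txt", "Simulation_Results_clean.txt")` would leave the original file in place and produce a cleaned copy, under the assumptions above, for the R plotting script.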

If the simulations are not running fast enough, you could also run four instances of the script in different directories (e.g., with 25 simulations each), then combine the output files manually. I don't think this would interfere with the R plotting, but you could always manually renumber the simulations in the output file if it does. Because dadi does not use multithreading, the fastest way to speed up the simulations is to break them into smaller jobs and use all the cores you have available. Make sense?
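Combining the output files from several directories can be sketched in a few lines of Python. This is an illustration only, not part of the pipeline: it assumes each summary file is tab-delimited with an identical header on its first line. The function name `merge_results` and the `renumber_column` parameter are hypothetical; renumbering only makes sense if that column holds a plain simulation number, which is an assumption about the file layout.

```python
def merge_results(tables, renumber_column=None):
    """Merge several result tables into one list of lines.

    tables: list of tables, each a list of lines with the header first.
    renumber_column: optional index of a tab-delimited column to overwrite
        with a running count (1, 2, 3, ...) across all tables.
    """
    merged = []
    count = 0
    for lines in tables:
        if not lines:
            continue
        header, rows = lines[0], lines[1:]
        if not merged:
            merged.append(header)  # keep exactly one header
        elif header != merged[0]:
            raise ValueError("tables do not share the same header")
        for row in rows:
            # Skip blank lines and stray copies of the header.
            if not row.strip() or row == header:
                continue
            count += 1
            if renumber_column is not None:
                fields = row.split("\t")
                fields[renumber_column] = str(count)
                row = "\t".join(fields)
            merged.append(row)
    return merged
```

Each table could be read with something like `Path(d, "Simulation_Results.txt").read_text().splitlines()` for each run directory `d`, and the merged lines written back out with a trailing newline per line.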