Open kustowski1 opened 1 year ago
I started work on this last week but I think we're going to have to revisit this at a later date (once the new merlin status command is released). Here's what I found so far, though:
- The --concurrency 1 option fixes the issue, but I'm unsure whether this will scale.
- Omitting the --prefetch-multiplier option (i.e. using the default of 4) seems to make this work a little better, but still not every sample restarts.

Everything here is leading me to believe this is a Celery issue, but we'll see if the new status command can provide us with more information.
Here are some links that may be helpful for this issue going forward:
This will require more research but the discussion of these user issues seems to be similar to the problem here. The issue may be due to ETA/countdown with celery tasks:
This is my general understanding of the problem so far.
Thank you @lucpeterson for the links.
With multiple workers and long simulation run times, only half of the workers wake up after the step restart.
I am including the "dummy_simulation.txt" spec and the "workers.txt" batch script. To reproduce the problem on a system with slurm, type:
merlin run dummy_simulation.txt
sbatch workers.txt
and, as soon as the job has started running and the tmp_restarttest*/dummy/ subdirectories have been created, type
more tmp_restarttest*/dummy//*log
and count how many samples reported "Restarting". All 8 samples should restart but only 4 of them do.
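The manual count can also be scripted. The following is a minimal sketch, not part of Merlin: the helper name count_restarted is hypothetical, and it assumes each sample writes a file ending in "log" somewhere under the dummy/ directory, so the glob pattern may need adjusting to the actual layout.

```python
import glob


def count_restarted(pattern, marker="Restarting"):
    """Return (restarted, total) for the log files matching a glob pattern."""
    # "**" with recursive=True matches zero or more nested directories.
    paths = sorted(glob.glob(pattern, recursive=True))
    restarted = 0
    for path in paths:
        with open(path, errors="replace") as handle:
            if marker in handle.read():
                restarted += 1
    return restarted, len(paths)


if __name__ == "__main__":
    # Assumed layout: log files for each sample live below dummy/.
    restarted, total = count_restarted("tmp_restarttest*/dummy/**/*log")
    print(f"{restarted} of {total} log files contain 'Restarting'")
```

If a sample produces more than one file ending in "log", the totals count files rather than samples, so the pattern should be narrowed accordingly.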
Description of the workflow:
- run a "sleep 1" simulation
- restart (after a one-second delay)
- print a line including "Restarting"
- run a "sleep 200" simulation
- terminate: none of the restarted "sleep 200" simulations should finish since the allocation is set to die after 3 minutes.
However, if the "merlin resources" block is removed from the spec, and the test is repeated, all 8 samples report "Restarting", as expected. It may be worth comparing the celery command that is executed in these two scenarios.
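One way to compare the two worker commands is to parse each into an option-to-value mapping and list only the options that differ. This is a rough sketch: the helpers parse_options and diff_options are hypothetical, and the two command strings are placeholders, not captured from a real run. The actual commands would need to be pasted in, for example from the output of merlin run-workers --echo if the installed Merlin version supports that flag.

```python
import shlex


def parse_options(command):
    """Map each option in a command line to its value (True for bare flags)."""
    tokens = shlex.split(command)
    options = {}
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token.startswith("-"):
            has_value = i + 1 < len(tokens) and not tokens[i + 1].startswith("-")
            options[token] = tokens[i + 1] if has_value else True
            i += 2 if has_value else 1
        else:
            i += 1  # positional token such as "celery" or "worker"
    return options


def diff_options(command_a, command_b):
    """Return {option: (value_in_a, value_in_b)} for options that differ."""
    a, b = parse_options(command_a), parse_options(command_b)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


if __name__ == "__main__":
    # Placeholder commands for illustration only; replace with the real ones.
    with_resources = "celery -A app worker --concurrency 2 --prefetch-multiplier 1 -Q q"
    without_resources = "celery -A app worker -Q q"
    for option, values in diff_options(with_resources, without_resources).items():
        print(option, values)
```

The parser is deliberately simple: options written as --name=value are treated as a single flag, and a value that itself starts with a dash would be read as a separate option.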
dummy_simulation.txt workers.txt