LLNL / merlin

Machine Learning for HPC Workflows

Step restart may not work with many workers and long run times #418

Open kustowski1 opened 1 year ago

kustowski1 commented 1 year ago

With multiple workers and long simulation run times, only half of the workers wake up after the step restart.

I am including the "dummy_simulation.txt" spec and the "workers.txt" batch script. To reproduce the problem on a system with Slurm, type:

merlin run dummy_simulation.txt
sbatch workers.txt

and, as soon as the job has started running and the tmp_restarttest*/dummy/ subdirectories have been created, type

more tmp_restarttest*/dummy/*log

and count how many samples reported "Restarting". All 8 samples should restart but only 4 of them do.
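If it helps, a rough Python check can do that count automatically; the glob pattern is just my guess at the layout described above and may need adjusting:

import glob

# Count how many sample logs report "Restarting" (all 8 should).
restarted = 0
for path in glob.glob("tmp_restarttest*/dummy/**/*log", recursive=True):
    with open(path, errors="ignore") as handle:
        if "Restarting" in handle.read():
            restarted += 1
print(f"{restarted} samples reported Restarting")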

Description of the workflow:
- run a "sleep 1" simulation
- restart (after a one-second delay)
- print a line including "Restarting"
- run a "sleep 200" simulation
- terminate: none of the restarted "sleep 200" simulations should finish, since the allocation is set to die after 3 minutes.

However, if the "merlin resources" block is removed from the spec and the test is repeated, all 8 samples report "Restarting", as expected. It may be worth comparing the Celery command that is executed in these two scenarios.

dummy_simulation.txt workers.txt

bgunnar5 commented 1 year ago

I started work on this last week, but I think we're going to have to revisit this at a later date (once the new merlin status command is released). Here's what I found so far, though:

Everything I've seen is leading me to believe this is a Celery issue, but we'll see if the new status command can provide us with more information.

bgunnar5 commented 1 year ago

Here are some links that may be helpful for this issue going forward:

This will require more research, but the discussion in these user issues seems similar to the problem here. The issue may be due to ETA/countdown with Celery tasks:

  1. Tasks are sitting in the queue with a countdown set and a prefetch multiplier of 1
  2. A task is picked up by a Celery worker
  3. That task takes a long time to complete, so the countdown timer may expire for other tasks in the queue (which can't be picked up, since the prefetch multiplier is 1)
  4. The long-running task eventually needs to retry, but the worker keeps it in its own memory rather than releasing it back to the queue
  5. Another worker frees up, but since the retried task is sitting in the other worker's memory, it can't be fetched by the free worker and never gets completed

This is my general understanding of the problem so far.
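To make that concrete, here is a minimal stand-alone Celery sketch of the pattern described in steps 1-5. It assumes merlin's step restart maps onto a Celery retry with a countdown; the app name, task name, and broker URL are purely illustrative, not merlin's actual internals:

import time

from celery import Celery

# Illustrative app; assumes a local RabbitMQ broker.
app = Celery("restart_repro", broker="amqp://localhost//")

# Step 1: each worker reserves only one task at a time.
app.conf.worker_prefetch_multiplier = 1

@app.task(bind=True, max_retries=1)
def dummy_step(self, sample_id):
    if self.request.retries == 0:
        time.sleep(1)  # short first-pass "simulation"
        # Re-publish the task with an ETA one second in the future.
        # Per steps 4-5 above, the suspicion is that this ETA copy stays
        # reserved in one worker's memory instead of going to an idle worker.
        raise self.retry(countdown=1)
    print(f"Restarting sample {sample_id}")
    time.sleep(200)  # long-running restarted "simulation"

Queueing 8 of these (dummy_step.delay(i) for i in range(8)) against a few workers started with something like celery -A restart_repro worker --concurrency=1 should show whether the retried tasks get redistributed or stay stuck, independent of merlin.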

Thank you @lucpeterson for the links.