LLNL / merlin

Machine Learning for HPC Workflows

Step restart may not work with many workers and long run times #418

Open kustowski1 opened 1 year ago

kustowski1 commented 1 year ago

With multiple workers and long simulation run times, only half of the workers wake up after the step restart.

I am including the "dummy_simulation.txt" spec and the "workers.txt" batch script. To reproduce the problem on a system with Slurm, type:

merlin run dummy_simulation.txt
sbatch workers.txt

and, as soon as the job has started running and the tmp_restarttest*/dummy/ subdirectories have been created, type

more tmp_restarttest*/dummy/*log

and count how many samples reported "Restarting". All 8 samples should restart but only 4 of them do.
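If it helps, a rough Python check can do that count automatically; the glob pattern is just my guess at the layout described above and may need adjusting:

import glob

# Count how many sample logs report "Restarting" (all 8 should).
restarted = 0
for path in glob.glob("tmp_restarttest*/dummy/**/*log", recursive=True):
    with open(path, errors="ignore") as handle:
        if "Restarting" in handle.read():
            restarted += 1
print(f"{restarted} samples reported Restarting")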

Description of the workflow:
- run a "sleep 1" simulation
- restart (after a one-second delay)
- print a line including "Restarting"
- run a "sleep 200" simulation
- terminate: none of the restarted "sleep 200" simulations should finish, since the allocation is set to die after 3 minutes.

However, if the "merlin resources" block is removed from the spec and the test is repeated, all 8 samples report "Restarting", as expected. It may be worth comparing the Celery command that is executed in these two scenarios.

dummy_simulation.txt workers.txt

bgunnar5 commented 1 year ago

I started work on this last week, but I think we're going to have to revisit this at a later date (once the new merlin status command is released). Here's what I found so far, though:

Everything I've seen is leading me to believe this is a Celery issue, but we'll see if the new status command can provide us with more information.

bgunnar5 commented 1 year ago

Here are some links that may be helpful for this issue going forward:

This will require more research, but the discussion in these user issues seems similar to the problem here. The issue may be due to ETA/countdown with Celery tasks:

  1. Tasks are sitting in the queue with a countdown set and a prefetch multiplier of 1
  2. A task is picked up by a Celery worker
  3. That task takes a long time to complete, so the countdown timer may expire for other tasks in the queue (which can't be picked up, since the prefetch multiplier is 1)
  4. The long-running task eventually needs to retry, but the worker keeps it in its own memory rather than releasing it back to the queue
  5. Another worker frees up, but since the retried task is sitting in the other worker's memory, it can't be fetched by the free worker and never gets completed

This is my general understanding of the problem so far.
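To make that concrete, here is a minimal stand-alone Celery sketch of the pattern described in steps 1-5. It assumes merlin's step restart maps onto a Celery retry with a countdown; the app name, task name, and broker URL are purely illustrative, not merlin's actual internals:

import time

from celery import Celery

# Illustrative app; assumes a local RabbitMQ broker.
app = Celery("restart_repro", broker="amqp://localhost//")

# Step 1: each worker reserves only one task at a time.
app.conf.worker_prefetch_multiplier = 1

@app.task(bind=True, max_retries=1)
def dummy_step(self, sample_id):
    if self.request.retries == 0:
        time.sleep(1)  # short first-pass "simulation"
        # Re-publish the task with an ETA one second in the future.
        # Per steps 4-5 above, the suspicion is that this ETA copy stays
        # reserved in one worker's memory instead of going to an idle worker.
        raise self.retry(countdown=1)
    print(f"Restarting sample {sample_id}")
    time.sleep(200)  # long-running restarted "simulation"

Queueing 8 of these (dummy_step.delay(i) for i in range(8)) against a few workers started with something like celery -A restart_repro worker --concurrency=1 should show whether the retried tasks get redistributed or stay stuck, independent of merlin.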

Thank you @lucpeterson for the links.