KatharineShapcott closed this issue 3 years ago
Hi Katharine! Hm, I think you're right, this looks like a bug in ACME's mechanism for fetching the IDs of crashed jobs. I checked the logs at the location given in the screenshot and found multiple instances of:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/hpx/slurm/shapcottk/shapcottk_20210702-085114/dask-worker-space/worker-qhea06fx/storage/ndarray-17faf5ac06156c51d7a8e039c0628027'
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
So it seems the worker crashes were indeed triggered by something in your code, but ACME's error handling then crashed on top of them. I'll look into this - thanks for reporting!
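In the meantime, one thing that sometimes helps with the "exceeded 95% memory budget" restarts is simply giving each worker more RAM. Here's a minimal sketch using dask-jobqueue directly (not ACME's own setup code; the queue name, memory size and core count are placeholders, not values taken from your setup):

```python
# Sketch with placeholder values - adjust to your cluster's partitions/limits.
from dask_jobqueue import SLURMCluster
from distributed import Client

cluster = SLURMCluster(
    queue="8GBS",          # placeholder partition name
    cores=1,               # one core per SLURM job
    memory="16GB",         # raise this if workers keep blowing their memory budget
    walltime="01:00:00",   # generous walltime so jobs don't time out in the queue
)
cluster.scale(jobs=10)     # ask SLURM for 10 worker jobs
client = Client(cluster)
```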
All the best, Stefan
Hi Stefan, Thanks for checking! I think it might have happened another time when my jobs timed out in the XS queue, but I wasn't really paying attention. Best, Katharine
Hey Katharine! It seems that SLURMCluster is not necessarily numbering its workers starting with 0, which causes the bug you ran into. I've included a workaround in my latest push to the dev branch (44c369f). Feel free to test-drive the updated version when you have time to see if the problem persists.
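For context, the gist of the workaround is to stop assuming workers are numbered 0..N-1 and instead key the bookkeeping by whatever worker records the scheduler actually reports. A rough sketch of that idea (not the actual ACME code; the helper and the `expected_names` collection are made up for illustration):

```python
# Sketch only - never derive worker IDs from range(n_workers);
# use what the scheduler actually reports instead.
from distributed import Client

def crashed_worker_names(client: Client, expected_names):
    """Return the expected worker names the scheduler no longer knows about."""
    alive = {info["name"] for info in client.scheduler_info()["workers"].values()}
    return [name for name in expected_names if name not in alive]
```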
Hi Stefan, I've seen this error multiple times now with two versions of ACME. Let me know if you think it's something specific to my code and I can try and figure out what's causing it. Best, Katharine