esi-neuroscience / acme

Asynchronous Computing Made ESI
https://esi-acme.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

KeyError when trying to calc erredJobIDs #30

Closed KatharineShapcott closed 3 years ago

KatharineShapcott commented 3 years ago

Hi Stefan, I've seen this error multiple times now with two versions of acme. Let me know if you think it's something specific to my code and I can try to figure out what's causing it. Best, Katharine

tcp://10.100.32.4:44461' processes=147 threads=147, memory=2.35 TB>
<ParallelMap> INFO: Preparing 200 parallel calls of `comparison_preprocessing` using 200 workers
<ParallelMap> INFO: Log information available at /mnt/hpx/slurm/shapcottk/shapcottk_20210702-085114
  0% |                    | 0/200 [10:32<?]
<ParallelMap> INFO: <ACME> Exception received: <class 'KeyError'>: 5
Traceback (most recent call last):
  File "filter_net_paper.py", line 606, in <module>
    main()
  File "filter_net_paper.py", line 404, in main
    results = pmap.compute()
  File "/mnt/pns/home/shapcottk/python/filter_net_paper/scripts/acme/backend.py", line 420, in compute
    erredJobIDs = [self.client.cluster.workers[job].job_id for job in erredJobs]
  File "/mnt/pns/home/shapcottk/python/filter_net_paper/scripts/acme/backend.py", line 420, in <listcomp>
    erredJobIDs = [self.client.cluster.workers[job].job_id for job in erredJobs]
KeyError: 5

[screenshot of the output above attached]
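For reference, the line reported in the traceback indexes self.client.cluster.workers directly with each entry of erredJobs, so any entry that is not a key of that dict raises exactly this KeyError. A minimal standalone illustration (the worker keys and job IDs here are made up, not taken from ACME):

# Toy example: a dict lookup with a missing key reproduces the error.
workers = {0: "4711_0", 1: "4711_1", 7: "4711_7"}    # keys need not be 0..N-1
erredJobs = [5]
erredJobIDs = [workers[job] for job in erredJobs]    # raises KeyError: 5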

pantaray commented 3 years ago

Hi Katharine! Hm, I think you're right: this looks like a bug in ACME's mechanism for fetching the IDs of crashed jobs. I checked the logs at the location shown in the screenshot and found multiple instances of

FileNotFoundError: [Errno 2] No such file or directory: '/mnt/hpx/slurm/shapcottk/shapcottk_20210702-085114/dask-worker-space/worker-qhea06fx/storage/ndarray-17faf5ac06156c51d7a8e039c0628027'
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting

So it seems the worker crashes were indeed triggered by something in your code, but ACME's error handling then crashed on top of it. I'll look into this - thanks for reporting!

All the best, Stefan

KatharineShapcott commented 3 years ago

Hi Stefan, Thanks for checking! I think it might have happened another time when my jobs timed out in the XS queue, but I wasn't really paying attention. Best, Katharine

pantaray commented 3 years ago

Hey Katharine! It seems that SLURMCluster does not necessarily number its workers starting with 0, which is what caused the bug you ran into. I've included a workaround in my latest push to the dev branch (44c369f). Feel free to test-drive the updated version when you have time and see if the problem persists.
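Until that lands in a release, a defensive sketch of the failing lookup (hypothetical, not the actual workaround in 44c369f; it assumes cluster.workers maps worker keys to job objects exposing a job_id attribute, as in the traceback above) would simply skip entries that are no longer present in the workers dict:

# Hypothetical drop-in variant of the line from backend.py shown in the traceback:
# only look up worker keys that still exist in the cluster's workers dict.
workers = self.client.cluster.workers
erredJobIDs = [workers[job].job_id for job in erredJobs if job in workers]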