GalSim-developers / GalSim

The modular galaxy image simulation toolkit. Documentation:
http://galsim-developers.github.io/GalSim/
Other
224 stars 105 forks source link

Let multiprocessing timeout interval be specifiable. #1180

Closed rmjarvis closed 2 years ago

rmjarvis commented 2 years ago

@myamamoto26 ran into a cryptic config error that looked like:

Traceback (most recent call last):
[... snip ...]
  File "/global/homes/m/myamamot/miniforge3/envs/new-eastlake-dev/lib/python3.9/site-packages/galsim/config/util.py", line 863, in MultiProcess
    res, k, t, proc = results_queue.get(timeout=300)
  File "/global/homes/m/myamamot/miniforge3/envs/new-eastlake-dev/lib/python3.9/multiprocessing/queues.py", line 114, in get
    raise Empty
_queue.Empty

Not super informative. But @beckermr figured out that it is due to the results_queue.get command timing out after 300 seconds. Masa was parallelizing over images and 300 seconds wasn't enough for the job. In retrospect, it was a bad idea to hard code this number. And furthermore, 5 minutes is not long enough for lots of use cases in the wild, so the default should be larger. Especially when parallelizing over large images, rather than over stamps, which is probably what I was thinking when I chose that value.

Anyway, this PR does the following:

  1. Changes the default timeout to 3600 when parallelizing over files and 900 when parallelizing over stamps. These seem more plausibly acceptable in most cases while still guarding against multiprocessing going dark (which can happen for various possible problems, which is why I have this at all) and hanging forever.
  2. Lets timeout be specifiable by the user in the config file, so if a user knows that their objects might take more than this, they can increase this value.
  3. Changes the error message to "Multiprocessing timed out waiting for a task to finish." with a suggestion to increase timeout if this seems like it might be required.

In addition, while working on item 3, I discovered PEP 409, which lets you re-raise a different exception without getting the annoying "During handling of the above exception, another exception occurred:" thing by adding from None to the raise line. There are quite a few places in GalSim where we do this, so I went through and added this to the places where it seemed that the original error context was not potentially useful to the user when trying to understand the error. (In some cases, especially OSErrors, it is useful to see the original error message, so I left those alone.)