ModernElectron / WarpX

Modern Electron's WarpX repository serves as both a fork of the WarpX code (an advanced electromagnetic Particle-In-Cell code - see https://ecp-warpx.github.io) and the repository for a set of tools used in simulating thermionic devices (mewarpx).
https://mewarpx.readthedocs.io/en/latest/index.html

Matplotlib Fails To Acquire Lockfile in large MPI jobs #114

Closed peterscherpelz closed 2 years ago

peterscherpelz commented 2 years ago

On large MPI jobs, some processes may fail to acquire the matplotlib lock file:

  File "run_simulation.py", line 15, in <module>
    from mewarpx import diags, sim_control, runinfo
  File "/merunset/WarpX/mewarpx/mewarpx/diags.py", line 7, in <module>
    from mewarpx.diags_store.field_diagnostic import *
  File "/merunset/WarpX/mewarpx/mewarpx/diags_store/field_diagnostic.py", line 5, in <module>
    from mewarpx.utils_store import util, mwxconstants, plotting
  File "/merunset/WarpX/mewarpx/mewarpx/utils_store/plotting.py", line 1, in <module>
    import matplotlib.pyplot as plt
  File "/home/me_user/.local/lib/python3.8/site-packages/matplotlib/pyplot.py", line 49, in <module>
    import matplotlib.colorbar
  File "/home/me_user/.local/lib/python3.8/site-packages/matplotlib/colorbar.py", line 21, in <module>
    from matplotlib import _api, collections, cm, colors, contour, ticker
  File "/home/me_user/.local/lib/python3.8/site-packages/matplotlib/contour.py", line 13, in <module>
    from matplotlib.backend_bases import MouseButton
  File "/home/me_user/.local/lib/python3.8/site-packages/matplotlib/backend_bases.py", line 46, in <module>
    from matplotlib import (
  File "/home/me_user/.local/lib/python3.8/site-packages/matplotlib/textpath.py", line 8, in <module>
    from matplotlib import _text_helpers, dviread, font_manager
  File "/home/me_user/.local/lib/python3.8/site-packages/matplotlib/font_manager.py", line 1447, in <module>
    fontManager = _load_fontmanager()
  File "/home/me_user/.local/lib/python3.8/site-packages/matplotlib/font_manager.py", line 1442, in _load_fontmanager
    json_dump(fm, fm_path)
  File "/home/me_user/.local/lib/python3.8/site-packages/matplotlib/font_manager.py", line 1003, in json_dump
    with cbook._lock_path(filename), open(filename, 'w') as fh:
  File "/usr/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/home/me_user/.local/lib/python3.8/site-packages/matplotlib/cbook/__init__.py", line 1774, in _lock_path
    raise TimeoutError("""\
TimeoutError: Lock error: Matplotlib failed to acquire the following lock file:
    /home/me_user/.cache/matplotlib/fontlist-v330.json.matplotlib-lock
This maybe due to another process holding this lock file.  If you are sure no
other Matplotlib process is running, remove this file and try again.

https://github.com/matplotlib/matplotlib/blob/main/lib/matplotlib/cbook/__init__.py#L1748 shows the problem fairly clearly: _lock_path makes only 50 attempts to grab the lock file, sleeping a fixed 0.1 s between attempts. With 64 MPI processes all importing matplotlib at once here, it's plausible that some of them run out of retries.

My initial thought is to avoid importing matplotlib.pyplot on everything except proc 0. I'm not sure how much work this would be though. Thoughts?
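A rough sketch of what a rank-0-only import could look like, not what mewarpx actually does. It assumes the rank can be read from environment variables set by the launcher (OMPI_COMM_WORLD_RANK, PMI_RANK, PMIX_RANK, SLURM_PROCID); with mpi4py available, MPI.COMM_WORLD.Get_rank() would be the more direct choice. The helper names get_rank and import_pyplot_on_root are hypothetical.

```python
import importlib
import os

# Rank variables set by common launchers (Open MPI, MPICH/PMI, PMIx, Slurm).
_RANK_VARS = ("OMPI_COMM_WORLD_RANK", "PMI_RANK", "PMIX_RANK", "SLURM_PROCID")


def get_rank(environ=os.environ):
    """Return this process's MPI rank, or 0 if no launcher variable is set."""
    for name in _RANK_VARS:
        if name in environ:
            return int(environ[name])
    return 0


def import_pyplot_on_root(environ=os.environ):
    """Return matplotlib.pyplot on rank 0 and None on every other rank."""
    if get_rank(environ) != 0:
        return None
    # Imported lazily so that non-root ranks never touch the font cache lock.
    return importlib.import_module("matplotlib.pyplot")
```

Every plotting call site would then need to handle the None case, which is the bulk of the work this approach implies.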

PhilMiller commented 2 years ago

That retry logic uses a known-bad approach. It should randomize the sleep, so the failed processes don't all wake up at the same time and contend for the lock again.
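For illustration, a minimal sketch of a lock-file context manager with a randomized sleep. This is not matplotlib's code or a proposed patch; the function name lock_path_with_jitter, the ".lock" suffix, and the 0.5x to 1.5x jitter range are all assumptions chosen for the example.

```python
import contextlib
import random
import time
from pathlib import Path


@contextlib.contextmanager
def lock_path_with_jitter(path, retries=50, base_sleep=0.1):
    """Hold an exclusive lock file next to `path`, retrying with random sleeps."""
    path = Path(path)
    lock_file = path.with_name(path.name + ".lock")
    for _ in range(retries):
        try:
            # Mode "x" raises FileExistsError if the lock file already exists.
            with lock_file.open("xb"):
                break
        except FileExistsError:
            # Sleep a random time in [0.5, 1.5) * base_sleep so that waiting
            # processes drift apart instead of retrying in lockstep.
            time.sleep(base_sleep * (0.5 + random.random()))
    else:
        raise TimeoutError(f"could not acquire {lock_file}")
    try:
        yield lock_file
    finally:
        # Release the lock by removing the file.
        lock_file.unlink()
```

The average wait per attempt stays at base_sleep, so the total timeout is about the same as with a fixed sleep; only the synchronization between waiters changes.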

peterscherpelz commented 2 years ago

@PhilMiller can make a matplotlib PR if desired; @peterscherpelz will fix the imports.

PhilMiller commented 2 years ago

Possible solution easier than fixing the imports: run a serial script that does import matplotlib.pyplot before the MPI job. If that sets up the files in the cache, then there shouldn't be contention to generate them later

PhilMiller commented 2 years ago

I just checked that it will indeed create the cached file, and that the code will try to read from an existing cache file before trying to generate it.

PhilMiller commented 2 years ago

Incidentally, on a c6g.16xlarge AWS instance, I'm unable to reproduce the error in isolation, with commands like

for i in $(seq 20); do rm -rf ~/.cache/matplotlib/; mpiexec -np 60 --oversubscribe $(which python) -c "from matplotlib import font_manager"; done

PhilMiller commented 2 years ago

> Possible solution easier than fixing the imports: run a serial script that does import matplotlib.pyplot before the MPI job. If that sets up the files in the cache, then there shouldn't be contention to generate them later

That could be as simple as python -c "import matplotlib.pyplot"
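A minimal sketch of that warm-up step as a Python helper, in case it is more convenient to call it from a launch script than from the shell. The function names find_font_caches and warm_matplotlib_cache are hypothetical, and the "fontlist-*.json" pattern is an assumption based on the file name in the traceback above; matplotlib.get_cachedir() is the library call that reports the cache directory.

```python
import subprocess
import sys
from pathlib import Path


def find_font_caches(cache_dir):
    """Return the sorted font-list cache files found in cache_dir."""
    return sorted(Path(cache_dir).glob("fontlist-*.json"))


def warm_matplotlib_cache():
    """Import pyplot once in a single child process, then list the cache files."""
    # One serial import, so only this process ever competes for the lock.
    subprocess.run([sys.executable, "-c", "import matplotlib.pyplot"], check=True)
    # Imported lazily so this module can be loaded without matplotlib installed.
    import matplotlib
    return find_font_caches(matplotlib.get_cachedir())
```

Calling warm_matplotlib_cache() once before mpiexec and checking that it returns a non-empty list would be one way to confirm the cache is in place before the parallel job starts.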

peterscherpelz commented 2 years ago

> Possible solution easier than fixing the imports: run a serial script that does import matplotlib.pyplot before the MPI job. If that sets up the files in the cache, then there shouldn't be contention to generate them later
>
> That could be as simple as python -c "import matplotlib.pyplot"

@PhilMiller Do you think it would work / make sense for us to just put that command into the Dockerfile build? Then the image would already have the cache files included, I think?

PhilMiller commented 2 years ago

Yes, that would make sense

peterscherpelz commented 2 years ago

Cool, I'll make a PR to do that.

roelof-groenewald commented 2 years ago

Closing after merging #116.