Closed by peterscherpelz 2 years ago
That retry logic uses a known-bad approach. It should randomize the sleep, so the failed processes don't all wake up at the same time and contend for the lock again.
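For illustration, here is a minimal sketch of what a randomized retry could look like. This is not matplotlib's code: the function name `retry_with_jitter`, its parameters, and the delay values are all made up for the example, and `acquire` stands in for whatever callable tries to take the lock.

```python
import random
import time


def retry_with_jitter(acquire, retries=50, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep, rng=random.random):
    """Call acquire() until it returns True or the retries run out.

    Hypothetical helper: between attempts it sleeps for a random time in
    [0, cap), where cap doubles each attempt up to max_delay, so waiting
    processes do not all wake up at the same moment.
    """
    for attempt in range(retries):
        if acquire():
            return True
        if attempt < retries - 1:
            # Exponentially growing cap, bounded by max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Randomized ("jittered") sleep below the cap.
            sleep(rng() * cap)
    return False
```

The `sleep` and `rng` arguments are only there so the behavior can be checked without real waiting; in normal use the defaults would do.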
@PhilMiller can make a matplotlib PR if desired; @peterscherpelz will fix the imports.
Possible solution easier than fixing the imports: run a serial script that does import matplotlib.pyplot
before the MPI job. If that sets up the files in the cache, then there shouldn't be contention to generate them later
I just checked that it will indeed create the cached file, and that the code will try to read from an existing cache file before trying to generate it.
Incidentally, on a c6g.16xlarge AWS instance, I'm unable to reproduce the error in isolation, with commands like
for i in $(seq 20); do rm -rf ~/.cache/matplotlib/; mpiexec -np 60 --oversubscribe $(which python) -c "from matplotlib import font_manager"; done
Possible solution easier than fixing the imports: run a serial script that does import matplotlib.pyplot before the MPI job. If that sets up the files in the cache, then there shouldn't be contention to generate them later
That could be as simple as python -c "import matplotlib.pyplot"
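If it is more convenient to do the warm-up from a Python launcher script than from the shell, something along these lines might work. It is only a sketch: `warm_import_cache` is a hypothetical helper name, and it assumes that importing the module in a fresh serial interpreter is enough to populate the cache, as checked above.

```python
import subprocess
import sys


def warm_import_cache(module_name="matplotlib.pyplot"):
    """Import module_name once in a separate, serial Python process.

    Hypothetical helper: returns True if the import succeeded, so the
    caller can decide whether to go ahead and launch the MPI job.
    """
    result = subprocess.run(
        [sys.executable, "-c", f"import {module_name}"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```

Calling `warm_import_cache()` before `mpiexec` would then play the same role as the one-line command.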
@PhilMiller Do you think it would work / make sense for us to just put that command into the Dockerfile build? Then the image would already have the cache files included, I think?
Yes, that would make sense
Cool, I'll make a PR to do that.
Closing after merging #116.
On large MPI jobs, some processes may fail to obtain a matplotlib lockfile:
https://github.com/matplotlib/matplotlib/blob/main/lib/matplotlib/cbook/__init__.py#L1748 shows the problem fairly clearly: it makes only 50 attempts to grab the lock file. With 64 processes here, it's plausible that some of them run out of attempts.
My initial thought is to avoid importing matplotlib.pyplot on everything except proc 0. I'm not sure how much work this would be, though. Thoughts?
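A rough sketch of the rank-0-only idea, to make the discussion concrete. `import_on_root` is a hypothetical helper, not something in our code base, and how the rank is obtained is left to the caller.

```python
import importlib


def import_on_root(rank, module_name="matplotlib.pyplot", root=0):
    """Import module_name only on the root rank.

    Hypothetical helper: returns the imported module on the root rank and
    None everywhere else, so callers must guard any plotting code with a
    check for None.
    """
    if rank != root:
        return None
    return importlib.import_module(module_name)
```

The rank could come from, for example, mpi4py's `MPI.COMM_WORLD.Get_rank()`, assuming mpi4py is available. The cost of this approach is that every place that uses pyplot has to cope with it being `None` on the other ranks, which is where most of the work would be.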