NVIDIA / enroot

A simple yet powerful tool to turn traditional container/OS images into unprivileged sandboxes.
Apache License 2.0

Matplotlib Deadlock in a Slurm job #168

Open szhengac opened 1 year ago

szhengac commented 1 year ago

Hi,

We just came across a deadlock issue when running NeMo training on Slurm with pyxis+enroot:

  File "/opt/NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py", line 19, in <module>
    from nemo.collections.nlp.models.language_modeling.megatron_gpt_model import MegatronGPTModel
  File "/opt/NeMo/nemo/collections/nlp/__init__.py", line 15, in <module>
    from nemo.collections.nlp import data, losses, models, modules
  File "/opt/NeMo/nemo/collections/nlp/data/__init__.py", line 42, in <module>
    from nemo.collections.nlp.data.zero_shot_intent_recognition.zero_shot_intent_dataset import (
  File "/opt/NeMo/nemo/collections/nlp/data/zero_shot_intent_recognition/__init__.py", line 16, in <module>
    from nemo.collections.nlp.data.zero_shot_intent_recognition.zero_shot_intent_dataset import (
  File "/opt/NeMo/nemo/collections/nlp/data/zero_shot_intent_recognition/zero_shot_intent_dataset.py", line 30, in <module>
    from nemo.collections.nlp.parts.utils_funcs import tensor2list
  File "/opt/NeMo/nemo/collections/nlp/parts/__init__.py", line 17, in <module>
    from nemo.collections.nlp.parts.utils_funcs import list2str, tensor2list
  File "/opt/NeMo/nemo/collections/nlp/parts/utils_funcs.py", line 24, in <module>
    from matplotlib import pyplot as plt
  File "/usr/local/lib/python3.10/dist-packages/matplotlib/pyplot.py", line 52, in <module>
    import matplotlib.colorbar
  File "/usr/local/lib/python3.10/dist-packages/matplotlib/colorbar.py", line 19, in <module>
    from matplotlib import _api, cbook, collections, cm, colors, contour, ticker
  File "/usr/local/lib/python3.10/dist-packages/matplotlib/contour.py", line 13, in <module>
    from matplotlib.backend_bases import MouseButton
  File "/usr/local/lib/python3.10/dist-packages/matplotlib/backend_bases.py", line 45, in <module>
    from matplotlib import (
  File "/usr/local/lib/python3.10/dist-packages/matplotlib/text.py", line 16, in <module>
    from .font_manager import FontProperties
  File "/usr/local/lib/python3.10/dist-packages/matplotlib/font_manager.py", line 1551, in <module>
    fontManager = _load_fontmanager()
  File "/usr/local/lib/python3.10/dist-packages/matplotlib/font_manager.py", line 1546, in _load_fontmanager
    json_dump(fm, fm_path)
  File "/usr/local/lib/python3.10/dist-packages/matplotlib/font_manager.py", line 958, in json_dump
    with cbook._lock_path(filename), open(filename, 'w') as fh:
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/matplotlib/cbook/__init__.py", line 1814, in _lock_path
    raise TimeoutError("""\
TimeoutError: Lock error: Matplotlib failed to acquire the following lock file:
    /root/.cache/matplotlib/fontlist-v330.json.matplotlib-lock
This maybe due to another process holding this lock file.  If you are sure no
other Matplotlib process is running, remove this file and try again.

I thought different processes would be using different container resources, but this deadlock seems to indicate otherwise. Any insight will be appreciated. Thanks!
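For context, the traceback above ends in a lock-file timeout rather than a classic deadlock: the process gave up waiting for a `.matplotlib-lock` file held (or left behind) by another process. The sketch below illustrates that general pattern with an exclusive-create lock file and bounded retries. It is a simplified illustration only; the name `lock_path`, the `.lock` suffix, and the retry values are assumptions chosen for the example and are not taken from matplotlib's source.

```python
import contextlib
import tempfile
import time
from pathlib import Path


@contextlib.contextmanager
def lock_path(path, retries=5, sleeptime=0.01):
    """Hold an exclusive lock file next to `path` (illustrative sketch)."""
    lock = Path(str(path) + ".lock")  # hypothetical suffix for this example
    for _ in range(retries):
        try:
            # Mode "x" raises FileExistsError if the lock file already exists.
            with lock.open("xb"):
                break
        except FileExistsError:
            time.sleep(sleeptime)
    else:
        # Every attempt found an existing lock file: give up.
        raise TimeoutError(f"failed to acquire lock file {lock}")
    try:
        yield
    finally:
        lock.unlink()


def demo():
    """Try to take the same lock twice, as two processes sharing a cache would."""
    with tempfile.TemporaryDirectory() as d:
        target = Path(d) / "fontlist.json"
        with lock_path(target):
            try:
                with lock_path(target):
                    return "acquired twice"
            except TimeoutError:
                return "timed out"


print(demo())
```

With a scheme like this, any two processes that resolve to the same cache path contend for the same lock file, whether they run on different nodes with a shared home directory or in the same container filesystem on one node.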

flx42 commented 1 year ago

If you have ENROOT_MOUNT_HOME y in /etc/enroot/enroot.conf, then /root will be mounted from your HOME, e.g. /home/felix -> /root. If your HOME is on a filesystem shared across compute nodes (like NFS), then the same folders might be reused across ranks / jobs.

You can edit the enroot configuration file, or add --no-container-mount-home to the srun command line.
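As a rough sketch of the command-line option, the snippet below assembles an `srun` invocation that passes `--no-container-mount-home` alongside pyxis's `--container-image`. The helper name `build_srun_command`, the `IMAGE` placeholder, and the `train.py` script are hypothetical and only serve the example; the command is printed, not executed.

```python
import shlex


def build_srun_command(image, user_command):
    """Return an srun argument list that does not mount HOME into the container."""
    return [
        "srun",
        f"--container-image={image}",   # pyxis option selecting the image
        "--no-container-mount-home",    # the flag mentioned above
        *user_command,
    ]


# "IMAGE" and "train.py" are placeholders, not values from this thread.
cmd = build_srun_command("IMAGE", ["python", "train.py"])
print(shlex.join(cmd))
```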

szhengac commented 1 year ago

@flx42 We already have --no-container-mount-home in the srun command.

szhengac commented 1 year ago

@flx42 any other idea?

flx42 commented 1 year ago

Do you have multiple processes running per node? In this case they will also share the same filesystem.

szhengac commented 1 year ago

There are 8 processes per node, with each using one GPU.

flx42 commented 1 year ago

Please file an issue against NeMo then. I'm not sure in which scenario you would need to call matplotlib from 8 processes simultaneously.

szhengac commented 1 year ago

Sure, I will file an issue with NeMo. PyTorch Lightning also imports matplotlib, which causes this timeout in distributed training with enroot.
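One application-side workaround, not proposed in the thread and offered here only as a sketch, is to give each rank its own matplotlib cache directory before matplotlib is first imported. It relies on matplotlib's `MPLCONFIGDIR` environment variable and on the Slurm variables `SLURM_JOB_ID` and `SLURM_PROCID`; the helper name `per_rank_mpl_dir` and the directory naming scheme are assumptions made for the example.

```python
import os
import tempfile


def per_rank_mpl_dir(base=None, env=os.environ):
    """Create and return a matplotlib config/cache directory unique to this rank."""
    job = env.get("SLURM_JOB_ID", "nojob")
    rank = env.get("SLURM_PROCID", "0")
    base = base or tempfile.gettempdir()
    # Hypothetical naming scheme: one directory per (job, rank) pair.
    path = os.path.join(base, f"mplconfig-{job}-{rank}")
    os.makedirs(path, exist_ok=True)
    return path


# Must run before the first `import matplotlib` anywhere in the process,
# including indirect imports through other libraries.
os.environ.setdefault("MPLCONFIGDIR", per_rank_mpl_dir())

# from matplotlib import pyplot as plt  # import only after MPLCONFIGDIR is set
```

Because each rank then builds its font cache in a separate directory, the ranks no longer compete for a single lock file, at the cost of each rank rebuilding the cache on first import.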