Open · szhengac opened 1 year ago
If you have `ENROOT_MOUNT_HOME y` in `/etc/enroot/enroot.conf`, then `/root` will be mounted from your `HOME`, e.g. `/home/felix -> /root`. And if your `HOME` is a filesystem shared across compute nodes (like NFS), then the same folders might be reused across ranks / jobs.

You can edit the enroot configuration file, or add `--no-container-mount-home` to the `srun` command line.
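For reference, a sketch of the two options mentioned above (the container image name and job script here are placeholders, not from this thread):

```shell
# Option 1: per-job, keep $HOME out of the container via the pyxis flag
srun --container-image=nvcr.io/nvidia/nemo \
     --no-container-mount-home \
     ./train.sh

# Option 2: disable home mounting globally in /etc/enroot/enroot.conf
# ENROOT_MOUNT_HOME n
```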
@flx42 We already have `--no-container-mount-home` in the `srun` command.
@flx42 any other idea?
Do you have multiple processes running per node? In this case they will also share the same filesystem.
There are 8 processes per node, with each using one GPU.
Please file an issue against NeMo then, I'm not sure in which scenario you need to call matplotlib on 8 processes simultaneously.
Sure, I will file an issue with NeMo. PyTorch Lightning also has an `import matplotlib`, which causes this timeout in distributed training with enroot.
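A common workaround for matplotlib lock contention on shared filesystems (not mentioned in this thread, so treat it as a suggestion) is to point matplotlib's config/cache directory at node-local storage, keyed by rank, via the `MPLCONFIGDIR` environment variable. This must happen before matplotlib is first imported:

```python
import os
import tempfile

# Give each Slurm task its own matplotlib config/cache directory on
# node-local storage, so the font-cache lock never lands on NFS.
# SLURM_PROCID is set by Slurm per task; default to "0" outside Slurm.
rank = os.environ.get("SLURM_PROCID", "0")
mpl_dir = os.path.join(tempfile.gettempdir(), f"matplotlib-rank-{rank}")
os.makedirs(mpl_dir, exist_ok=True)

# Must be set before the first "import matplotlib" anywhere in the process.
os.environ["MPLCONFIGDIR"] = mpl_dir
```

Placing this at the very top of the training entry point (before NeMo or PyTorch Lightning are imported) would keep the eight per-node processes from contending on the same lock file in the shared home directory.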
Hi,

We just came across a deadlock issue when using NeMo training on Slurm with pyxis+enroot:

I thought different processes would be using different container resources, but this deadlock seems to indicate something different from what I thought. Any insight would be appreciated. Thanks!