Open lidavid88 opened 10 months ago
If the issue happens 0% of the time on some nodes and 100% of the time on others, I suggest you start by investigating the differences between the good nodes and the bad nodes: check dmesg and the slurmd log on the bad nodes for any clues.
They seem to have the same versions and drivers.
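If it helps, here is a rough sketch of one way to compare logs between a good node and a bad node. It assumes you have already saved the dmesg output (or the slurmd log) from each node as text. The keyword list and the function names are my own illustrative choices, not part of Slurm, pyxis, or enroot, and the sample log lines in the usage example are made up.

```python
import re

# Keywords that often show up next to trouble in kernel or slurmd logs.
# This list is an assumption for illustration; adjust it for your site.
SUSPICIOUS = re.compile(r"(error|fail|xid|oom|segfault)", re.IGNORECASE)

# dmesg-style timestamp prefix such as "[   12.345678] ".
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def suspicious_lines(log_text):
    """Return suspicious log lines with any leading timestamp removed."""
    result = []
    for line in log_text.splitlines():
        if SUSPICIOUS.search(line):
            result.append(TIMESTAMP.sub("", line).strip())
    return result


def only_on_bad(good_log, bad_log):
    """Return suspicious lines present on the bad node but not the good one."""
    good = set(suspicious_lines(good_log))
    return [line for line in suspicious_lines(bad_log) if line not in good]


if __name__ == "__main__":
    # Made-up sample logs, only to show how the helpers are called.
    good_log = "[    1.000] usb 1-1: new device\n[    2.000] EXT4-fs error (device sda1)\n"
    bad_log = (
        "[    1.500] usb 1-1: new device\n"
        "[    3.200] EXT4-fs error (device sda1)\n"
        "[   10.100] NVRM: Xid sample message\n"
    )
    for line in only_on_bad(good_log, bad_log):
        print(line)
```

Stripping the timestamps matters because the same message is logged at different times on different nodes, so a plain line-by-line comparison would flag everything as different.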
I have run into a problem when running a container on a cluster.
I am using an NVIDIA PyTorch container created with enroot in the following submit script:
On most nodes, srun runs and 0 is printed to the log.
On the other nodes, I get two types of errors:
1.
2.
These errors do not appear if I use at most 4 nodes.
With 8 nodes the job occasionally works, but most of the time I get errors on some nodes.
My guess is that inter-node communication is having trouble with pyxis.
Can someone help me with that?
Regards