Closed DarkSector closed 3 months ago
We can run the command:
srun -N 16 bash -c 'echo "$(hostname): $(date)"' | sort -k2,3
Then programmatically check that the reported times are within 1 second of each other.
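One possible way to do the programmatic check, sketched in Python since the target is hyperpod-precheck.py. This is only a sketch under assumptions: it swaps `date` for `date +%s.%N` (GNU date) so each node prints epoch seconds that are easy to compare, and the function names and the `tolerance_s` parameter are hypothetical, not taken from the existing script. Also note that srun does not start every task at exactly the same instant, so part of the measured spread is launch latency rather than clock drift.

```python
import subprocess


def parse_node_times(output):
    """Parse lines of the form '<hostname>: <epoch seconds>' into a dict."""
    times = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        host, _, stamp = line.partition(": ")
        times[host] = float(stamp)
    return times


def max_clock_skew(times):
    """Return the spread in seconds between the earliest and latest reading."""
    if not times:
        raise ValueError("no timestamps collected")
    values = times.values()
    return max(values) - min(values)


def check_clock_skew(num_nodes=16, tolerance_s=1.0):
    """Run date on every node via srun and compare the readings.

    Assumption: GNU date is available on the nodes, so %s.%N yields
    epoch seconds with nanoseconds. Returns (ok, skew_seconds, times).
    """
    cmd = [
        "srun", "-N", str(num_nodes),
        "bash", "-c", 'echo "$(hostname): $(date +%s.%N)"',
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    times = parse_node_times(result.stdout)
    skew = max_clock_skew(times)
    return skew <= tolerance_s, skew, times
```

Calling `check_clock_skew()` inside the pre-check and failing when the first returned value is false would flag the problem before training starts. The parsing and comparison are kept separate from the srun call so they can be tested without a cluster.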
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.
In rare cases, PyTorch will time out because of system clock drift across nodes. A pre-check may be useful for diagnosing this issue before a training run starts.
Add a test to hyperpod-precheck.py.