aws-samples / awsome-distributed-training

Collection of best practices, reference architectures, model training examples and utilities to train large models on AWS.

Add time sync checks across all nodes to verify nodes aren't drifting apart. #180

Closed: DarkSector closed this issue 3 months ago

DarkSector commented 7 months ago

In rare cases, PyTorch will time out due to drift in the system clocks across nodes. A pre-check would be useful to diagnose this issue before a training run starts.

Add a test to hyperpod-precheck.py.
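A minimal sketch of what such a per-node check might look like, assuming each node runs chrony (the default NTP client on Amazon Linux) and that the script is executed locally on every node; the 1-second threshold and function name are illustrative, not part of the existing precheck:

```python
import re
import subprocess

# Maximum tolerated offset from NTP time, in seconds (illustrative threshold).
MAX_OFFSET_SECONDS = 1.0

def check_clock_offset():
    """Parse `chronyc tracking` and fail if the local clock has drifted
    more than MAX_OFFSET_SECONDS from NTP time."""
    out = subprocess.run(
        ["chronyc", "tracking"], capture_output=True, text=True, check=True
    ).stdout
    # The line of interest looks like:
    #   System time     : 0.000001270 seconds fast of NTP time
    match = re.search(r"System time\s*:\s*([\d.]+) seconds (fast|slow)", out)
    if match is None:
        raise RuntimeError("Could not parse `chronyc tracking` output")
    offset = float(match.group(1))
    if offset > MAX_OFFSET_SECONDS:
        raise RuntimeError(
            f"Clock offset {offset:.3f}s exceeds {MAX_OFFSET_SECONDS}s"
        )
    print(f"Clock offset OK: {offset:.9f}s ({match.group(2)} of NTP time)")

if __name__ == "__main__":
    check_clock_offset()
```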

sean-smith commented 6 months ago

We can run the command:

```bash
srun -N 16 bash -c 'echo "$(hostname): $(date)"' | sort -k2,3
```

and then programmatically verify that the reported dates are within 1 second of each other.
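A rough sketch of that programmatic check, run from the head node. It swaps the default `date` output for `date +%s.%N` (epoch seconds, GNU date) so the timestamps are directly comparable; `NUM_NODES` and the 1-second threshold are illustrative, and since the timestamps are not captured simultaneously, srun launch skew makes this a coarse check:

```python
import subprocess

NUM_NODES = 16           # illustrative; match your cluster size
MAX_DRIFT_SECONDS = 1.0  # threshold suggested above

def check_node_time_drift():
    """Collect an epoch timestamp from every node via srun and verify
    the spread between the fastest and slowest clock is under the
    threshold. Launch skew is included, so this is a coarse check."""
    out = subprocess.run(
        ["srun", "-N", str(NUM_NODES), "bash", "-c",
         'echo "$(hostname) $(date +%s.%N)"'],
        capture_output=True, text=True, check=True,
    ).stdout
    times = {}
    for line in out.strip().splitlines():
        host, ts = line.split()
        times[host] = float(ts)
    drift = max(times.values()) - min(times.values())
    if drift > MAX_DRIFT_SECONDS:
        offenders = sorted(times.items(), key=lambda kv: kv[1])
        raise RuntimeError(
            f"Clock spread {drift:.3f}s exceeds {MAX_DRIFT_SECONDS}s "
            f"(earliest: {offenders[0][0]}, latest: {offenders[-1][0]})"
        )
    print(f"Clock spread across {len(times)} nodes: {drift:.3f}s (OK)")

if __name__ == "__main__":
    check_node_time_drift()
```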

github-actions[bot] commented 5 months ago

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] commented 3 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.