Open rondogency opened 4 years ago
@jarednielsen said he saw this once, but in his case it was because FSx ran out of space; here we are storing data on /scratch (NVMe) and FSx has plenty of space:
192.168.68.4@tcp:/fsx 1.1T 7.5G 1.1T 1% /shared
Will take Jared's suggestion to copy the data to FSx and run again to see whether the crash reproduces.
Update: the crash still reproduces with all of the data in the /fsx folder.
Update: halving the per-GPU batch size still results in the same crash.
We have seen TF2 ALBERT pretraining crash intermittently, roughly 1 out of every ~3 runs, using the latest Horovod training on 8 nodes; the crash happens around step 3000.
Error message:
Versions used: Horovod 0.20.0, TensorFlow 2.3
Launch command used (adapted from the config in the ALBERT README):