Hello, I tried to train eagle on 4x A6000 GPUs and found that the training process sometimes crashes. The error appears to be a multi-GPU connection timeout (>30 min).
I suspected that one GPU had crashed, but judging by the temperatures, all GPUs work properly until the whole program goes down, so it is strange that a timeout is thrown.
Although I think the problem is on my side, I would appreciate any experience or advice on how to address it. Thank you!