SafeAILab / EAGLE

Official Implementation of EAGLE-1 and EAGLE-2
https://arxiv.org/pdf/2406.16858
Apache License 2.0
738 stars 74 forks source link

training sometimes crashes down #94

Closed Monoclinic closed 1 week ago

Monoclinic commented 1 month ago

Hello, I tried to train eagle on 4x A6000 and found that the training process could crash down sometimes. The error seems to be about multi-gpu connection timeout (>30min). I suspected it's because of crashing of one gpu, however all gpus works properly until the whole program is down (according to the temperature), it's strange to throw a timeout. image Although I think it's my own problem, I would appreciate it if you have any experience or advice to address this problem. Thank you!

Liyuhui-12 commented 1 month ago

Are there any error messages?