innarid opened this issue 3 years ago
Could you try running without distributed training? Did all epochs 1-9 pass fine?
I got the same error with `-enable_distributed=false`. Yes, all 9 epochs passed fine.
Can you confirm that training still works if you rerun it from epoch 1?
I got an error (floating point exception) while training epoch 10; see error.txt. The log of epoch 9 looks fine: 009_log.txt. I tried changing the learning rate and the LR decay, but it didn't help. Could you please help me find the cause of this error? Thanks!
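One way to narrow down where a floating point error first appears, assuming the training loop is Python/NumPy-based (an assumption, since the project's stack isn't stated here), is to make NumPy raise an exception at the first bad operation instead of silently producing `inf`/`nan` that blows up later:

```python
import numpy as np

# Assumption: a NumPy-backed training loop. With seterr set to "raise",
# the first divide-by-zero, overflow, or invalid operation raises a
# FloatingPointError with a traceback pointing at the offending line,
# instead of propagating inf/nan into later epochs.
np.seterr(all="raise")

try:
    # Example of an operation that would otherwise just warn and yield inf,
    # e.g. a learning-rate or loss-normalization term whose denominator hit 0.
    result = np.array([1.0]) / np.array([0.0])
except FloatingPointError as exc:
    print("first bad op:", exc)
```

Running one epoch with this enabled (and, if the framework supports it, any equivalent anomaly-detection mode) often reveals whether the crash comes from a degenerate value such as a zero denominator or an overflowing exponent around epoch 10.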