Open LearnedVector opened 5 years ago
I have the same problem. Did you find a solution?
@yetiancn unfortunately no, I did not find a solution :/ Instead I just switched over to the tensorflow ctc implementation.
I've decided to try tensorflow ctc too. Thank you!
How to solve it?
Hey all, I am doing distributed training using tensorflow 1.12 and horovod 0.15.2 on 4 machines and 16 V100 GPUs with cuda 9.0 and cudnn 7.14. It trains fine, but at a specific iteration it runs into this weird error shown below.
Has anyone seen this specific error? It happens at the same iteration every time, which makes me suspect it has something to do with the data. But to figure out what's wrong with the data I need to decipher what this error message means internally inside warp_ctc. Any insight would be much appreciated!
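Since the failure recurs at the same iteration, one data-side check that may be worth running is the usual CTC length constraint: the input needs at least one time step per label, plus one extra blank step between each pair of adjacent identical labels. Whether this is what triggers the error here is not confirmed. The sketch below is plain Python and does not call warp_ctc; the function names and the `(input_length, labels)` sample layout are assumptions made for illustration.

```python
from typing import Iterable, List, Sequence, Tuple


def min_ctc_input_length(labels: Sequence[int]) -> int:
    """Smallest number of time steps that can emit `labels` under CTC.

    Each label needs one step, and every pair of adjacent identical
    labels needs one extra blank step between them.
    """
    repeats = sum(1 for a, b in zip(labels, labels[1:]) if a == b)
    return len(labels) + repeats


def find_infeasible_samples(
    samples: Iterable[Tuple[int, Sequence[int]]],
) -> List[int]:
    """Return indices of samples whose input is too short for their labels.

    `samples` is a hypothetical layout: (input_length, labels) per example,
    where input_length is the number of time steps fed to the CTC loss.
    """
    bad = []
    for idx, (input_length, labels) in enumerate(samples):
        if input_length < min_ctc_input_length(labels):
            bad.append(idx)
    return bad
```

Running `find_infeasible_samples` over the examples that make up the failing batch would at least rule this cause in or out before digging into warp_ctc internals.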