OK, I found the problem: I was using GRU instead of LSTM for the transducer. Apparently GRU is not compatible with the RNNT loss on multi-GPU :(
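For reference, this is roughly the kind of change that fixed it for me (a sketch with plain Keras layers, not the actual TensorFlowASR code; the function and parameter names here are just for illustration):

```python
import tensorflow as tf

def make_prediction_rnn(units: int, rnn_type: str = "lstm"):
    # "rnn_type" is a hypothetical parameter used here only for illustration.
    if rnn_type == "lstm":
        # LSTM works fine for the transducer prediction network on multi-GPU.
        return tf.keras.layers.LSTM(units, return_sequences=True, return_state=True)
    # GRU trained fine on a single GPU for me, but broke the RNNT loss
    # once I switched to multi-GPU training.
    return tf.keras.layers.GRU(units, return_sequences=True, return_state=True)

rnn = make_prediction_rnn(320, rnn_type="lstm")  # use LSTM instead of GRU
```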
@DongyaoZhu I forgot to rename the RNN layer, so its name is always "lstm", which makes this hard to debug :laughing: I see GRU is getting deprecated; new papers always use LSTM since it performs better.
Environment: Google Cloud instance, Debian 10, TF 2.3, CUDA 11, 4x Tesla T4
Model: Conformer, trained with train_ga_subword_conformer.py
Config: batch size = 4, ga = 1, everything else the same as the provided config; training on LibriTTS train-clean-100, dev on LibriTTS test-other
Hello, I can train on a single GPU, but if I specify more GPUs in devices, it fails with the following traces:
It looks like the LSTM is reading out of bounds. Can someone help? Thanks in advance!
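For context, this is roughly how the multi-GPU run is set up (a sketch using plain TF 2.3 distribution APIs, not the exact internals of train_ga_subword_conformer.py; the explicit device list just reflects my four T4s):

```python
import tensorflow as tf

# Check what the runtime actually sees; I expect 4x Tesla T4 here.
gpus = tf.config.list_physical_devices("GPU")
print("Visible GPUs:", gpus)

# A single GPU trains fine; the failure only shows up when the model is
# built and trained under a multi-device MirroredStrategy like this.
strategy = tf.distribute.MirroredStrategy(
    devices=["/gpu:0", "/gpu:1", "/gpu:2", "/gpu:3"]
)

with strategy.scope():
    pass  # model, optimizer, and RNNT loss construction go here
```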