Closed: Honghe closed this issue 4 years ago.
Hi @Honghe, the cause is an incorrect installation of the RNNT loss. Remember to export CUDA_HOME=/usr/local/cuda before running scripts/install_rnnt_loss.sh.
After rebuilding warprnnt_tensorflow with CUDA, it seems to be running correctly.
[Train] [Epoch 1/20] | | 56/142680 [00:42<6:16:25, 6.31batch/s, transducer_loss=2464.3044]
But it will take about 6 hours to train one epoch; is this normal? Environment: i7 CPU, one 2080 Ti GPU.
@Honghe in your case, the 6 hours is the total for all 20 epochs, not for one epoch: the progress bar counts 142680 steps across the whole run, and 142680 steps at 6.31 batch/s is roughly 6.3 hours.
@usimarit running with --tbs 8 --ebs 8 makes TensorFlow hit a CUDA out-of-memory error after a few steps; is there a memory leak somewhere? A batch size of 8 is already quite small, so how can I effectively increase the batch size? The full command:
```bash
python examples/conformer/train_subword_conformer.py --tbs 8 --ebs 8 --mxp --devices 0 --cache --subwords ./output/librispeed \
--subwords_corpus \
/home/ubuntu/Data/LibriSpeechConformer/LibriSpeech/train-clean-100/transcripts.tsv \
/home/ubuntu/Data/LibriSpeechConformer/LibriSpeech/test-clean/transcripts.tsv \
/home/ubuntu/Data/LibriSpeechConformer/LibriSpeech/dev-clean/transcripts.tsv \
/home/ubuntu/Data/LibriSpeechConformer/LibriSpeech/dev-other/transcripts.tsv
```
@Honghe I'm working on gradient accumulation to increase the effective batch size. Technically, a small batch size is normal when training ASR; there is no memory leak. This model simply requires a lot of memory (due to multi-head self-attention), regardless of its small number of parameters.
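For illustration, here is a minimal sketch of what gradient accumulation looks like in plain TensorFlow. This is not TensorFlowASR's implementation; the names (accumulated_step, accum_steps, loss_fn) are made up for the example.

```python
import tensorflow as tf

# Minimal gradient-accumulation sketch (illustrative only). Gradients from
# several small "micro-batches" are summed before a single optimizer update,
# so a GPU that only fits batch size 8 behaves roughly like 8 * accum_steps.
accum_steps = 4
optimizer = tf.keras.optimizers.Adam(1e-4)

def accumulated_step(model, loss_fn, micro_batches):
    """micro_batches: a list of `accum_steps` (features, labels) pairs."""
    accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]
    for features, labels in micro_batches:
        with tf.GradientTape() as tape:
            predictions = model(features, training=True)
            # Divide by accum_steps so the summed gradient matches the mean
            # gradient of one large batch.
            loss = loss_fn(labels, predictions) / accum_steps
        grads = tape.gradient(loss, model.trainable_variables)
        accum_grads = [
            a + g if g is not None else a
            for a, g in zip(accum_grads, grads)
        ]
    optimizer.apply_gradients(zip(accum_grads, model.trainable_variables))
```

The trade-off is wall-clock time: each optimizer update now runs accum_steps forward/backward passes, but peak memory stays at the micro-batch size.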
I already installed the RNNT loss with export CUDA_HOME=/usr/local/cuda and scripts/install_rnnt_loss.sh, but the transducer_loss is still negative. Can anyone help me? Thanks.
@namdn Please use the rnnt_loss implemented in TF; warp-transducer is deprecated.
@usimarit how can I run without warp-transducer? I mean I did what the README says: "For training Transducer Models with RNNT Loss in TF, make sure that warp-transducer is not installed (by simply running pip3 uninstall warprnnt-tensorflow)." But the transducer_loss is still negative.
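One way to double-check that the warp-transducer binding is really gone from the environment is simply to try importing it. The assumption here, taken from the README line quoted above, is that the pure-TF RNNT loss is used whenever warprnnt_tensorflow cannot be imported.

```python
# Sanity check: which RNNT loss backend would be picked up?
# Assumption (from the README): the pure-TF rnnt_loss is used only when the
# warprnnt_tensorflow package cannot be imported.
try:
    import warprnnt_tensorflow  # noqa: F401
    print("warprnnt_tensorflow is importable -> the warp-transducer loss would be used.")
    print("Run: pip3 uninstall warprnnt-tensorflow")
except ImportError:
    print("warprnnt_tensorflow not found -> the pure-TF RNNT loss will be used.")
```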
@namdn then please check your dataset and vocabulary file; make sure the vocabulary file covers all of the characters or subwords in your dataset.
@namdn if you are using subwords and tfrecords, make sure you pass the same subword file that was used to generate the tfrecords.
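A rough script like the one below can check the first point (vocabulary coverage). It assumes the character vocabulary is a plain text file with one token per line (lines starting with # treated as comments) and that the transcript is the last tab-separated column of transcripts.tsv; the paths are placeholders, not from this thread.

```python
# Rough vocabulary-coverage check (assumptions: one token per line in the
# vocabulary file, '#' lines are comments, transcript is the last
# tab-separated column of each transcripts.tsv). Paths are placeholders.
vocab_path = "my_language.characters"
transcript_paths = [
    "/path/to/train-clean-100/transcripts.tsv",
]

with open(vocab_path, encoding="utf-8") as f:
    vocab = {line.rstrip("\n") for line in f if line.strip() and not line.startswith("#")}

missing = set()
for path in transcript_paths:
    with open(path, encoding="utf-8") as f:
        next(f, None)  # skip a header row if the file has one
        for line in f:
            text = line.rstrip("\n").split("\t")[-1]
            missing.update(ch for ch in text if ch not in vocab)

if missing:
    print("Characters missing from the vocabulary:", sorted(missing))
else:
    print("All transcript characters are covered by the vocabulary.")
```

If anything is reported, out-of-vocabulary characters in the labels are one plausible cause of the strange transducer_loss values described above.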
Now I am using CharFeaturizer as the text featurizer and the script train_conformer.py for training. In config.yml, should the value of decoder_config.vocabulary be a language.characters file? I set it to my language's characters file and the loss is still negative. Any other suggestions?
@namdn can you try the script train_keras_conformer.py?
Thank you very much. It worked for me.