Closed: Honghe closed this issue 4 years ago.
Hi @Honghe, the cause is an incorrect installation of the RNNT loss. Remember to export CUDA_HOME=/usr/local/cuda before running scripts/install_rnnt_loss.sh.
After rebuilding warprnnt_tensorflow with CUDA, it seems to be running correctly.
[Train] [Epoch 1/20] | | 56/142680 [00:42<6:16:25, 6.31batch/s, transducer_loss=2464.3044]
But it will take about 6 hours to train one epoch; is this normal? Environment: i7 CPU, one 2080 Ti GPU.
@Honghe in your case, the 6 hours is the total for all 20 epochs, not for one epoch: the progress bar counts 142680 steps across the whole run, and 142680 steps at 6.31 batch/s is roughly 6.3 hours.
@usimarit running with --tbs 8 --ebs 8 makes TensorFlow hit a CUDA out-of-memory error after a few steps; is there a memory leak somewhere? A batch size of 8 is already quite small, so how can I effectively increase the batch size? The full command:
```bash
python examples/conformer/train_subword_conformer.py --tbs 8 --ebs 8 --mxp --devices 0 --cache --subwords ./output/librispeed \
--subwords_corpus \
/home/ubuntu/Data/LibriSpeechConformer/LibriSpeech/train-clean-100/transcripts.tsv \
/home/ubuntu/Data/LibriSpeechConformer/LibriSpeech/test-clean/transcripts.tsv \
/home/ubuntu/Data/LibriSpeechConformer/LibriSpeech/dev-clean/transcripts.tsv \
/home/ubuntu/Data/LibriSpeechConformer/LibriSpeech/dev-other/transcripts.tsv
```
@Honghe I'm working on gradient accumulation to increase the effective batch size. Technically, a small batch size is normal when training ASR; there is no memory leak. This model simply requires a lot of memory (due to multi-head self-attention), regardless of its small number of parameters.
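For illustration, here is a minimal sketch of what gradient accumulation looks like in plain TensorFlow. This is not TensorFlowASR's implementation; the names (accumulated_step, accum_steps, loss_fn) are made up for the example.

```python
import tensorflow as tf

# Minimal gradient-accumulation sketch (illustrative only). Gradients from
# several small "micro-batches" are summed before a single optimizer update,
# so a GPU that only fits batch size 8 behaves roughly like 8 * accum_steps.
accum_steps = 4
optimizer = tf.keras.optimizers.Adam(1e-4)

def accumulated_step(model, loss_fn, micro_batches):
    """micro_batches: a list of `accum_steps` (features, labels) pairs."""
    accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]
    for features, labels in micro_batches:
        with tf.GradientTape() as tape:
            predictions = model(features, training=True)
            # Divide by accum_steps so the summed gradient matches the mean
            # gradient of one large batch.
            loss = loss_fn(labels, predictions) / accum_steps
        grads = tape.gradient(loss, model.trainable_variables)
        accum_grads = [
            a + g if g is not None else a
            for a, g in zip(accum_grads, grads)
        ]
    optimizer.apply_gradients(zip(accum_grads, model.trainable_variables))
```

The trade-off is wall-clock time: each optimizer update now runs accum_steps forward/backward passes, but peak memory stays at the micro-batch size.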
I already installed the RNNT loss with export CUDA_HOME=/usr/local/cuda and scripts/install_rnnt_loss.sh, but the transducer_loss is still negative. Can anyone help me? Thanks.
@namdn Please use the rnnt_loss implemented in TF; warp-transducer is deprecated.
@usimarit how can I run without warp-transducer? I mean I did what the README says: "For training Transducer Models with RNNT Loss in TF, make sure that warp-transducer is not installed (by simply running pip3 uninstall warprnnt-tensorflow)." But the transducer_loss is still negative.
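One way to double-check that the warp-transducer binding is really gone from the environment is simply to try importing it. The assumption here, taken from the README line quoted above, is that the pure-TF RNNT loss is used whenever warprnnt_tensorflow cannot be imported.

```python
# Sanity check: which RNNT loss backend would be picked up?
# Assumption (from the README): the pure-TF rnnt_loss is used only when the
# warprnnt_tensorflow package cannot be imported.
try:
    import warprnnt_tensorflow  # noqa: F401
    print("warprnnt_tensorflow is importable -> the warp-transducer loss would be used.")
    print("Run: pip3 uninstall warprnnt-tensorflow")
except ImportError:
    print("warprnnt_tensorflow not found -> the pure-TF RNNT loss will be used.")
```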
@namdn then please check your dataset and vocabulary file; make sure the vocabulary file covers all of the characters or subwords in your dataset.
@namdn if you are using subwords and tfrecords, make sure you pass the same subword file that was used to generate the tfrecords.
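A rough script like the one below can check the first point (vocabulary coverage). It assumes the character vocabulary is a plain text file with one token per line (lines starting with # treated as comments) and that the transcript is the last tab-separated column of transcripts.tsv; the paths are placeholders, not from this thread.

```python
# Rough vocabulary-coverage check (assumptions: one token per line in the
# vocabulary file, '#' lines are comments, transcript is the last
# tab-separated column of each transcripts.tsv). Paths are placeholders.
vocab_path = "my_language.characters"
transcript_paths = [
    "/path/to/train-clean-100/transcripts.tsv",
]

with open(vocab_path, encoding="utf-8") as f:
    vocab = {line.rstrip("\n") for line in f if line.strip() and not line.startswith("#")}

missing = set()
for path in transcript_paths:
    with open(path, encoding="utf-8") as f:
        next(f, None)  # skip a header row if the file has one
        for line in f:
            text = line.rstrip("\n").split("\t")[-1]
            missing.update(ch for ch in text if ch not in vocab)

if missing:
    print("Characters missing from the vocabulary:", sorted(missing))
else:
    print("All transcript characters are covered by the vocabulary.")
```

If anything is reported, out-of-vocabulary characters in the labels are one plausible cause of the strange transducer_loss values described above.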
Now I am using CharFeaturizer as the text featurizer and the script train_conformer.py for training. In config.yml, should the value of decoder_config.vocabulary be a language.characters file? I set it to my language's characters file and the loss is still negative. Any other suggestions?
@namdn can you try the script train_keras_conformer.py?
Thank you very much. It worked for me.