TensorSpeech / TensorFlowASR

:zap: TensorFlowASR: Almost State-of-the-art Automatic Speech Recognition in Tensorflow 2. Supported languages that can use characters or subwords
https://huylenguyen.com/asr
Apache License 2.0

transducer_loss is negative #32

Closed · Honghe closed this issue 4 years ago

Honghe commented 4 years ago

Environment:

Command:

python examples/conformer/train_subword_conformer.py --tbs 8 --ebs 8 --mxp --devices 0 --cache --subwords ./output/librispeed \
--subwords_corpus \
/home/ubuntu/Data/LibriSpeechConformer/LibriSpeech/train-clean-100/transcripts.tsv \
/home/ubuntu/Data/LibriSpeechConformer/LibriSpeech/test-clean/transcripts.tsv \
/home/ubuntu/Data/LibriSpeechConformer/LibriSpeech/dev-clean/transcripts.tsv \
/home/ubuntu/Data/LibriSpeechConformer/LibriSpeech/dev-other/transcripts.tsv 

Log:

Model: "conformer"
________________________________________________________________________________________________________________________
Layer (type)                                          Output Shape                                    Param #           
========================================================================================================================
conformer_encoder (ConformerEncoder)                  (None, None, 144)                               8710848           
________________________________________________________________________________________________________________________
conformer_prediction (TransducerPrediction)           (None, None, 320)                               1151040           
________________________________________________________________________________________________________________________
conformer_joint (TransducerJoint)                     (None, None, None, 1031)                        479751            
========================================================================================================================
Total params: 10,341,639
Trainable params: 10,337,031
Non-trainable params: 4,608
________________________________________________________________________________________________________________________
Reading /home/ubuntu/Data/LibriSpeechConformer/LibriSpeech/train-clean-100/transcripts.tsv ...
Reading /home/ubuntu/Data/LibriSpeechConformer/LibriSpeech/dev-clean/transcripts.tsv ...
Reading /home/ubuntu/Data/LibriSpeechConformer/LibriSpeech/dev-other/transcripts.tsv ...
[Train] |                    | 0/71340 [00:00<?, ?batch/s]2020-10-21 10:26:47.344713: W tensorflow/stream_executor/gpu/asm_compiler.cc:81] Running ptxas --version returned 256
2020-10-21 10:26:47.417289: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] Internal: ptxas exited with non-zero error code 256, output: 
Relying on driver to perform ptx compilation. 
Modify $PATH to customize ptxas location.
This message will be only logged once.
[Train] [Epoch 1/20] |                    | 45/71340 [02:59<71:52:55,  3.63s/batch, transducer_loss=-298.91907] 
nglehuy commented 4 years ago

Hi @Honghe, the cause is a broken installation of the RNNT loss. Remember to export CUDA_HOME=/usr/local/cuda before running scripts/install_rnnt_loss.sh, so the loss is built against CUDA.
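
After rebuilding, a quick smoke test can confirm the binding is healthy (a sketch; the rnnt_loss signature here assumes the HawkAaron/warp-transducer TensorFlow binding that the install script builds):

```python
import numpy as np
import tensorflow as tf
from warprnnt_tensorflow import rnnt_loss

# The RNNT loss is a negative log-likelihood, so a healthy build returns a
# positive value; a broken CUDA build is what produces the negative
# transducer_loss seen in the log above.
B, T, U, V = 1, 4, 2, 5  # batch, time steps, label length, vocab size
acts = tf.constant(np.random.randn(B, T, U + 1, V), dtype=tf.float32)
labels = tf.constant([[1, 2]], dtype=tf.int32)
input_lengths = tf.constant([T], dtype=tf.int32)
label_lengths = tf.constant([U], dtype=tf.int32)

loss = rnnt_loss(acts, labels, input_lengths, label_lengths, blank_label=0)
print(loss.numpy())  # expect a value > 0
```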

Honghe commented 4 years ago

After rebuilding warprnnt_tensorflow with CUDA, it seems to be running correctly:

[Train] [Epoch 1/20] |                    | 56/142680 [00:42<6:16:25,  6.31batch/s, transducer_loss=2464.3044]  

But it will take about 6 hours to train one epoch; is this normal? Environment: i7 CPU, one 2080 Ti GPU.

nglehuy commented 4 years ago

@Honghe in your case it's 6 hours in total for all 20 epochs, not per epoch: the progress bar counts 142680 batches across the whole run, and at ~6.31 batch/s that comes to roughly the 6:16:25 shown.

Honghe commented 4 years ago

@usimarit --tbs 8 --ebs 8 makes TensorFlow hit a CUDA OOM after a few steps; is there a memory leak somewhere? Batch size 8 is a little small; how can I effectively increase the batch size?

python examples/conformer/train_subword_conformer.py --tbs 8 --ebs 8 --mxp --devices 0 --cache --subwords ./output/librispeed \
--subwords_corpus \
/home/ubuntu/Data/LibriSpeechConformer/LibriSpeech/train-clean-100/transcripts.tsv \
/home/ubuntu/Data/LibriSpeechConformer/LibriSpeech/test-clean/transcripts.tsv \
/home/ubuntu/Data/LibriSpeechConformer/LibriSpeech/dev-clean/transcripts.tsv \
/home/ubuntu/Data/LibriSpeechConformer/LibriSpeech/dev-other/transcripts.tsv 
nglehuy commented 4 years ago

@Honghe I'm working on gradient accumulation to increase the effective batch size. Technically, a small batch size is normal when training ASR, and there's no memory leak: this model requires a lot of memory (due to multi-head self-attention) despite its small number of parameters. A sketch of the idea is below.
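
The idea behind gradient accumulation, as a minimal sketch (illustrative only, not TensorFlowASR's actual implementation): gradients from several small batches are summed and applied once, emulating a larger effective batch size at the same peak memory.

```python
import tensorflow as tf

ACCUM_STEPS = 4  # effective batch size = ACCUM_STEPS x per-step batch size

def train_epoch(model, dataset, loss_fn, optimizer):
    # Running sum of gradients, one buffer per trainable variable.
    accum = [tf.zeros_like(v) for v in model.trainable_variables]
    for step, (features, labels) in enumerate(dataset):
        with tf.GradientTape() as tape:
            logits = model(features, training=True)
            # Scale so the accumulated gradient is an average, not a sum.
            loss = loss_fn(labels, logits) / ACCUM_STEPS
        grads = tape.gradient(loss, model.trainable_variables)
        accum = [a + g for a, g in zip(accum, grads)]
        if (step + 1) % ACCUM_STEPS == 0:
            optimizer.apply_gradients(zip(accum, model.trainable_variables))
            accum = [tf.zeros_like(v) for v in model.trainable_variables]
```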

namdn commented 3 years ago

I already installed the RNNT loss with export CUDA_HOME=/usr/local/cuda and scripts/install_rnnt_loss.sh, but the transducer_loss is still negative. Can anyone help me? Thanks.

[TensorBoard screenshot]

nglehuy commented 3 years ago

@namdn Please use rnnt_loss in TF; warp-transducer is deprecated.

namdn commented 3 years ago

@usimarit how can I run without warp-transducer?

namdn commented 3 years ago

I mean I did what you say in the README: for training transducer models with the RNNT loss in TF, make sure that warp-transducer is not installed (by simply running pip3 uninstall warprnnt-tensorflow). But the transducer_loss is still negative.
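
A quick way to confirm which loss backend gets picked up (a sketch; it assumes TensorFlowASR falls back to its pure-TF RNNT loss whenever the warprnnt_tensorflow package is absent, as the README describes):

```python
# If the import succeeds, the deprecated warp-transducer loss is still on the
# path and will shadow the pure-TF implementation.
try:
    import warprnnt_tensorflow  # noqa: F401
    print("warprnnt_tensorflow installed -> warp-transducer loss will be used")
except ImportError:
    print("warprnnt_tensorflow not found -> pure-TF rnnt_loss will be used")
```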

nglehuy commented 3 years ago

@namdn then please check your dataset and vocabulary file; make sure the vocabulary file covers all of the characters or subwords in your dataset.
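
One way to spot uncovered tokens (a sketch, assuming a characters-style vocabulary file with one token per line and '#' comments, and the transcripts.tsv layout used above with the transcript text in the last tab-separated column; the paths are hypothetical):

```python
def missing_tokens(vocab_path, transcript_tsv):
    # Collect the vocabulary, skipping blank lines and comments.
    with open(vocab_path, encoding="utf-8") as f:
        vocab = {ln.strip() for ln in f if ln.strip() and not ln.startswith("#")}
    missing = set()
    with open(transcript_tsv, encoding="utf-8") as f:
        next(f)  # skip the header row, if present
        for ln in f:
            text = ln.rstrip("\n").split("\t")[-1]
            missing.update(ch for ch in text if ch not in vocab and ch != " ")
    return missing

# Hypothetical paths for illustration.
print(missing_tokens("my_language.characters", "transcripts.tsv"))
```

Any character it prints is not representable by the featurizer and is a likely culprit.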

nglehuy commented 3 years ago

@namdn if you are using subwords and tfrecords, make sure you pass the same subword file that was used to generate the tfrecords.

namdn commented 3 years ago

Now I am using CharFeaturizer as the text_featurizer and the train_conformer.py script for training. In config.yml, should the value of decoder_config.vocabulary be the language characters file? I pointed it at my language's characters file and the loss is still negative. Any other suggestions?

nglehuy commented 3 years ago

@namdn can you try the script train_keras_conformer.py?

namdn commented 3 years ago

Thank you very much. It worked for me.