georgesterpu / avsr-tf1

Audio-Visual Speech Recognition using Sequence to Sequence Models
GNU General Public License v3.0

The result in noisy environment #25

Open xjwla opened 3 years ago

xjwla commented 3 years ago

Hi,

Thank you for your open-source code. With your help, I have reproduced the results under clean (noiseless) conditions. In the audio-only (AO) setting, I get a CER of 19.48% and a WER of 44.91%. However, with 10 dB cafeteria noise, training does not converge at all. Do any parameters need to be modified for the noisy case? Could you please give me some suggestions?

Thanks a lot.

xjwla commented 3 years ago

Hi,

I initialised the noisy-condition model with the weights of the audio-only model trained on clean speech, instead of training from scratch. The results improved under 10 dB cafeteria noise: CER 35.64%, WER 62.39%. However, this still does not reach the results in the paper (CER 25.61%, WER 54.48%). Could you please give me some suggestions?

Thanks a lot.

georgesterpu commented 3 years ago

Hi @xjwla

Are you reproducing the results from our ICMI'18 article on TCD-TIMIT?

Yes, my training pipeline involves a multi-stage process where the same model is fine-tuned on gradually increasing levels of audio noise (i.e. decreasing SNRs). You can see this as a form of curriculum learning. Otherwise, it would be difficult to learn good representations directly on noisy data samples.

A signal-to-noise ratio of 10 dB is still a relatively easy condition, so there shouldn't be large differences compared to clean speech. At least on LRS2, I don't remember ever seeing convergence issues down to 10 dB SNR.

Are you using the code in this repository, or have you re-implemented the networks in your own framework? What about the data pipeline? Can you listen to a few samples to find out whether they match their advertised SNR? Seq2seq networks with LSTMs are quite tricky to train, particularly on a small dataset, but the default settings in this repository ensure that you have all the bells and whistles enabled (in TF 1.x!).
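For a more objective check than listening, something like the rough sketch below should work, assuming you keep time-aligned clean/noisy wav pairs of the same samples (the file paths are just placeholders):

```python
# Rough sketch: estimate the effective SNR of a noisy file against its
# time-aligned clean counterpart. File paths below are placeholders.
import numpy as np
from scipy.io import wavfile

def effective_snr_db(clean_path, noisy_path):
    _, clean = wavfile.read(clean_path)
    _, noisy = wavfile.read(noisy_path)
    clean = clean.astype(np.float64)
    noisy = noisy.astype(np.float64)
    n = min(len(clean), len(noisy))      # guard against off-by-one lengths
    noise = noisy[:n] - clean[:n]        # recover the additive noise component
    return 10.0 * np.log10(np.sum(clean[:n] ** 2) / np.sum(noise ** 2))

print(effective_snr_db('clean/sample.wav', 'cafe_10db/sample.wav'))
```

If the estimated value is far from 10 dB on several samples, the problem is in the record generation rather than in the model.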

Without specific information about your experiment, there is a large number of possible causes. My best advice would be to first validate your current setup on a more established audio-only dataset such as LibriSpeech, so you can rule out potential issues in the code or in the data pipeline.

xjwla commented 3 years ago

Thank you very much for your reply.

Yes, I am reproducing the ICMI'18 results on TCD-TIMIT, and I am using the code in this repository. I have successfully reproduced the results in your paper on clean speech, but I run into the problems described above under noisy conditions. I use write_records_tcd.py from this repository to generate the TFRecord files with added noise. The only difference from the default settings is that I changed the feature type from 'logmel_stack_w8s3' to 'logmel'.

I am now looking for the cause, following your suggestions. The networks are unchanged, yet the results on speech with 10 dB noise are much worse than on clean speech. Maybe I got something wrong with 'write_records_tcd.py'?
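For reference, this is how I understand additive noise at a target SNR is usually computed; it may differ from what write_records_tcd.py actually does:

```python
# Generic additive-noise mixing at a target SNR (in dB).
# This is the standard formulation, not necessarily the exact logic
# in write_records_tcd.py.
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    noise = noise[:len(clean)]          # assumes the noise clip is long enough
    clean_power = np.mean(clean.astype(np.float64) ** 2)
    noise_power = np.mean(noise.astype(np.float64) ** 2)
    # Scale the noise so that clean_power / (scale**2 * noise_power) = 10**(snr_db/10)
    scale = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise
```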

Thank you so much for your kind help.

georgesterpu commented 3 years ago

Thanks a lot for the clarifications, @xjwla

Hmm, I reckon that the audio sequence pre-processing could have a big impact on attention-based seq2seq models.

The main difference between logmel and logmel_stack_w8s3 is the feature frame rate and the amount of information per frame. The former computes the log magnitude spectrum from a short-time Fourier transform with a frame length of 25 ms and a step size of 10 ms. The latter stacks a window of 8 consecutive STFT frames and applies a stride of 3, so the frame rate decreases by a factor of 3 and the receptive field grows to about 95 ms per frame (80 + 15). Some research literature suggests that low frame rates are necessary for the CTC/RNN-T model family. This repository implements a seq2seq model with global (full-utterance) attention, which can become prohibitively expensive to train on long, high frame-rate input sequences.

Can you try the logmel_stack_w8s3 transform and see whether it makes a difference at 10 dB SNR? As you can see in the code, stack_w8s3 is a simple post-processing of logmel, so it doesn't change the underlying data samples.
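For illustration, the stacking described above is conceptually something like this (a rough numpy sketch, not the exact implementation in this repository):

```python
# Conceptual sketch of the w8s3 stacking: concatenate 8 consecutive log-mel
# frames into one super-frame and advance by 3 frames each step.
import numpy as np

def stack_frames(features, window=8, stride=3):
    # features: [num_frames, feature_dim] matrix; assumes num_frames >= window
    stacked = [features[i:i + window].reshape(-1)
               for i in range(0, len(features) - window + 1, stride)]
    return np.stack(stacked)   # shape: [~num_frames / stride, window * feature_dim]
```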

xjwla commented 3 years ago

I tried the logmel_stack_w8s3 transform at 10 dB SNR. Unfortunately, it didn't make a large difference: the CER is 31.07% and the WER is 58.53%. My current procedure at 10 dB SNR is: first, I load the model parameters trained on clean speech (CER 20.85%, WER 45.58% on clean speech) and train until the error rate no longer decreases; then I reduce the learning rate from 0.001 to 0.0001. The results are still not ideal. Do I need to change any parameters compared to training under clean conditions? Thanks a lot.

georgesterpu commented 3 years ago

The example run_audio.py script is designed so that you can launch a full experiment under conditions very similar to those described in the article, except for the number of epochs per noise level. If you are using a modified version of this script (e.g. with your own data paths), could you please paste its contents here? The default parameters of the AVSR class can be overridden from the main launch script through kwargs, if needed. To answer your question: you don't need to change any hyper-parameters across noise conditions.

If you provide all the audio record files (i.e. clean, 10 dB, 0 dB, -5 dB), there is no need to manually load model parameters from checkpoints; the AVSR.train method takes care of that. Again, training directly on noisy samples is likely to worsen the accuracy on TCD-TIMIT, and I would like to get a clearer picture of the experiment you are running.
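To make the staged training idea concrete, what happens internally is roughly equivalent to the loop below. This is only a simplified sketch: train_one_stage is a placeholder standing in for the actual training code in this repository, and the file names and epoch counts are arbitrary.

```python
def train_one_stage(record_file, epochs, restore_from=None):
    # Placeholder: in the real code this would build the graph, optionally
    # restore `restore_from`, train for `epochs` epochs on `record_file`,
    # and return the checkpoint written at the end of the stage.
    print('training on {} for {} epochs (restored from {})'.format(
        record_file, epochs, restore_from))
    return 'checkpoints/' + record_file + '.ckpt'

# Arbitrary placeholder file names and epoch counts, ordered clean -> noisy.
stages = [('train_clean.tfrecord', 100),
          ('train_cafe_10db.tfrecord', 50),
          ('train_cafe_0db.tfrecord', 50),
          ('train_cafe_m5db.tfrecord', 50)]

checkpoint = None
for record_file, epochs in stages:
    checkpoint = train_one_stage(record_file, epochs, restore_from=checkpoint)
```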

Also, what dataset partitioning are you using?

I hope this helps. Please let me know if you find the cause of your issue.