HPI-DeepLearning / crnn-lid

Code for the paper Language Identification Using Deep Convolutional Recurrent Neural Networks
GNU General Public License v3.0
105 stars 48 forks

Model Overfitting #39

Open Vadim2S opened 1 year ago

Vadim2S commented 1 year ago

I tried 4 different datasets. The biggest one contains 4 languages with 20600 PNGs per language, each a 10-second spectrogram.

No luck. Train accuracy is 0.97, while validation and test accuracy is only 0.2 - 0.4. What dataset size should I use?

P.S. I used your default config. I changed the code slightly to use Keras 2 and TensorFlow 1.14.

Themba4Sho commented 1 year ago

Hey there. 20600 specs should be enough to get you decent results. My two cents would be to: 1) check the language distribution in your datasets: is there a language that is unreasonably overrepresented compared to the rest? 2) Is your test data drawn from the same 20600? If it is not, it's possible that the test set is too different from your training data, and you might need to try some augmentation techniques to accommodate the uniqueness of your test set.
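For point 1), a quick distribution check could look like the hypothetical helper below. It assumes the CSV produced by a create_csv.py-style script has simple `path,label` rows; adjust the column index if the actual file differs.

```python
import csv
from collections import Counter

def label_distribution(csv_path):
    """Count how many spectrograms each language label has.

    Assumes each row is "path,label" (hypothetical format for this
    repo's CSV output -- adjust row[1] to the real label column).
    """
    with open(csv_path, newline="") as f:
        counts = Counter(row[1] for row in csv.reader(f) if row)
    return counts
```

If one language dominates the counts, the training accuracy can look good while the model just learns the majority class.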

Vadim2S commented 1 year ago

The dataset preparation code of this project guarantees an equal number of specs for each language. However, I do not normalize the audio beforehand (neither do the authors).

I just ran wav_to_spectrogram.py, create_csv.py, and train.py with the default config.yaml and the topcoder_crnn_finetune model. Later I also tried the inceptionv3_crnn.py model.

I tried:

1) Voxforge dataset: 5 languages with 10020 specs each, 100 pixels per second, i.e. 5-second specs with the default input shape
2) Mozilla_Common_Voice: 4 languages with 16650 specs each, 100 pixels per second, i.e. 5-second specs
3) MTEDX dataset: 4 languages with 21000 specs each, 50 pixels per second, i.e. 10-second specs
4) MTEDX dataset: 3 languages with 43000 specs each, 50 pixels per second, i.e. 10-second specs
5) My own native dataset: 9 languages with 5000 specs each, 100 pixels per second

The MTEDX dataset can be found here: https://www.openslr.org/100/

Results:

Voxforge dataset - 5 languages with 10020 specs each - topcoder_crnn_finetune - overfitting:
loss: 0.4348 - accuracy: 0.9660 - val_loss: 17.8252 - val_accuracy: 0.2000
Epoch 00013: val_accuracy did not improve from 0.28015
Epoch 00013: early stopping

MTEDX dataset - 4 languages with 21000 specs each - topcoder_crnn_finetune - overfitting:
loss: 0.3180 - accuracy: 0.9607 - val_loss: 3.1414 - val_accuracy: 0.4399
Epoch 00013: val_accuracy did not improve from 0.67722
Epoch 00013: early stopping

MTEDX dataset - 3 languages with 43000 specs each - topcoder_crnn_finetune - overfitting:
loss: 0.3966 - accuracy: 0.9733 - val_loss: 6.0934 - val_accuracy: 0.3294
Epoch 00014: val_accuracy did not improve from 0.33373
Epoch 00014: early stopping

MTEDX dataset - 3 languages with 43000 specs each - inceptionv3_crnn - just a very poor result:
loss: 0.9407 - accuracy: 0.5551 - val_loss: 0.9851 - val_accuracy: 0.5081
Epoch 00031: val_accuracy did not improve from 0.53137
Epoch 00031: early stopping

Here is an interesting experiment: I commented out the layer.trainable = False line in the model. The result is much better, but still bad:

MTEDX dataset - 3 languages with 43000 specs each - topcoder_crnn_finetune - overfitting:
loss: 0.2834 - accuracy: 0.9531 - val_loss: 1.4304 - val_accuracy: 0.6554
Epoch 00016: val_accuracy did not improve from 0.82805
Epoch 00016: early stopping

No luck.
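For reference, the unfreezing experiment above (commenting out layer.trainable = False) can also be done partially, unfreezing only the last few layers, which is a common middle ground when full fine-tuning overfits. A duck-typed sketch (`set_trainable` is a hypothetical helper, not part of this repo; it works on anything with a Keras-style `.layers` list):

```python
def set_trainable(model, unfreeze_last_n=None):
    """Control which layers of a Keras-style model are trainable.

    If unfreeze_last_n is None, every layer becomes trainable (the same
    effect as commenting out `layer.trainable = False`); otherwise only
    the last n layers are unfrozen, keeping the pretrained features fixed.
    """
    layers = model.layers
    cutoff = 0 if unfreeze_last_n is None else max(0, len(layers) - unfreeze_last_n)
    for i, layer in enumerate(layers):
        layer.trainable = i >= cutoff
    return model
```

Note that in Keras the model must be compiled again after changing `trainable` for the change to take effect.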

P.S. The model conversion is very simple; a code sample is below:

Original:

    model.add(Convolution2D(16, 7, 7, W_regularizer=l2(weight_decay), activation="relu", input_shape=input_shape))

My code for Keras 2 and TF 1.14:

    model.add(Conv2D(16, (7, 7), activation="relu", input_shape=input_shape, kernel_regularizer=l2(weight_decay)))
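The same rename pattern covers the other Keras 1 arguments that appear in this repo's model files. A small sketch of the mapping (keyword names taken from the Keras 2.0 API changes; `port_kwargs` is just an illustrative helper, not part of the repo):

```python
# Common Keras 1 -> Keras 2 keyword renames, useful when porting
# the remaining model definitions by hand.
KERAS1_TO_KERAS2_KWARGS = {
    "W_regularizer": "kernel_regularizer",
    "b_regularizer": "bias_regularizer",
    "init": "kernel_initializer",
    "border_mode": "padding",
    "subsample": "strides",
    "nb_filter": "filters",
}

def port_kwargs(kwargs):
    """Rename Keras 1 keyword arguments to their Keras 2 equivalents."""
    return {KERAS1_TO_KERAS2_KWARGS.get(k, k): v for k, v in kwargs.items()}
```

Remember that the filter size also moves from two positional arguments (`7, 7`) to a single tuple (`(7, 7)`), as in the sample above.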