facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

No decrease of wer when fine tuning wav2vec 2.0 #2685

Closed ezerhouni closed 3 years ago

ezerhouni commented 3 years ago

I am trying to replicate the paper by fine-tuning the Wav2Vec 2.0 base model (the "no fine-tuning" checkpoint) with 1h of Libri-light. As described in the README, I am running the following command:

python3 fairseq/train.py \
    --distributed-world-size 6  /path/to/libri-light/1h \
    --save-dir path/to/model_checkpoint \
    --fp16 \
    --wer-args '("path/4-gram.bin","path/librispeech_lexicon.lst",2,-1)' \
    --post-process letter \
    --valid-subset valid \
    --no-epoch-checkpoints \
    --best-checkpoint-metric wer \
    --num-workers 4 \
    --max-update 13000 \
    --sentence-avg \
    --task audio_pretraining \
    --arch wav2vec_ctc \
    --w2v-path path/to/wav2vec_small.pt \
    --labels ltr \
    --apply-mask \
    --mask-selection static \
    --mask-other 0 \
    --mask-length 10 \
    --mask-prob 0.75 \
    --layerdrop 0.05 \
    --mask-channel-selection static \
    --mask-channel-other 0 \
    --mask-channel-length 64 \
    --mask-channel-prob 0.256 \
    --zero-infinity \
    --feature-grad-mult 0.0 \
    --freeze-finetune-updates 10000 \
    --validate-after-updates 10000 \
    --optimizer adam \
    --adam-betas '(0.9, 0.98)' \
    --adam-eps 1e-08 \
    --lr 1e-04 \
    --lr-scheduler tri_stage \
    --warmup-steps 1300 \
    --hold-steps 5200 \
    --decay-steps 6500 \
    --final-lr-scale 0.05 \
    --final-dropout 0.0 \
    --dropout 0.0 \
    --activation-dropout 0.1 \
    --criterion ctc \
    --attention-dropout 0.0 \
    --max-tokens 1280000 \
    --seed 2337 \
    --log-format json \
    --log-interval 500 \
    --ddp-backend no_c10d

However, valid_raw_wer and valid_wer never go down during training and stay at around 99-100%. valid_uer decreases to ~78% and then goes up again, even though the loss keeps decreasing.

Both the lexicon and the language model seem fine, as I used them to evaluate the already fine-tuned model and got results similar to those reported in the paper.

An example of log:

2020-10-02 12:23:48 | INFO | fairseq.trainer | begin training epoch 40
2020-10-02 12:23:55 | INFO | fairseq_cli.train | begin validation on "valid" subset
2020-10-02 12:24:01 | INFO | valid | {"epoch": 40, "valid_loss": "1942.87", "valid_ntokens": "5368", "valid_nsentences": "30", "valid_nll_loss": "10.858", "valid_uer": "79.49", "valid_wer": "99.797", "valid_raw_wer": "111.663", "valid_wps": "0", "valid_wpb": "5368", "valid_bsz": "30", "valid_num_updates": "317", "valid_best_wer": "99.696"}
2020-10-02 12:24:01 | INFO | fairseq_cli.train | begin save checkpoint

2020-10-02 12:24:04 | INFO | fairseq_cli.train | end of epoch 40 (average epoch stats below)
2020-10-02 12:24:04 | INFO | train | {"epoch": 40, "train_loss": "2211.27", "train_ntokens": "6130.5", "train_nsentences": "32", "train_nll_loss": "11.542", "train_wps": "3088.9", "train_ups": "0.5", "train_wpb": "6130.5", "train_bsz": "32", "train_num_updates": "317", "train_lr": "2.51408e-05", "train_gnorm": "645.876", "train_loss_scale": "16", "train_train_wall": "1", "train_wall": "684"}

Do you have any idea where I should look? Thank you for your help.

alexeib commented 3 years ago

are you finetuning a large model (based on your --max-tokens)? you need to do it on 24 gpus. or you can do it on 4 or 5 (similar to the 6 you are using here), but then you need to set --update-freq 5

for the smaller model you use a batch size 3x bigger, and then you need 8 gpus (or 2 with --update-freq 4)
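The rule of thumb here is that the effective batch size scales as GPUs × update-freq × max-tokens, so the alternatives should roughly match the reference setups. A quick sketch of the arithmetic, using only the numbers above (plain Python, not fairseq code):

# Effective batch size in tokens per update = GPUs * update_freq * max_tokens.
def effective_tokens(gpus, update_freq, max_tokens):
    return gpus * update_freq * max_tokens

print(effective_tokens(24, 1, 1_280_000))     # large model reference: 24 GPUs
print(effective_tokens(5, 5, 1_280_000))      # roughly equivalent: 5 GPUs with --update-freq 5
print(effective_tokens(8, 1, 3 * 1_280_000))  # small model reference: 3x batch on 8 GPUs
print(effective_tokens(2, 4, 3 * 1_280_000))  # or 2 GPUs with --update-freq 4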

ezerhouni commented 3 years ago

I am fine-tuning the small one. I will retry with the following: --distributed-world-size 4 --update-freq 2 (8/4) --max-tokens 1280000*3

Am I correct ?

alexeib commented 3 years ago

yes. if it still doesn't work, try printing examples from ctc.py

ezerhouni commented 3 years ago

I used the following parameters : --distributed-world-size 2 --update-freq 4 --max-tokens 1280000*3

I am still having the same issue. Predicting on a small set of Librispeech:

{'1255-90413-0013': 'NO',
 '1255-90413-0011': 'NO',
 '1255-90413-0021': 'NO',
 '1255-90413-0019': 'NO',
 '1255-90413-0005': 'IN NO',
 '1255-90413-0024': 'IN NO',
 '1255-90413-0017': 'NO',
 '1255-90413-0004': 'NO',
 '1255-90413-0002': 'NOW NO NO',
 '1255-90413-0020': 'AN',
 '1255-90413-0023': 'NO',
 '1255-90413-0015': 'NO NO',
 '1255-90413-0006': 'NO',
 '1255-90413-0009': 'NO NO',
 '1255-90413-0026': 'IN NO',
 '1255-90413-0028': 'NO NO',
 '1255-90413-0025': 'NO NO',
 '1255-90413-0000': 'NO',
 '1255-90413-0027': 'NO NO',
 '1255-90413-0012': 'IN NO'}
alexeib commented 3 years ago

if you share your entire training log i can take a look quickly. you can also get the params that my models were finetuned with from the checkpoints. just torch.load() the checkpoint (which is a python dictionary) and look at the "args" key
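For example, a minimal sketch (the checkpoint filename is illustrative; checkpoints written by newer, hydra-based fairseq store the config under "cfg" instead of "args"):

import torch

# A fairseq checkpoint is a plain Python dictionary; load it on the CPU.
ckpt = torch.load("wav2vec_small_100h.pt", map_location="cpu")

# The fine-tuning hyperparameters live under the "args" key
# ("cfg" in checkpoints from newer fairseq versions).
print(ckpt.get("args", ckpt.get("cfg")))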

ezerhouni commented 3 years ago

I will have a look. Please find enclosed my log (I stopped after 147 epochs as valid_uer was increasing) log-wav2vec-finetuned-small.txt

Thank you very much for your help

alexeib commented 3 years ago

so you only trained for 436 updates in that log. one of the params is --freeze-finetune-updates. it is set to 10k by default and all it does is train a linear layer on top of the model (which by itself does not lead to any meaningful results). then the remaining 3k updates are used to finetune the entire network.

in practice this freezing makes results very slightly more accurate, but you can also just set it to 0 so it starts finetuning right away (you may have to lower the LR a bit, or not). or you can train it all the way to 13k updates (and set --validate-after-updates to 10k or so, because otherwise you will be validating every epoch and it will take days to train)

ezerhouni commented 3 years ago

I set --freeze-finetune-updates 0 and will let it run overnight. Thank you very much for your help.

ArtemisZGL commented 3 years ago

@alexeib I am sorry to ask a stupid question, but does "max-tokens" control the batch size during training?

Also, when I evaluate the fine-tuned model according to https://github.com/pytorch/fairseq/blob/master/examples/wav2vec/README.md, I get an error like 'conflicting option string: --lm-weight'. After I removed one of the lm-score options in add_parser, I got an "unrecognized arguments: --post-process letter" error. And after removing --post-process from the command, I got the error 'Wav2VecCtc' object has no attribute 'max_decoder_positions'. Am I missing something?

Besides, if I only change the world_size, I get the error "AssertionError: Default process group is not initialized", and I could not find where to set distributed_init_method and distributed_num_procs.

alexeib commented 3 years ago

it's the number of frames per batch. we typically do dynamic batching in fairseq, meaning that the batch size is determined by the total length of the sequences as opposed to the number of examples. but you can also use --batch-size to do traditional batching.
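For intuition, since --max-tokens counts raw waveform samples here, the value translates directly into seconds of audio per batch. A back-of-the-envelope sketch with the 16 kHz audio and the small-model value discussed above:

# --max-tokens is the number of input frames (waveform samples) per batch per GPU.
sample_rate = 16_000             # Hz
max_tokens = 3 * 1_280_000       # small-model value suggested earlier in the thread
print(max_tokens / sample_rate)  # 240.0 seconds of audio per batch per GPU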

the errors you see are due to some recent changes we have been making to the fairseq arch. the first one should be fixed. you can replace --post-process with --remove-bpe for now. i will fix it tomorrow

sythello commented 3 years ago

@ArtemisZGL Same issues here! Were you able to resolve the issue, 'Wav2VecCtc' object has no attribute 'max_decoder_positions'? I'm still stuck there.

Thanks!

alexeib commented 3 years ago

can you share your command?

sythello commented 3 years ago

@alexeib Here's my command:

subset=dev-clean
tsv_dir=~/Deep-Learning/Dataset/LibriSpeech/wav2vec2   # contains the letter dict and word dict files, and preprocessed tsv, ltr and wrd files
model_path=~/Deep-Learning/Repos/fairseq/pretrained-models/wav2vec2/wav2vec_small_100h.pt
results_dir=~/Deep-Learning/Repos/fairseq/my/results
kenlm_path=~/Deep-Learning/Repos/kenlm/build/bin

python examples/speech_recognition/infer.py $tsv_dir --task audio_pretraining \
--nbest 1 --path $model_path --gen-subset $subset --results-path $results_dir --w2l-decoder kenlm \
--lm-model $kenlm_path --lm-weight 2 --word-score -1 --sil-weight 0 --criterion ctc --labels ltr --max-tokens 4000000 \
--remove-bpe letter
ArtemisZGL commented 3 years ago

@sythello sorry, I have not solved that problem yet; I just use the validation function in train.py to validate the model.

jiangtaojy commented 3 years ago

> I set --freeze-finetune-updates 0 and will let it run overnight. Thank you very much for your help.

@ezerhouni Hi, I am running into the same issue as you. Have you solved it?

gauravgund commented 3 years ago

Hi @alexeib,

I am following the recent implementation of wav2vec2 for fine-tuning: https://huggingface.co/blog/fine-tune-wav2vec2-english

Settings:
Pretrained model: "facebook/wav2vec2-base-960h" with attention_dropout=0.1, hidden_dropout=0.1, feat_proj_dropout=0.0, mask_time_prob=0.05, layerdrop=0.1, gradient_checkpointing=True, ctc_loss_reduction="mean", pad_token_id=processor.tokenizer.pad_token_id
Training: group_by_length=True, per_device_train_batch_size=32, evaluation_strategy="steps", num_train_epochs=1500, fp16=True, save_steps=400 (the model gets saved every 400 steps, which also means Google Drive fills up), eval_steps=400, logging_steps=400, learning_rate=0.0005, warmup_steps=500, save_total_limit=2

Issue:

Step Training Loss Validation Loss Wer Runtime Samples Per Second
400 5.063200 4.566135 1.000000 0.715900 6.984000
800 5.115200 4.514411 1.000000 0.732400 6.827000
1200 5.119200 4.485986 1.000000 0.724300 6.903000

The training loss is only marginally decreasing and the WER is still 1.0. What can be done to improve accuracy and speed up training?

I also tried a higher learning rate, but the training loss was still very poor; it seems the model is not converging.

Regards.

harunuz commented 3 years ago

Hi,

I also encountered the same problem. While trying to fine-tune the base model on a custom dataset, no matter what I do, WER and UER do not decrease. Things I have already tried:

  • --freeze-finetune-updates is already set to 0
  • tried the cosine lr scheduler (with this the UER fluctuated between 97 and 100)
  • changed the learning rate and update freq
  • tried to debug valid_step; the model only predicts the blank token "|"
  • tried different parameter setups for the wav2vec_ctc model, such as dropout rates, mask probabilities and mask lengths
  • tried different subsets of my custom dataset to see if the issue is data related

Here are the details of the setup I used:

  • fairseq version v0.10.2 (built by cloning and pip install --editable)
  • pytorch 1.7.1
  • cuda 10.1
  • 1 Titan RTX 24 GB
  • python 3.8.10
  • os: Ubuntu 18.04

You can find the training config and train log files for one of the many trials below.

hydra_train.log config.txt

Please share if you have any idea about what is going on. If you need any further information, just ask me and I will provide it.

Thanks for your consideration.

alexeib commented 3 years ago

> (quoting @harunuz's comment above)

there is something wrong with your finetuning dataset. you have an extremely low nll_loss (i.e. loss normalized by token) right from the beginning. usually this loss starts high and decreases (for letter-based models you start getting < 100% WER after it drops below about 4). did you split your examples into letters with spaces in between (and word boundary tokens between words)?

harunuz commented 3 years ago

Yes, I prepared the .ltr files according to libri_labels.py's output. You can see the structure below.

b i r | t e r s l i k | v a r d ı | v e | b u | ş u | a n k i | t e k n o l o j i y l e | n e d e n i y l e | y ı l d a | o r t a l a m a | a l t ı | a r t a r a k | d e v a m | e d i y o r | ş i m d i | i l k | o l a r a k | s i z e | d ü n y a n ı n | i l k | o n | ü l k e s i | a r a s ı n a |

By the way, since I have been using task.labels: ltr, I did not prepare .wrd files, and the training code does not give me any warning.

The dataset I am using is a mix of files from open-source Turkish speech-to-text datasets (Common Voice, vox and openSLR). Everything was preprocessed into mono-channel 16000 Hz wav files, which was the same setup as for the pretraining data.

What I don't understand is that the training loss seems to decrease, meaning it converges to somewhere, but that "somewhere" is not a happy place for the validation data. I also tried using only Common Voice data for both training and validation, just to make sure the train and validation data come from the same distribution, but that was not effective either.
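For reference, a minimal sketch of that label-preparation step in the spirit of libri_labels.py (the transcript and file names are illustrative; the lines must stay in the same order as the .tsv manifest):

# Write matching .wrd (words) and .ltr (letters separated by spaces, with "|"
# as the word-boundary token) files from plain transcripts.
transcripts = ["bir terslik vardı"]   # illustrative; one transcript per .tsv entry, same order

with open("train.wrd", "w", encoding="utf-8") as wrd_f, \
     open("train.ltr", "w", encoding="utf-8") as ltr_f:
    for text in transcripts:
        words = text.split()          # make sure the casing matches dict.ltr.txt
        wrd_f.write(" ".join(words) + "\n")
        ltr_f.write(" ".join(" ".join(word) + " |" for word in words) + "\n")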

rcoulter13 commented 3 years ago

Hello,

I am also experiencing this problem. I am trying to fine-tune Wav2Vec 2.0 Large (960h) on a small technical domain in English, using the pre-trained libri960_big.pt model.

I cannot get my valid WER or raw valid WER to decrease, and my UER stays between 96 and 99 consistently.

Like @harunuz, I also did the following:

  • --freeze-finetune-updates set to 0
  • tried the cosine lr scheduler (no improvement)
  • changed the learning rate and update freq (no improvement)
  • printed results from ctc.py for debugging; my model mostly predicts the tokens 'H', 'L', and '', which eventually, as training continues, devolves to just word delimiters '|'
  • tried various parameter setups, but no change is detected

I also re-recorded my audio files in a quieter setting using Audacity with the sampling rate set to 16K and mono-channel (in contrast to the original audio files that had to be re-sampled and set from stereo to mono channel using SoX). I re-ran again with this new data and had the same results.

I use Google Colab Pro for my setup:

  • fairseq v0.10.2 (also built by cloning and pip install --editable)
  • pytorch 1.9.0+cu102
  • CUDA 11.2
  • Tesla V100
  • python 3.7.10

I used the base_10m config from fairseq's finetuning configs with the following changes for training:

!fairseq-hydra-train \
    distributed_training.distributed_world_size=1 \
    task.data='/path/to/manifest_file' \
    model.w2v_path='/path/to/libri960_big.pt' \
    model.freeze_finetune_updates=0 \
    checkpoint.save_dir='/path/to/save/directory' \
    dataset.validate_interval=5 \
    dataset.validate_after_updates=0 \
    dataset.valid_subset='valid' \
    dataset.batch_size=1 \
    dataset.num_workers=1 \
    optimization.max_epoch=30 \
    +optimization.update_freq='[24]' \
    --config-dir /content/fairseq/examples/wav2vec/config/finetuning \
    --config-name base_10m

Here is my training log: hydra_train.log

Note that distributed_world_size, batch_size, num_workers, update_freq, and the validation interval are all set to conserve memory, as Colab is prone to OOM errors with only one GPU. I only train for 30 epochs because the model gets stuck predicting only word-delimiter tokens at about 26 epochs, and no improvement is made after that.

Like harunuz, I have low nll_loss from the beginning. I double-checked my .ltr, .wrd, dict.ltr.txt, and tsv files to make sure I hadn't incorrectly preprocessed my data. Everything looks good from my perspective.

dict.ltr.txt:
A 1 B 2 C 3 D 4 E 5 F 6 G 7 H 8 I 9 J 10 K 11 L 12 M 13 N 14 O 15 P 16 Q 17 R 18 S 19 T 20 U 21 V 22 W 23 X 24 Y 25 Z 26 ' 27 | 28 29

Sample from train.ltr: A L R I G H T | T H E | T I M E | I S | A P P R O X I M A T E L Y |

Sample from valid.ltr: I ' M | G O I N G | T O | Y O U | K N O W | S T E P | O N E | H E R E | I S | T O | R E M O V E | I T |

Sample from train.wrd: ALRIGHT THE TIME IS APPROXIMATELY

Sample from valid.wrd: I'M GOING TO YOU KNOW STEP ONE HERE IS TO REMOVE IT

I agree with harunuz on the confusion regarding the decrease in training loss. The only explanation I can imagine is that the model is more focused on bringing down the training loss than on accuracy, so it has learned that single characters and word delimiters cost less than other, longer guesses (in the sense of fewer insertions, deletions, etc.), leading to the predictions I'm seeing during training. This is just a theory; I could be very wrong about this.

Do I merely need more data? Is there a quality other than sampling rate and the number of channels I should check in my audio data? I've already cleaned my transcriptions, so I know that there are no unusual characters and that the transcriptions match the audio files. Is the problem with how I set up validation? I feel like I am missing one small detail that is causing this problem, but I can't find what it is.

Thank you in advance for any help you can provide! I have been struggling with this for a while.
alexeib commented 3 years ago

> (quoting @harunuz's comment above)

best advice i can offer is to carefully examine your finetuning setup - maybe print some examples from ctc.py. there is something wrong: the loss values are super low and should be much higher. it is as if everything is mapped to a single token (like "unk") and that's the only thing the model predicts. are you sure your dict and labels match? case-wise etc
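One quick check along those lines is to verify that every symbol in the .ltr labels actually exists in dict.ltr.txt, since symbols missing from the dictionary end up as <unk> (a sketch; the file paths are illustrative):

# Report label symbols that are missing from the fine-tuning dictionary.
with open("dict.ltr.txt", encoding="utf-8") as f:
    dict_symbols = {line.split()[0] for line in f if line.strip()}

missing = set()
with open("train.ltr", encoding="utf-8") as f:
    for line in f:
        missing |= set(line.split()) - dict_symbols

print("symbols missing from dict.ltr.txt:", missing if missing else "none")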

alexeib commented 3 years ago

> (quoting @rcoulter13's comment above)

you only have 10 examples... how long is each example?

you don't need update freq = 24 because you probably fit the entire training set in a single batch.

are you sure your labels are in the same order as the audio in your tsv file?

have you tried finetuning on the 10m/5m/1m subsets from libri-light? those should work out of the box

your nll loss is > 14, which is what one would expect to start at, except it's not going down, which means the model is not learning anything

rcoulter13 commented 3 years ago

> (quoting the exchange above)

Each example is about 10-25 seconds. I also got more recordings and ran it with the larger data set but still had the same results.

Ah okay, I misunderstood the function of update freq then. Thank you for the clarification.

Yes, I double-checked my labels to make sure they are in the same order as in the tsv file and they all are.

I hadn't, but I did try them (libri-light's 1hr/10m/5min/1m subsets) this week and got the same results. I think I figured out the problem though. I believe there are several dependency issues in my installation of flashlight's python bindings (previously wav2letter) in Colab, so it is not doing any decoding during training. I'm actually not sure these python bindings can be successfully installed in Colab due to its environment setup, but I'll keep trying. I apologize for wasting your time with such an unrelated problem. Thank you for your help!
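One way to confirm whether the flashlight python bindings are importable in the environment is to try the imports directly (a sketch; these module paths are the ones used by recent flashlight bindings and may differ between versions):

# If these imports fail, the kenlm/fairseqlm decoders used for WER computation
# cannot be constructed in this environment.
try:
    from flashlight.lib.text.decoder import KenLM
    from flashlight.lib.text.dictionary import create_word_dict, load_words
    print("flashlight text bindings look usable")
except ImportError as err:
    print("flashlight bindings are missing or broken:", err)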

harunuz commented 3 years ago

> (quoting my earlier comment above)

For the record, I finally figured out what was wrong with my experiments. It was neither the finetuning data nor an installation mistake; it was the pretraining itself. I realized that my pretrained model was a disaster. I had gotten almost 98% accuracy on the validation data during pretraining, but my dataset was not well prepared: the sound files had long silences in them, and the validation split was not very good evaluation material.

I tried finetuning the proposed large model that was pretrained on a multilingual dataset, and after about 30k updates I got pretty good results.

Many thanks for your time though, @alexeib.

RubensZimbres commented 2 years ago

This also happened to me when I forgot to include a space in the vocabulary:

vocab=['a','b','c',.......]

to

vocab=[' ','a','b','c',.......]
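In the Hugging Face setup from the blog linked earlier, the equivalent pitfall is a vocab.json without the word-delimiter entry. A minimal sketch (the file name and the tiny vocabulary are illustrative):

import json
from transformers import Wav2Vec2CTCTokenizer

# The CTC tokenizer replaces spaces with word_delimiter_token ("|" by default),
# so that token must exist in the vocabulary or word boundaries are lost.
vocab = {"<pad>": 0, "<unk>": 1, "|": 2, "'": 3, "a": 4, "b": 5, "c": 6}  # truncated, illustrative
with open("vocab.json", "w") as f:
    json.dump(vocab, f)

tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json", unk_token="<unk>", pad_token="<pad>", word_delimiter_token="|"
)
print(tokenizer("a cab").input_ids)  # the space is encoded as the "|" token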