ARBML / klaam

Arabic speech recognition, classification and text-to-speech.
MIT License

Functionality to split/align audio segments for training #2

Closed · othrif closed this issue 3 years ago

othrif commented 3 years ago

The audio in two of the datasets we are using (MGB3 and MGB5) comes in long recordings of tens of minutes. This is impractical for GPU training: long audio sequences cause out-of-memory errors on GPUs even with a small batch size.

The solution is to split the audio into smaller segments of 15 to 30 seconds, depending on the hardware (mainly GPU memory).

This issue tracks adding functionality to split the audio into smaller chunks that fit into GPU memory.
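
As a rough illustration of the intended functionality (a minimal sketch, not the final implementation; file paths and the 20-second chunk length are placeholders), splitting a long recording into fixed-length WAV segments could look like this:

import soundfile as sf

def split_audio(path, out_prefix, chunk_seconds=20):
    # Load the full waveform, then write consecutive fixed-length chunks.
    audio, sr = sf.read(path)
    chunk_len = int(chunk_seconds * sr)
    for i in range(0, len(audio), chunk_len):
        segment = audio[i:i + chunk_len]
        sf.write(f"{out_prefix}_{i // chunk_len:05d}.wav", segment, sr)

split_audio("mgb5_episode.wav", "segments/mgb5_episode", chunk_seconds=20)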

othrif commented 3 years ago

Current development branch: https://github.com/ARBML/klaam/tree/othrif/feat/ma

@zaidalyafeai

othrif commented 3 years ago

Duration is fixed to less than 15 s. [duration_hist: histogram of clip durations]

zaidalyafeai commented 3 years ago

I would only include the first n seconds of an audio file. For example, in ADI-5 I only utilize the first 20 seconds for classification.

https://github.com/ARBML/klaam/blob/330676a562222a10f40faa10a6556c1e5b0d0542/run_classifier.py#L334
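
For reference, the "first n seconds" idea looks roughly like the sketch below (illustrative only, not the exact code in run_classifier.py; the 20-second cap and 16 kHz sample rate are assumptions based on the discussion):

import librosa

def load_first_seconds(path, max_seconds=20, sr=16000):
    # Load and resample, then keep only the first max_seconds of audio.
    audio, _ = librosa.load(path, sr=sr)
    return audio[: int(max_seconds * sr)]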

zaidalyafeai commented 3 years ago

@othrif doesn't MGB3 contain timestamps of 8 seconds?

othrif commented 3 years ago

The histogram above shows the distribution of audio clip durations in seconds; as you can see, almost all clips are shorter than 15 seconds. One cannot really take just the first 20 seconds, since there are only about 100 files and many of them are an hour long or more. Using the provided timestamp splits is the way to go.

Here is my current implementation: https://github.com/ARBML/klaam/blob/othrif/feat/ma/run_recognition.py#L401-L412 which is very similar to what you have.
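
In the same spirit, timestamp-based segmentation can be sketched as follows (the segment dictionary format here is an assumption for illustration, not the MGB3/MGB5 annotation format):

import soundfile as sf

def cut_segments(path, segments):
    # segments: list of {"start": float_seconds, "end": float_seconds, "text": str}
    audio, sr = sf.read(path)
    for seg in segments:
        start, end = int(seg["start"] * sr), int(seg["end"] * sr)
        yield seg["text"], audio[start:end]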

zaidalyafeai commented 3 years ago

That makes sense, I confused the classification task with the recognition task. This implementation should not cause out-of-memory errors, right?

othrif commented 3 years ago

Yes, correct! However, during training I am running into a NaN validation loss, which I haven't solved yet.

zaidalyafeai commented 3 years ago

I remember some folks on Slack talking about corrupted files causing such errors. Maybe first try limiting the datasets using max_train_samples and max_val_samples.
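
Limiting the datasets typically follows the pattern used in the Hugging Face fine-tuning scripts (variable names below are assumptions, not necessarily the ones in run_mgb3.py):

if data_args.max_train_samples is not None:
    train_dataset = train_dataset.select(range(data_args.max_train_samples))
if data_args.max_val_samples is not None:
    eval_dataset = eval_dataset.select(range(data_args.max_val_samples))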

othrif commented 3 years ago

Unfortunately, that did not work. I will test with Buckwalter transliteration to see if that helps.
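
A minimal way to try Buckwalter labels, assuming the lang-trans package is available (pip install lang-trans; this is only a sketch of the idea, not the repo's code):

from lang_trans.arabic import buckwalter

arabic_text = "السلام عليكم"
ascii_labels = buckwalter.transliterate(arabic_text)   # Arabic script -> Buckwalter ASCII
restored = buckwalter.untransliterate(ascii_labels)    # Buckwalter ASCII -> Arabic script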

zaidalyafeai commented 3 years ago

I will debug over the weekend.

zaidalyafeai commented 3 years ago

@othrif which dataset gives you NaNs? I tested with MGB3 and I get good results, at least with 100 train samples.

{'eval_loss': 12.992751121520996, 'eval_wer': 0.9994222992489891, 'eval_runtime': 11.8918, 'eval_samples_per_second': 8.409, 'epoch': 0.57}

Can you try the following script?

python run_mgb3.py \
    --model_name_or_path="facebook/wav2vec2-large-xlsr-53" \
    --output_dir=out_dir \
    --cache_dir=cache_dir \
    --freeze_feature_extractor \
    --num_train_epochs="50" \
    --per_device_train_batch_size="16" \
    --preprocessing_num_workers="1" \
    --learning_rate="3e-5" \
    --warmup_steps="20" \
    --evaluation_strategy="steps"\
    --save_steps="1" \
    --eval_steps="1" \
    --save_total_limit="1" \
    --logging_steps="100" \
    --do_eval \
    --do_train \
    --max_train_samples 100 \
    --max_val_samples 100

zaidalyafeai commented 3 years ago

The problem seems to happen in MGB-5. Setting ctc_zero_infinity=True seems to resolve it. I added a run_mgb5.py file.
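
For anyone following along, the flag can be passed when loading the model; ctc_zero_infinity zeroes out infinite CTC losses (e.g. when a target transcript is longer than the downsampled input), which otherwise propagate NaNs:

from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    ctc_zero_infinity=True,
)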

othrif commented 3 years ago

@zaidalyafeai you won't see it with 100 samples; try increasing it to 1,000 or 10,000. I checked manually for a corrupt file that might be causing the NaN, but all files seemed fine.

Update: yes, ctc_zero_infinity=True fixes the issue on a smaller run, and I am now running a longer training. The architecture has a lot of knobs to turn!

Documentation for anyone interested: https://huggingface.co/transformers/model_doc/wav2vec2.html

othrif commented 3 years ago

Closing this since training is in progress!