Current development branch: https://github.com/ARBML/klaam/tree/othrif/feat/ma
@zaidalyafeai
Duration is fixed to less than 15s
I would only include the first n seconds of an audio file. For example, in ADI-5 I only use the first 20 seconds for classification:
https://github.com/ARBML/klaam/blob/330676a562222a10f40faa10a6556c1e5b0d0542/run_classifier.py#L334
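A minimal sketch of that truncation idea (not the repo's exact code; the helper name is made up, and 16 kHz mono input is assumed since that is what wav2vec2 expects):

```python
# Sketch only: keep just the first `max_seconds` of a clip before feature
# extraction, as done for ADI-5 classification.
import librosa

def load_first_seconds(path, max_seconds=20, sampling_rate=16_000):
    # `duration` stops reading after `max_seconds` seconds, and `sr`
    # resamples to the 16 kHz rate wav2vec2 models expect.
    speech, _ = librosa.load(path, sr=sampling_rate, duration=max_seconds)
    return speech
```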
@othrif doesn't MGB3 contain timestamps of 8 seconds?
The histogram above shows the distribution of audio clip durations in seconds; as you can see, almost all clips are shorter than 15 seconds. One cannot really take just the first 20 seconds, since there are only about 100 files and many of them are an hour long or more. Using the provided split of timestamps is the way to go.
Here is my current implementation: https://github.com/ARBML/klaam/blob/othrif/feat/ma/run_recognition.py#L401-L412, which is very similar to what you have.
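For reference, a rough sketch of timestamp-based splitting (not the exact code at that link; the helper name and the `segments` argument are illustrative):

```python
# Rough sketch: cut a long recording into segments using the (start, end)
# timestamps that ship with the MGB3/MGB5 splits.
import soundfile as sf

def split_by_timestamps(path, segments, sampling_rate=16_000):
    # `segments` is an iterable of (start_sec, end_sec) pairs from the dataset.
    speech, sr = sf.read(path)
    assert sr == sampling_rate, "resample beforehand if needed"
    return [speech[int(start * sr):int(end * sr)] for start, end in segments]
```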
That makes sense; I confused the classification task with the recognition task. This implementation should not cause out-of-memory errors, right?
Yes, correct! However, in training I am running into a NaN validation loss. This I haven't solved yet.
I remember some folks on Slack talking about corrupted files causing such errors. Maybe first try limiting the datasets using `max_train_samples` and `max_val_samples`.
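For context, in the Hugging Face example scripts those flags usually just truncate the datasets before training, roughly like this (variable names assumed):

```python
# Assumed behaviour of --max_train_samples / --max_val_samples: keep only the
# first N examples so a quick debug run finishes fast.
if max_train_samples is not None:
    train_dataset = train_dataset.select(range(min(len(train_dataset), max_train_samples)))
if max_val_samples is not None:
    eval_dataset = eval_dataset.select(range(min(len(eval_dataset), max_val_samples)))
```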
Unfortunately, that did not work. I will test with Buckwalter to see if that helps.
I will debug over the weekend.
@othrif which dataset gives you NaNs? I tested with MGB3 and I get good results, at least with 100 train samples.
{'eval_loss': 12.992751121520996, 'eval_wer': 0.9994222992489891, 'eval_runtime': 11.8918, 'eval_samples_per_second': 8.409, 'epoch': 0.57}
Can you try the following script?
```bash
python run_mgb3.py \
--model_name_or_path="facebook/wav2vec2-large-xlsr-53" \
--output_dir=out_dir \
--cache_dir=cache_dir \
--freeze_feature_extractor \
--num_train_epochs="50" \
--per_device_train_batch_size="16" \
--preprocessing_num_workers="1" \
--learning_rate="3e-5" \
--warmup_steps="20" \
--evaluation_strategy="steps" \
--save_steps="1" \
--eval_steps="1" \
--save_total_limit="1" \
--logging_steps="100" \
--do_eval \
--do_train \
--max_train_samples 100 \
--max_val_samples 100
```
The problem seems to happen in MGB5. Setting `ctc_zero_infinity=True` seems to resolve the problem. I added a `run_mgb5.py` file.
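For anyone hitting the same NaNs, the flag lives on the wav2vec2 config in transformers; a hedged sketch of how it can be passed when loading the model (the reduction choice is an assumption, not part of the fix):

```python
# Sketch: zero out infinite CTC losses (which arise when a target transcript is
# longer than the downsampled input) instead of letting them turn into NaNs.
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    ctc_zero_infinity=True,       # the fix discussed above
    ctc_loss_reduction="mean",    # assumption: unrelated to the NaN fix itself
)
```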
@zaidalyafeai you won't see it with 100 samples; try increasing it to 1,000 or 10,000. I checked manually to see whether a corrupt file might be causing the NaN, but all files seemed fine.
Update: yes, `ctc_zero_infinity=True` fixes the issue on a smaller run; running for longer now! The architecture has a lot of knobs to turn!
Documentation for anyone interested: https://huggingface.co/transformers/model_doc/wav2vec2.html
Closing this since training is in progress!
The audio in two of the datasets we are using (MGB3 and MGB5) comes in long recordings of tens of minutes. This is impractical for training on any GPU: long audio sequences cause out-of-memory errors even with a small batch size.
The solution is to split the audio into smaller segments of 15 to 30 seconds, depending on the hardware used (largely GPU memory).
This issue tracks adding functionality to split the audio into smaller chunks that fit on a GPU.
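A minimal sketch of the chunking idea (chunk length, helper name, and the use of soundfile are assumptions, not the final implementation):

```python
# Illustrative only: chop a long recording into fixed-length chunks so each
# training example fits in GPU memory.
import soundfile as sf

def chunk_audio(path, chunk_seconds=20, sampling_rate=16_000):
    speech, sr = sf.read(path)
    assert sr == sampling_rate, "resample beforehand if needed"
    step = int(chunk_seconds * sr)
    return [speech[i:i + step] for i in range(0, len(speech), step)]
```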