I am trying to fine-tune the STT En FastConformer Hybrid Transducer-CTC Large Streaming Multi model.
Issue: WER does not go below 22.5%. I need suggestions on how to improve this, or pointers to anything I might be doing wrong.
Details:
Config:
Things changed from default config:
batch_size: 8
att_context_size: [70,1], because I need to use this model for real-time transcription. I experimented with Online_ASR_Microphone_Demo_Cache_Aware_Streaming.ipynb and found that [70,1] gives the best latency for my real-time transcription use case (lookahead_size = 80 ms and encoder_step_length = 80 ms).
learning_rate: tried different values (0.005, 0.0005, 0.0001) but saw no difference.
min_duration = 5s
max_duration = 20s
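For reference, a minimal sketch of how these overrides can be applied on top of the default YAML; the exact key paths (model.train_ds.*, model.encoder.att_context_size, model.optim.lr) are my assumption of the fastconformer_hybrid_transducer_ctc_bpe_streaming.yaml layout:

```python
# Sketch: apply the changed fields on top of the default streaming config.
# Key paths are assumptions based on the standard NeMo hybrid streaming YAML.
from omegaconf import OmegaConf

cfg = OmegaConf.load("fastconformer_hybrid_transducer_ctc_bpe_streaming.yaml")

cfg.model.train_ds.batch_size = 8
cfg.model.train_ds.min_duration = 5.0   # seconds
cfg.model.train_ds.max_duration = 20.0  # seconds

# Single (left, right) attention context for low-latency streaming.
cfg.model.encoder.att_context_size = [70, 1]

# One of the learning rates tried (0.005 / 0.0005 / 0.0001); none made a difference.
cfg.model.optim.lr = 0.0005

OmegaConf.save(cfg, "finetune_config.yaml")
```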
Tokenizer:
Used the default tokenizer: saved the default tokenizer from the pretrained model and reused it for fine-tuning (1024 SPE Unigram).
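Roughly how I pull the default tokenizer out of the pretrained checkpoint (a sketch: a .nemo file is a tar archive, but the member names matched below and the checkpoint filename are assumptions):

```python
# Sketch: extract the tokenizer files shipped inside the pretrained .nemo checkpoint
# (a .nemo file is a tar archive); member names are matched rather than hard-coded.
import os
import tarfile

nemo_path = "stt_en_fastconformer_hybrid_large_streaming_multi.nemo"  # downloaded checkpoint
out_dir = "default_tokenizer"
os.makedirs(out_dir, exist_ok=True)

with tarfile.open(nemo_path) as archive:
    for member in archive.getmembers():
        # Keep anything that looks like a SentencePiece model or vocabulary file.
        if "tokenizer" in member.name or member.name.endswith((".model", ".vocab", "vocab.txt")):
            archive.extract(member, out_dir)
            print("extracted:", member.name)

# The fine-tuning config then points at this directory, e.g.:
# cfg.model.tokenizer.dir = "default_tokenizer"
# cfg.model.tokenizer.type = "bpe"
```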
Dataset:
train: 120 hrs of English audio (doctor-patient conversations / meeting recordings, containing medical jargon and American accents), audio files 5s-20s.
Converted all transcripts to contain only lowercase English letters, spaces, and apostrophes.
val: 15 hrs of audio in the same format as above.
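The transcript cleaning and manifest creation look roughly like this (a sketch; the helper names and the use of soundfile for durations are illustrative, only the lowercase/space/apostrophe filtering and the standard NeMo manifest fields reflect what I actually do):

```python
# Sketch: normalize transcripts to lowercase a-z, space and apostrophe,
# and write a NeMo-style JSON-lines manifest (audio_filepath / duration / text).
import json
import re

import soundfile as sf  # assumption: soundfile is used to read durations

def clean(text: str) -> str:
    text = text.lower()
    text = re.sub(r"[^a-z' ]", " ", text)      # keep only lowercase letters, apostrophe, space
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace

def write_manifest(samples, manifest_path):
    """samples: iterable of (wav_path, transcript) pairs."""
    with open(manifest_path, "w") as f:
        for wav_path, transcript in samples:
            audio, sr = sf.read(wav_path)
            f.write(json.dumps({
                "audio_filepath": wav_path,
                "duration": len(audio) / sr,
                "text": clean(transcript),
            }) + "\n")
```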
WER of the original model on the val set = 41%. The WER reaches 22-25% in all runs within the first 10-15 epochs and after that it remains almost constant. The maximum number of epochs tried is 90.
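The baseline number was measured with a quick script along these lines (a sketch; the manifest path is illustrative and, depending on the NeMo version, transcribe() may return hypotheses in a different shape):

```python
# Sketch: measure WER of the pretrained model on the validation manifest.
import json

import nemo.collections.asr as nemo_asr
from nemo.collections.asr.metrics.wer import word_error_rate

model = nemo_asr.models.ASRModel.from_pretrained(
    "stt_en_fastconformer_hybrid_large_streaming_multi"
)

audio_files, references = [], []
with open("val_manifest.json") as f:  # illustrative path to the val manifest
    for line in f:
        entry = json.loads(line)
        audio_files.append(entry["audio_filepath"])
        references.append(entry["text"])

# Depending on the NeMo version, transcribe() may return a list of strings,
# Hypothesis objects, or a (best, all) tuple; adjust the unpacking accordingly.
hypotheses = model.transcribe(audio_files)
print("WER:", word_error_rate(hypotheses=hypotheses, references=references))
```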
GPU: V100 16GB
Tensorboard Logs of one of the runs:
UPDATE (30/09/2024)
Tried adding more data to the dataset. The whole dataset is shuffled and then the train and val sets are created. This didn't make any difference; instead, the lowest VAL_WER is now higher with the new dataset.
TRAIN: 220 hrs (~100 hrs more than the previous data)
VAL: 17 hrs
Audio lengths: 1s - 20s (1s, 2s, 3s, ..., 19s, 20s)
learning rate: 0.001
Batch sizes tried: 8 (V100), 32 (A100)
Epochs: 45 (V100), 95 and 300 (A100)
VAL_WER: ~23-24% in all 3 runs
For some reason it looks like the learning rate becomes equal to the minimum learning rate (1e-6) right from the start and then stays there; I'm not sure why. I also don't know why the training loss graph looks the way it does below. In addition, I am not sure what we are training here with this config: CTC, RNNT, or both? (I use decoder_type = "rnnt" when doing inference with this model, as in the sketch below.)
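For completeness, the inference call I refer to looks roughly like this (a sketch; the checkpoint and audio paths are illustrative, and I'm assuming the hybrid model's change_decoding_strategy(decoder_type=...) keyword, which may differ across NeMo versions):

```python
# Sketch: run inference with the RNNT head of the fine-tuned hybrid model.
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.restore_from("finetuned_model.nemo")  # illustrative path

# Hybrid Transducer-CTC models carry both decoders; select the RNNT one here.
model.change_decoding_strategy(decoder_type="rnnt")

print(model.transcribe(["sample.wav"]))  # illustrative audio path
```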
Find tensorboard logs attached below
UPDATE (3/10/2024)
I read in a few answers in the issues that when we use NoamAnnealing, which is the default scheduler in the config (fastconformer_hybrid_transducer_ctc_bpe_streaming.yaml), the lr acts as a multiplier and hence should be set to about 1/10th or 1/5th of the original value.
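To make sense of the multiplier behaviour (and the lr getting pinned at 1e-6 mentioned in the previous update), here is a small sketch of a Noam-style schedule as I understand NoamAnnealing; the d_model = 512 and warmup_steps = 10000 values are assumptions about the defaults:

```python
# Sketch of a Noam-style schedule as I understand NoamAnnealing:
#   effective_lr = lr * d_model**-0.5 * min(step**-0.5, step * warmup_steps**-1.5)
# floored at min_lr once warmup is over. With lr around 0.001 the whole curve sits
# below min_lr = 1e-6, which would explain the lr looking stuck at the minimum.
def noam_lr(step, lr, d_model=512, warmup_steps=10000, min_lr=1e-6):
    scale = d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
    effective = lr * scale
    return max(effective, min_lr) if step > warmup_steps else effective

for lr in (0.001, 0.5, 1.0):
    # The peak of the schedule is reached at the end of warmup (step == warmup_steps).
    print(f"lr={lr}: peak effective lr ~ {noam_lr(10000, lr):.1e}")
# prints roughly: 4.4e-07, 2.2e-04, 4.4e-04
```

If those defaults are roughly right, an lr around 0.001 never rises above the 1e-6 floor, which would match the flat learning-rate curve noted above, while 0.5 or 1.0 give a peak in the 1e-4 range.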
I tried lr = 0.5 (1 A100 GPU, batch_size = 32) and lr = 1.0 (4 A100 GPUs, batch_size = 32).
Min WER = 22% (44 epochs on 1 GPU, 175 epochs on 4 GPUs).
It seems almost stuck and doesn't go below this WER with the new learning rates either.
Apart from this, I also tried the changes below, which didn't make any difference either:
att_context_size = [70,6] instead of [70,1]
accumulate_grad_batches = 8 instead of 1
@nithinraok, can you please have a look into this and provide any suggestions or insights you have based on the provided details? Thank you.
Attaching the logs from latest runs
@nithinraok @titu1994 @elliottnv Any help/suggestions would be appreciated. Thank You.