flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

Training on my own data using pretrained models | Fine tuning #737

Open tumusudheer opened 4 years ago

tumusudheer commented 4 years ago

Hi,

I have my own dataset of about 300 hours, with custom words from a different domain.

I want to train ASR model using TDS Seq2Seq as given here. I will split my long audio files into small ones with max length of about 15 seconds.

I can use the fork command like this, ./Train fork [path to old model.bin] [flags/params...], to fine-tune from the given pretrained models.

Since my data is from a custom domain, I have some unique words and phrases. What steps do I need to perform if I change the token dictionary and lexicon file? If I change my token dictionary and lexicon file, can I still use the pretrained model and fine-tune using the fork command?

Or do I need to perform transfer learning, where I freeze some layers and train only the final layers of the network?

Please advise, Thank you.

tlikhomanenko commented 4 years ago

You have several options here:

  • use the same token set and just extend lexicon file with spelling for your words. Here you need to apply sentencepiece model to your words like

import sentencepiece as spm
import os

sp = spm.SentencePieceProcessor()
sp.Load("wav2letter/recipes/models/sota/2019/am/librispeech-train-all-unigram-10000.model")

new_words = ["tatiana"]
to_file = []

for nbest in [10]:
    for word in new_words:
        wps = sp.NBestEncodeAsPieces(word, nbest)
        for wp in wps: # the order matters for our training
            to_file.append(
                word
                + "\t"
                + " ".join([w.replace("\u2581", "_") for w in wp])
                + "\n"
            )
# add to_file into the lexicon file

In this case you can simply use fork with this new extended lexicon and finetune the whole model.
Several people tried the last option already and they reported that it is working fine.

tumusudheer commented 4 years ago

Hi @tlikhomanenko ,

Thank you very much for your help. To start with, I'll keep the same token set (dictionary) and will add an extended lexicon with spellings for my new words.

I'll follow these steps:

  1. Split and align longer audio files into small chunks (max duration of an audio file <= 20 or 30 sec). I have transcriptions for these files (but they don't have word-level timings). Is there a forced-alignment tool you would recommend that I can use from wav2letter or any other Facebook project, such as libri-light?

  2. Keep the same token set and expand the lexicon file from my data set (spelling of new words) using the example script you've provided.

  3. Then I can start training the AM by using the fork option with the TDS Seq2Seq config (cfg file). Are there any flags I need to change for the fork option, or any other parameters such as warmup (I found some posts saying warmup needs to be changed because of new code changes)? I'll be using only one GPU; do I need to change any params such as nthreads because of this?

  4. Also, while keeping the same token set, if I want to incorporate laughter and noise as mentioned in #457 and #532, how can I add new words such as [laughter] L | to my lexicon, since the token set doesn't have L?

Thank you

lunixbochs commented 4 years ago

My wav2train project can align arbitrary audio for use with wav2letter

tumusudheer commented 4 years ago

@lunixbochs ,

Thanks, I was able to try wav2train on small audio files and it worked. The link to download generate_trie doesn't work any more, though, but I downloaded it manually from the DeepSpeech git repo. Thank you.

@tlikhomanenko ,

I'll keep the same token set and expand the lexicon file from my data set (spellings of new words) using the example script you've provided. I have a few questions:

  1. By keeping the same token set, how do I incorporate laughter, cough, background noise, or background music as mentioned in #457 and #532? How can I add new words such as [laughter] L | or [Noise] N | to my lexicon, since the token set doesn't have L or N?

  2. In my ground-truth transcriptions, I have some abbreviations such as EMI or ABS. Is there a way I can notate them, e.g. as e.m.i or with some separator, so the output of the AM/decoder gives them to me without breaking them into multiple words?

  3. In some audio files, when the user says 'one hundred mg', my ground-truth transcripts currently have '100 mg', but I'll change them to 'one hundred mg'. I just want to make sure this is fine. Then, based on question #2 regarding abbreviations, I can specify mg as m.g or whichever separator you recommend.

Thank you

tlikhomanenko commented 4 years ago

For forced alignment and audio splitting you could also have a look at the voice activity detection tool here: https://github.com/facebookresearch/wav2letter/tree/master/tools (it is similar to the libri-light one we used).

Regarding your questions:

  1. By keeping the same token set, how do I incorporate laughter, cough, background noise, or background music as mentioned in #457 and #532? How can I add new words such as [laughter] L | or [Noise] N | to my lexicon, since the token set doesn't have L or N?

In all these cases you need to extend the token set, and thus cut the last linear layer from the network, add a new one, and train with it. You could at first test transfer learning without them, and then do the coding with the new token set and the changed last linear layer.

  • In my ground-truth transcriptions, I have some abbreviations such as EMI or ABS. Is there a way I can notate them, e.g. as e.m.i or with some separator, so the output of the AM/decoder gives them to me without breaking them into multiple words?

Do you have other words which are not abbreviations but with the same spelling? If not, then just add the abbreviations as normal words (lower-cased) to the lexicon, with the token sequence produced by the word-piece tool. Extra things can be done to preserve the uppercase for them, but then again the token set will be different and the LM should be trained with the same pre-processing. We also used a dot as a separator between letters in an abbreviation; it also looks fine as an approach, but you need to add "." into the token set. Again, here the LM should be trained with the same preprocessing.

3. In some audio files, when the user says 'one hundred mg', my ground-truth transcripts currently have '100 mg', but I'll change them to 'one hundred mg'. I just want to make sure this is fine. Then, based on question #2 regarding abbreviations, I can specify mg as m.g or whichever separator you recommend.

Yep, I did the same thing for data preprocessing. Make sure that you compute WER against targets where text instead of numbers is used.

3. Then I can start training the AM by using the fork option with the TDS Seq2Seq config (cfg file). Are there any flags I need to change for the fork option, or any other parameters such as warmup (I found some posts saying warmup needs to be changed because of new code changes)? I'll be using only one GPU; do I need to change any params such as nthreads because of this?

TDS s2s is trained without any warmup (by default it is deactivated). Warmup is only necessary for the transformer models. One thing in the config you need to fix is --iter=600 - we switched from number of epochs to number of updates. So here you need to set some large number, or compute how many updates per epoch you will have and multiply that by 600 to set --iter (see the sketch below). The nthread parameter is set per GPU (not for the total number of GPUs used), so you don't need to change it.
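For example, a rough sketch of that computation (the numbers here are placeholders for your own dataset, not values from this thread):

hours_of_audio = 300          # your dataset size
avg_utterance_sec = 10        # assumed average chunk length after splitting
batch_size = 8                # --batchsize per GPU
num_gpus = 1

utterances = hours_of_audio * 3600 / avg_utterance_sec
updates_per_epoch = utterances / (batch_size * num_gpus)
target_epochs = 600           # the old config's --iter=600 meant 600 epochs

print(int(updates_per_epoch * target_epochs))  # value to use for the new --iter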

tumusudheer commented 4 years ago

Hi @tlikhomanenko ,

Thank you very much for the responses, I really appreciate it.

In all these cases you need to extend the token set, and thus cut the last linear layer from the network, add a new one, and train with it. You could at first test transfer learning without them, and then do the coding with the new token set and the changed last linear layer.

Got it. As you suggested, I'll first use the same token set so I can just fork the pretrained model, run with my new extended lexicon, and finetune the whole model. In subsequent training runs, I'll expand the token set with additional tokens. Thank you. I will use the models released as part of the inference framework here http://dl.fbaipublicfiles.com/wav2letter/inference/examples/model/ for finetuning. Or do you recommend using the pretrained models given here under Pre-trained language models?

Do you have other words which are not abbreviations but with the same spelling? If not, then just add the abbreviations as normal words (lower-cased) to the lexicon, with the token sequence produced by the word-piece tool. Extra things can be done to preserve the uppercase for them, but then again the token set will be different and the LM should be trained with the same pre-processing. We also used a dot as a separator between letters in an abbreviation; it also looks fine as an approach, but you need to add "." into the token set. Again, here the LM should be trained with the same preprocessing.

I don't have other words with the same spelling as the abbreviations. Thanks, I'll just add them as normal words (lower-cased) as recommended. When I expand my token set, I'll add custom tokens such as "." or "|". Thank you

Yep, I did the same thing for data preprocessing. Make sure that you compute WER against targets where text instead of numbers is used.

Sure.

Thank you very much

tlikhomanenko commented 4 years ago

You can start from our sota/2019 models to see how well you can transfer with the best pre-trained models, and then switch to the inference model if you need real-time performance and better speed.

tumusudheer commented 4 years ago

Hi @tlikhomanenko,

I need to run online inference with the model I trained. It seems StreamingTDSModelConverter.cpp does not support converting SOTA models, so I started with the Streaming ConvNets model.

Question1

Is it possible to use StreamingTDSModelConverter.cpp to serialize SOTA/2019 models for inference ?

Question 2: It seems I need to train the decoder as well after training the AM and LM? May I know what data should be used to train the decoder, and whether any instructions on how to train it are available?

This section has a few commands, but I'm not sure which one to use or what the differences between them are.

Thank you

tlikhomanenko commented 4 years ago

Is it possible to use StreamingTDSModelConverter.cpp to serialize SOTA/2019 models for inference ?

For now it only supports TDS CTC models, so you can only use our sota TDS model.

Question2

For the decoder you need to find the weight coefficients for integrating the LM with the AM. The full doc on the decoder is here: https://github.com/facebookresearch/wav2letter/wiki/Beam-Search-Decoder. You can also have a look at the appendix of https://openreview.net/pdf?id=OSVxDDc360z for how we optimized the decoder parameters (the paper also has info on how we do decoding). A small sketch of such a sweep follows.
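If it helps, a minimal sketch that writes one decoding flags file per (lmweight, wordscore) pair, to be evaluated on a held-out set; the file names, grid values, and the way you invoke the decoder are placeholders, not part of our recipes:

import itertools

base_flags = open("decode.cfg").read()      # your existing decoding flags file
lmweights = [0.5, 1.0, 1.5, 2.0]
wordscores = [-1.0, 0.0, 1.0]

for i, (lmw, ws) in enumerate(itertools.product(lmweights, wordscores)):
    with open(f"decode_grid_{i}.cfg", "w") as f:
        f.write(base_flags + f"\n--lmweight={lmw}\n--wordscore={ws}\n")
    # then run the decoder binary with --flagsfile=decode_grid_<i>.cfg on the
    # held-out set and keep the pair with the best WER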

This section has a few commands, but I'm not sure which one to use or what the differences between them are.

There are different LMs used in the decoding; the LMs differ in size, so the choice depends on your memory restrictions in the online setting. The parameters of the decoder are optimized on the validation set, so you need to have some holdout set on which you do the same.

Bernardo-Favoreto commented 4 years ago

You have several options here:

* use the same token set and just extend lexicon file with spelling for your words. Here you need to apply sentencepiece model to your words like
import sentencepiece as spm
import os

sp = spm.SentencePieceProcessor()
sp.Load("wav2letter/recipes/models/sota/2019/am/librispeech-train-all-unigram-10000.model")

new_words = ["tatiana"]
to_file = []

for nbest in [10]:
    for word in new_words:
        wps = sp.NBestEncodeAsPieces(word, nbest)
        for wp in wps: # the order matters for our training
            to_file.append(
                word
                + "\t"
                + " ".join([w.replace("\u2581", "_") for w in wp])
                + "\n"
            )
# add to_file into the lexicon file

In this case you can simply use fork with this new extended lexicon and finetune the whole model

@tlikhomanenko I am currently trying to do a very similar thing, but the fine-tuning is from a model that I've already trained on my own data, and a few things came to mind:

  1. Can I simply append the new words to the existing lexicon (train, train+dev and decoding) files? I ask this because it seems like the lexicon file is sorted alphabetically and appending new words would mess this up.
  2. It is said that "the order matters for our training". Is this the order of the word pieces being generated or, as I mentioned above, should the lexicon be ordered alphabetically? The problem is that when I use "sort" to sort it alphabetically, the word-piece order also changes a bit, so this is highly correlated with the above.
  3. Could I fine-tune with a lexicon generated solely from the new data, and then add the new words to the decoder lexicon file (thus expanding it)?
  4. For further training (as in my case), is "continue" or "fork" the most appropriate choice here? I am a bit unsure, since with the transformer sota models we have a training schedule, which either way influences how we continue/fork the model (continue simply resumes from the same epoch at which it was stopped, while fork restarts from epoch 1 - and thus restarts the schedule - but using an already trained model).
  5. Regarding the tokens set, should I use the same for both stages (initial + further training)? Moreover, should I create it using only a subset of data (e.g. initial training+dev+test lists), or do I have to pass both initial + fine-tuning lists (initial-train, initial-dev, initial-test, tuning-train, tuning-dev, tuning-test) when creating it?

I think figuring out my sorting questions would help me a lot, because I'm struggling to figure out how I should prepare/modify my data files for further training.

Sorry for the amount of questions.

Thanks!

tlikhomanenko commented 4 years ago

@Bernardo-Favoreto, you are always welcome to ask questions!

1, 2) Yes, the words can be in any order in the lexicon; just preserve the order of the n-best word-piece splits for a particular word (the first one is the most probable segmentation). So sorting can be applied only to the words, but don't change the order in which the segmentations appear for each word (a small sketch follows at the end of these answers).

3) Yep. The only restriction is the token set; it cannot be changed without changing the last layer.

4) Better to use fork, I think (there should also be some reuse of the optimizer), but here you can start with any new optimizer and use any lr settings appropriate for your model.

5) Depends, I don't know a good answer =) It depends on your data, but I would suggest trying the same tokens at first, as it is simple, and then switching to an entirely new token set computed on your data (you then need to change the last layer of the forked model).
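Regarding 1, 2), a small sketch of that sorting (file names are placeholders; it assumes the usual word<TAB>word-pieces lexicon format):

from collections import OrderedDict

def sort_lexicon(in_path, out_path):
    entries = OrderedDict()   # word -> its segmentation lines, in file order
    with open(in_path) as f:
        for line in f:
            word = line.split("\t", 1)[0]
            entries.setdefault(word, []).append(line)
    with open(out_path, "w") as f:
        for word in sorted(entries):      # sorting applies to the words only
            f.writelines(entries[word])   # per-word n-best order is preserved

sort_lexicon("merged.lexicon", "merged_sorted.lexicon")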

Happy to help!

Bernardo-Favoreto commented 4 years ago

Thanks @tlikhomanenko!

The only thing I didn't quite get is regarding number 4. When forking a model, should I treat this as I would when starting a new training run? For instance, I'm currently using transformer models. Should I use warmup when forking (and then adapt the number of steps/lr according to this new subset of data)? I think this is the case, but I'm not sure.

Also, when you say "there should also be some reuse of the optimizer", do you mean forking a model will have this benefit? Again, I think that's the case.

Thank you so much as always!

tlikhomanenko commented 4 years ago

When you fork, it means you don't use the optimizer state and just initialize the network with the previously trained model. So you don't need warmup; you continue training. The difference with continue is that in fork you reset the optimizer state. For SGD, fork and continue will be the same, but for Adam, for example, where you store gradient momentums, the optimizer step will be different.
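As a toy illustration only (a made-up checkpoint layout, not flashlight's actual serialization format):

checkpoint = {
    "model_params": {"w": 0.42},                          # trained weights
    "optimizer_state": {"adam_m": 0.1, "adam_v": 0.01},   # Adam moment estimates
    "epoch": 37,
}

def continue_run(ckpt):
    # continue: restore weights, optimizer state and epoch counter
    return ckpt["model_params"], ckpt["optimizer_state"], ckpt["epoch"]

def fork_run(ckpt):
    # fork: restore weights only; optimizer state and epoch start fresh
    return ckpt["model_params"], {"adam_m": 0.0, "adam_v": 0.0}, 1

print(continue_run(checkpoint))
print(fork_run(checkpoint))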

tumusudheer commented 4 years ago

Hi @tlikhomanenko,

I've prepared the new lexicon using the existing tokens and added the new words using the script you provided here. Then I started the training using fork: ./Train fork [path to old model.bin] [flags/params...]

I'm using the v0.2 branch for both wav2letter and flashlight and using streaming_convnets recipe

But it seems the training is not converging. My network architecture (am_500ms_future_context.arch.txt) and the training logs are attached. Even after epoch 15, training and decode WER are very high. My training configuration flags are as follows:

Do I need to change lr or momentum? Also, I'm not using lr_criterion; do I need to add it to my cfg? Please let me know if I'm doing something wrong.

# Training config for Librispeech using Time-Depth Separable Convolutions
# Replace `[...]`, `[MODEL_DST]`, `[DATA_DST]`, `[DATA_DST_librilight] with appropriate paths
--runname=inference_2019
--rundir=/data/Self/facebook/work/streaming_convnets/run_1007_1/
--tokensdir=/data/Self/facebook/work/streaming_convnets/am/
--archdir=/data/Self/facebook/wav2letter_v0.2/wav2letter/recipes_master/streaming_convnets/librispeech
--train=/data/Self/facebook/work/streaming_convnets/train.lst.pruned
--valid=/data/Self/facebook/work/streaming_convnets/dev.lst.pruned
--lexicon=/data/Self/facebook/work/streaming_convnets/am/merged_v2.lexicon
--arch=am_500ms_future_context.arch
--tokens=librispeech-train-all-unigram-10000.tokens
--criterion=ctc
--batchsize=8
--lr=0.01
--momentum=0.8
--maxgradnorm=0.5
--reportiters=1000
--nthread=6
--mfsc=true
--usewordpiece=true
--wordseparator=_
--filterbanks=80
--minisz=200
--mintsz=2
--maxisz=33000
--enable_distributed=true
--pcttraineval=1
--minloglevel=0
--logtostderr
--onorm=target
--sqnorm
--localnrmlleftctx=300
--lr_decay=10000
--input=wav
--itersave=true
--iter=1000000

Please find my training logs here: 001_config.txt, 001_log.txt, 001_perf.txt

tumusudheer commented 4 years ago

Hi @tlikhomanenko ,

My bad. The model was converging, and I got a WER of ~12 on the test set (without a language model). I've used Test.cpp to evaluate the WER for the test set. Please find the learning curves attached: dev TER/WER/loss (in that order), training WER, and training loss (plots attached).

My training data is a little noisy, so I will clean it and retrain. Also, at present I've used the same tokens given in the original streaming convnets recipe and extended the lexicon with my data, but I'll also try my custom tokens and lexicon and try transfer learning as you mentioned above.

Thank you

tlikhomanenko commented 4 years ago

Nice! What was the error you reported before, where the model didn't converge? How did you solve it?

tumusudheer commented 4 years ago

Hi @tlikhomanenko,

I did not plot the curves initially. As the training started, I observed that the WER for the training data was not going down for some batches. But it seems I was wrong; it was going down slowly. Also, some of my training data has noise, so I'll correct it and retrain.

I just want to verify whether lr = 0.1 or momentum = 0.8 are too high and whether I should reduce them, but the training config params I posted in my previous reply worked fine and converged. Also, what is the --lrcrit param? I don't see it in the streaming convnets config here.

To get started on the LM side, I've trained a kenlm language model (word-based, 3-gram) using my training and dev sets. Then I used SRILM to mix LMs (as you suggested here) with Librispeech's 4-gram.arpa.lower. This is the command I used to train my 3-gram LM:

kenlm/bin/lmplz --text combined_for_lm_train_dev.pruned --arpa self_3-gram.arpa -o 3 --prune 0 0 3 --discount_fallback

I was getting some error without the --discount_fallback option, so I included it to unblock myself.

Since the script prepare_librispeech_wp_and_official_lexicon.py already gives the decoder lexicon, I've added my new word pieces (whatever I added to the AM lexicon) to the decoder lexicon too. Without the LM I'm getting WER ~17%, and with the decoder I'm getting WER ~15%. Here are my decoding params:

# Decoding config for Librispeech
# Replace `[...]`, `[DATA_DST]`, `[MODEL_DST]` with appropriate paths
# for test-other (best params for dev-other)
--am=/data/Self/facebook/train_logs/wav2letter_logs/inference_2019/001_model_iter_036.bin
--tokensdir=/data/Self/facebook/train_228_1007/am
--tokens=librispeech-train-all-unigram-10000.tokens
--lexicon=/data/Self/facebook/train_228_1007/decoder/merged_decoder.lexicon
--datadir=/data/Self/facebook/train_228_1007/
--test=acc_test.list
--uselexicon=false
--sclite=/data/Self/facebook/train_228_1007/test_logs_nosynthetic_decoder_trail_self
--decodertype=wrd
--lmtype=kenlm
--silscore=0
--beamsize=500
--beamsizetoken=100
--beamthreshold=20
--nthread_decoder=16
--lmweight=0.67470637680685
--wordscore=0.62867952607587
--smearing=max
--show
--showletters

My custom language model is currently trained on my acoustic model's training and dev data, and I used the test data to find the best mix-LM params.

I have a couple of questions:

  1. Once you have a language model that you can use for the decoder, is there a way to test how effective the LM is on your test set without running the whole AM and decoder? Some kind of check to see how effective the new LM is on my test data.

  2. Also, I wanted to prepare a token-level LM with more text data and try the lexicon-free approach as you suggested here. May I know how I can train a token-level LM? One idea is to just replace spaces in my data with | and convert words to space-separated tokens, so "hello world" becomes "h e l l o | w o r l d", and then train the n-gram LM as usual. Is this correct?

For this approach to work, do I need to have the token | in my AM token set? Currently I'm using the token set provided in the streaming convnets recipe so I can fine-tune from the published pre-trained model. If I add more text data to my LM, do I need to make sure there are no characters in the text data that are not present in my token set?

Thank you

tlikhomanenko commented 4 years ago

--lrcrit is for the criterion, which can also have trained params: for ASG it is the transition matrix, and for s2s models it is the AM decoder. So for CTC this param is not used (there are no learnt params in CTC).

Answers to the questions:

  1. Yes, LMs are often tested by measuring just perplexity on text, and we saw a strict dependence: if an LM has better perplexity, it performs better in decoding (up to some perplexity, after which improvement may not give any gain in decoding, so this depends on the data). For this, kenlm has the query binary; see the example here https://github.com/facebookresearch/wav2letter/blob/master/recipes/lexicon_free/librispeech/train_ngram_lms.sh#L18 - the tail -n of the result file will contain the perplexity.
  2. Yep, exactly. I also use | at the end, so "hello world" in the word-LM data becomes "h e l l o | w o r l d |" for the char LM, and then you just train kenlm on this transformed text data (a small sketch follows this list).
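A small sketch of that transformation (file names are placeholders), with an optional perplexity check using the kenlm python wrapper, assuming the n-gram model was trained on text in the same char-token form:

def to_char_tokens(sentence):
    # "hello world" -> "h e l l o | w o r l d |"
    return " ".join(" ".join(w) + " |" for w in sentence.lower().split())

with open("lm_text.txt") as fin, open("lm_text.chars.txt", "w") as fout:
    for line in fin:
        fout.write(to_char_tokens(line.strip()) + "\n")

# optional: perplexity on held-out text (pip package "kenlm")
import kenlm
model = kenlm.Model("char_3gram.arpa")
heldout = [to_char_tokens(l.strip()) for l in open("heldout.txt")]
print(sum(model.perplexity(s) for s in heldout) / len(heldout))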

For this approach to work, do I need to have token | in my AM token set. Currently I'm using the token set that is provided in the streaming convnets recipe so I can fine-tune the model with pre-trained published model. If I add more text data to my LM, do I need to make sure there are no characters in the text data that are not present in my token set ?

In this case you need to transform words not into letter sequences but into word pieces, so the token LM should be trained on the same tokens as the AM, because we apply the AM and LM scoring to the same tokens. See the example of how we did this in the case of word pieces: https://github.com/facebookresearch/wav2letter/tree/master/recipes/sota/2019/lm#data-preparation.

Otherwise, for letter tokens you will apply the LM correctly, but for all other word pieces p_LM(unk) will be used, which is bad (you will have a lot of unk, because bare letters are rare compared to word pieces for common words).
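For the word-piece case, a minimal sketch of that mapping with sentencepiece (reusing the model file mentioned earlier in the thread; the corpus paths are placeholders), keeping only the single best segmentation per word:

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("librispeech-train-all-unigram-10000.model")

with open("lm_corpus.txt") as fin, open("lm_corpus.wp.txt", "w") as fout:
    for line in fin:
        pieces = []
        for word in line.strip().lower().split():
            # "_" marks word starts, matching the AM lexicon convention
            pieces += [p.replace("\u2581", "_") for p in sp.EncodeAsPieces(word)]
        fout.write(" ".join(pieces) + "\n")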

tumusudheer commented 4 years ago

Hi @tlikhomanenko

Thank you very much. So if I understand correctly, if I want to use a token-level LM, then I need to train a token-level AM as well? Because if my AM is word-piece based, then the LM also has to be trained on the same tokens my AM has been trained on (meaning this token set may contain words as well)?

From your response here

About lexicon-free approach - here you need to have lexicon-free decoder (AM can be the same), so you have token-level LM and apply it at each decoding step + no restriction on lexicon. This could improve OOV recognition.

I'm under the assumption that I can keep my current AM architecture (streaming convnets) and then have a token-level LM (and use it with the use_lexicon=false lexicon-free decoder). Please correct me if I'm wrong.

Or as per this comment:

My advise is to have your own token set, not the Librispeech one (or at least analyze the intersection of lexicons), so you can use our pretrained models and remove the last layer, add new one and finetune to predict your token set. Here of course you need extra work and coding on your own

When I start with my own token set, I can have both individual letters and words as tokens from my data, and then after training a streaming convnets AM, I can train a token-level LM and use it with the lexicon-free decoder? Is this the best approach to use a token-level LM with my existing AM?

Thank you

tumusudheer commented 4 years ago

Hi @tlikhomanenko

Sorry, Just wanted to know little bit more details for the following questions:

  1. If I want to use a token/character-level LM, do I need to have a token-level acoustic model (AM) as well? So I can't use a word-based AM with a token-level LM?

  2. You suggested I try a token-level LM with the lexicon-free decoder for better prediction of out-of-vocabulary words. Based on my first question, if I need to train a token-level AM (to use with the token-level LM), then my training data for the acoustic model should also be "h e l l o | w o r l d" for the sentence "hello world"? For the token-level LM, you wanted me to put "|" at the end of the sentence, so it becomes "h e l l o | w o r l d |". Do I need to put "|" at the end of the line for the token-level acoustic model (AM) as well (for training)?

Thank you.

tlikhomanenko commented 4 years ago

Sorry for delay,

So if I understand correctly, if I want to use a token-level LM, then I need to train a token-level AM as well? Because if my AM is word-piece based, then the LM also has to be trained on the same tokens my AM has been trained on (meaning this token set may contain words as well)?

Yes

I'm under the assumption that I can keep my current AM architecture (streaming convnets) and then have a token-level LM (and use it with the use_lexicon=false lexicon-free decoder). Please correct me if I'm wrong.

Yes

  • If I want to use a token/character-level LM, do I need to have a token-level acoustic model (AM) as well? So I can't use a word-based AM with a token-level LM?

If you use a token-level LM, the tokens you use in the LM and AM should be the same (otherwise you don't know how to segment). Either we have a notion of words (in that case your AM can use any tokens, because your lexicon tells you which sequence of tokens forms a word, and then you can apply a word-level LM), or we have no notion of words, and then the tokens should be the same for the AM and LM.

If you train the AM on word pieces, then your LM data should be prepared with the same word pieces, so you use your training lexicon to map words into word pieces. The same goes for letter tokens.

2. You suggested I try a token-level LM with the lexicon-free decoder for better prediction of out-of-vocabulary words. Based on my first question, if I need to train a token-level AM (to use with the token-level LM), then my training data for the acoustic model should also be "h e l l o | w o r l d" for the sentence "hello world"? For the token-level LM, you wanted me to put "|" at the end of the sentence, so it becomes "h e l l o | w o r l d |". Do I need to put "|" at the end of the line for the token-level acoustic model (AM) as well (for training)?

For AM training, the lexicon is used to map words into tokens. So if you provide a lexicon mapping each word into its letter sequence + "|" at the end, your training sequences will be the same as the ones you use for the LM. The simplest thing is to take all words in both the AM and LM data, prepare their mapping into tokens (in the case of letter tokens, you add the sil token at the end to get word separation), then use this lexicon to train the AM and use the same lexicon to prepare the LM data on which you will train the LM.

tumusudheer commented 4 years ago

Hi @tlikhomanenko

Thank you very much, I understand better now. Just to confirm, if I want to train a token-level LM and a token-level AM: let's say I have two audio files for training, I'll prepare my train.list as follows:

train_1 1.wav <duration_in_milliseconds> hello world
train_2 2.wav <duration_in_milliseconds> how are you

My token set: all of a-z, 0-9, ', and |. My lexicon:

hello h e l l o |
world w o r l d |
how h o w |
are a r e |
you y o u |

I will put all the words (present in my AM and LM data) and their token sequences into a single lexicon, and will use the same lexicon for both AM and LM training. And my training data (.txt) for the LM will be:

h e l l o | w o r l d | 
h o w | a r e | y o u | 

Please let me know if I'm making any mistakes in my example and data preparation.
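In case it helps, here is a small script sketch I would use to generate this lexicon and the LM text from the train list above (assuming the space-separated list format shown, with the transcription as everything after the third field):

def transcripts(lst_path):
    # list format from above: <id> <audio path> <duration> <transcription>
    with open(lst_path) as f:
        for line in f:
            yield line.rstrip("\n").split(" ", 3)[3]

words = set()
with open("lm_train.txt", "w") as lm_out:
    for text in transcripts("train.list"):
        toks = []
        for w in text.lower().split():
            words.add(w)
            toks.append(" ".join(w) + " |")      # "hello" -> "h e l l o |"
        lm_out.write(" ".join(toks) + "\n")      # "h e l l o | w o r l d |"

with open("letters.lexicon", "w") as lex_out:
    for w in sorted(words):
        lex_out.write(w + "\t" + " ".join(w) + " |\n")   # "hello\th e l l o |"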

tlikhomanenko commented 4 years ago

Yep, correct! For LM training don't use the dev/test sets from the AM data (you can use the train transcriptions; otherwise you will over-fit and cannot notice/measure it).