flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

Using user hints/phrases at inference time (runtime) to improve accuracy (adding more vocabulary at runtime) #808

Open tumusudheer opened 4 years ago

tumusudheer commented 4 years ago

Question

I'm using the Streaming ConvNets model to train on my own dataset (English), which is from a different domain, mostly finance-related data. I'm also doing online inference using the example SimpleStreamingASRExample.cpp.

Question 1: Words that are in the lexicon

At inference time (runtime), is there a way to increase the probability of detecting/decoding words/phrases related to a particular user's data, provided as context hints at runtime/inference time? E.g., a user can create multiple expense types, and when that user is interacting, I want to give higher probability to their expense types rather than to every user's expense types. These hints could be words or phrases, but they are present in the lexicon used while training the AM and LM models.

Something similar to Google Cloud speech adaptation.

Question 2: Words that may not be present in the lexicon

A user can use any entity name (a street name or a person's name) that may not be present in the lexicon. If I know the full list of entity names (person names, street names, and addresses) the user can potentially use (this list will not be available at training time), is there a way I can provide them to the decoder/lexicon as hints (additional vocabulary), or add an additional decoder, so that those entities are detected accurately with high probability?

vineelpratap commented 4 years ago

Hi, we currently don't support speech adaptation like the example you mentioned. How are you creating the word-piece tokens, lexicon, and language model for your workflow? You can try to make sure they come from the domain you are working on instead of the LibriSpeech-based ones.

tumusudheer commented 4 years ago

Hi @vineelpratap,

Thank you very much.

I'm keeping the token set the same but expanding the lexicon with new words from my domain-specific data, and I'm training an n-gram language model on the LibriSpeech data + my domain data using KenLM.

But I'm having difficulty getting entity names recognized, because it is very difficult to have all possible person names or street names (addresses or locations) in the lexicon, e.g. names like Fiona Walker or Milda Bradwell, or first names like Ezekiel.

At training time I will not have this data, but at inference time I will have a list of entity names, provided by the user, that may not exist in my training data. I'm wondering whether I can use this list, obtained only at inference time, to recognize them accurately. Obviously it is difficult for any ASR system to recognize all possible terms accurately. If there are any suggestions on how I can make use of the user data at inference time to increase recognition accuracy, with some modifications to the code/decoder, that would be great.

Thank you,

vineelpratap commented 4 years ago

but expanding the lexicon with new words

Are you using the sentencepiece model trained on LibriSpeech data to expand the lexicon? If you are using your own custom training data, it might be better to build the sentencepiece model and the corresponding token set and lexicon from that training data. I think it should improve the WER on your dev set as well.

it is very difficult to have all possible person names or Street names (address or location) in the lexicon

One option you have is to look into the paper here https://github.com/facebookresearch/wav2letter/tree/master/recipes/lexicon_free for how to recognize new words. We don't support streaming inference on that recipe directly at the moment, though.

I'm training n-gram language model with LibriSpeech data + my domain data using KenLM

One issue with this approach is that if the amount of text in the LibriSpeech data >>> the text in your domain data, then the final LM will not have much information about the domain. Instead, you should look into creating a mixture LM - see the -mix-lm option in https://cmusphinx.github.io/wiki/tutoriallmadvanced/.

tumusudheer commented 4 years ago

Hi @vineelpratap

Thank you very much.

Are you using the sentencepiece model trained on LibriSpeech data to expand the lexicon? If you are using your own custom training data, it might be better to build the sentencepiece model and the corresponding token set and lexicon from that training data. I think it should improve the WER on your dev set as well.

I'm building the sentencepiece model on my own dataset using the example code given here by @tlikhomanenko. Currently I'm only expanding the lexicon with my domain-data words and keeping the same token file, because I want to finetune the AM (using the fork command) from an existing model rather than training from scratch. I have probably 100-200 hours of audio data to start my AM training at this time, so rather than training the acoustic model from scratch, I thought it would be better to finetune from an existing model. Let me know if this is not the right approach.
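
For reference, the lexicon expansion itself looks roughly like this (a sketch using the sentencepiece Python API; the file paths are placeholders, and the word-piece spelling format has to match the one the original lexicon already uses):

```python
import sentencepiece as spm

# Sketch: spell new domain words with the existing word-piece model and
# append them to the lexicon. "librispeech_wp.model", "domain_words.txt"
# and "lexicon.txt" are placeholder paths.
sp = spm.SentencePieceProcessor(model_file="librispeech_wp.model")

with open("domain_words.txt") as fin, open("lexicon.txt", "a") as fout:
    for word in sorted({line.strip().lower() for line in fin if line.strip()}):
        pieces = sp.encode(word, out_type=str)
        fout.write(word + "\t" + " ".join(pieces) + "\n")
```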

One option you have is to look into the paper here https://github.com/facebookresearch/wav2letter/tree/master/recipes/lexicon_free for how to recognize new words. We don't support streaming inference on that recipe directly at the moment, though.

Online streaming is very important for me, which is why I'm using streaming convnets at this point to build my first model. Once the first model is ready and working, I can experiment with different architectures and models to improve accuracy.

One issue with this approach is that if the amount of text in the LibriSpeech data >>> the text in your domain data, then the final LM will not have much information about the domain. Instead, you should look into creating a mixture LM - see the -mix-lm option in https://cmusphinx.github.io/wiki/tutoriallmadvanced/.

Oh, thank you very much, I didn't know about this. Currently wav2letter only supports either KenLM or fairseq LM models, right? Do you know how to build this mixture LM using KenLM? On a quick search, I could only find this thread, but it doesn't have complete details. I haven't built my LM yet, but I'm about to follow the instructions in recipes/models/sota/2019/lm (here) for data preparation. Step 1: run prepare_wp_data.py against the existing LibriSpeech text + my own text. Step 2: use the commands given under the TRAINING section to build a 6-gram LM with KenLM. Please let me know how I can build a mixture LM using KenLM, if there is a way to do it.
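
For Step 2, I'm planning something roughly like this (a sketch that calls KenLM's lmplz and build_binary binaries from Python; the corpus and output paths are placeholders, and the recipe's exact pruning/memory flags are omitted):

```python
import subprocess

# Step 2 (sketch): train a 6-gram LM on the word-piece-prepared corpus with KenLM,
# then convert the ARPA file to KenLM's binary format for faster loading.
# "wp_corpus.txt", "domain_6gram.arpa" and "domain_6gram.bin" are placeholder paths.
with open("wp_corpus.txt") as fin, open("domain_6gram.arpa", "w") as fout:
    subprocess.run(["lmplz", "-o", "6"], stdin=fin, stdout=fout, check=True)

subprocess.run(["build_binary", "domain_6gram.arpa", "domain_6gram.bin"], check=True)
```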

tlikhomanenko commented 4 years ago

@tumusudheer

You can do the mixing of LMs in SRILM: it takes two ARPA files and mixes them. Then you can convert the mixed ARPA file to a binary with KenLM and use it in w2l.

About the lexicon-free approach - here you need a lexicon-free decoder (the AM can be the same), so you have a token-level LM and apply it at each decoding step, with no restriction from a lexicon. This could improve OOV recognition.

My advice is to have your own token set, not the LibriSpeech one (or at least analyze the intersection of the lexicons), so that you can use our pretrained models, remove the last layer, add a new one, and finetune it to predict your token set. Here, of course, you need extra work and coding on your own. About support for a specific list of entities: you could probably work on extending the current decoding in the inference pipeline for your own purposes.

tumusudheer commented 4 years ago

Hi @tlikhomanenko ,

Thank you very much.

You can do the mixing of LMs in SRILM: it takes two ARPA files and mixes them. Then you can convert the mixed ARPA file to a binary with KenLM and use it in w2l.

Great. So I'll prepare one LM from my domain data (say A.arpa) and have the LibriSpeech LM (say B.arpa), then prepare a mixed LM using SRILM, which will give me another .arpa file (say C.arpa), and finally convert that C.arpa to a binary using KenLM and use it. Nice, thank you.
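
Concretely, I'm assuming something like the following (a sketch; it assumes SRILM's ngram and KenLM's build_binary are on the PATH, and the 0.5 interpolation weight is just a placeholder to tune):

```python
import subprocess

# Interpolate the domain LM (A.arpa) with the LibriSpeech LM (B.arpa) using SRILM.
# -lambda is the weight given to the main model passed via -lm; 0.5 is a placeholder.
subprocess.run([
    "ngram", "-order", "6",
    "-lm", "A.arpa",
    "-mix-lm", "B.arpa",
    "-lambda", "0.5",
    "-write-lm", "C.arpa",
], check=True)

# Convert the mixed ARPA file to KenLM's binary format for use with w2l.
subprocess.run(["build_binary", "C.arpa", "C.bin"], check=True)
```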

About the lexicon-free approach - here you need a lexicon-free decoder (the AM can be the same), so you have a token-level LM and apply it at each decoding step, with no restriction from a lexicon. This could improve OOV recognition.

Sure, I'll try this. I'll have to convert/serialize the models and also change the inference platform to run the lexicon-free models.

My advice is to have your own token set, not the LibriSpeech one (or at least analyze the intersection of the lexicons), so that you can use our pretrained models, remove the last layer, add a new one, and finetune it to predict your token set. Here, of course, you need extra work and coding on your own.

Yeah, eventually my plan is to do this. I think the code provided in #507 (here) will be a good starting point for the code changes, and I will post here if I have any other questions about them.

About support for a specific list of entities: you could probably work on extending the current decoding in the inference pipeline for your own purposes.

I'll have some entity names available at inference time (with a maximum of maybe around 5,000 entities). If it is possible to make use of them at inference time by making some code changes, that would be awesome; I can certainly make the changes to the inference pipeline. It would be great if you could describe what/how I should change (mostly verbally). I think you mentioned one approach here: you suggested training an additional n-gram LM on these special entities/phrases. Then I can add this additional LM to the decoder, along with its weight, and during decoding we can optimize AM_score + alpha * lm_score + beta * lm_special_score + gamma * word_score.
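
Just to spell out the scoring I have in mind, here is a rough Python sketch (pseudocode of the idea only, not the actual w2l C++ decoder; all names are hypothetical):

```python
def hypothesis_score(am_score, lm_score, bias_lm_score, n_words,
                     alpha, beta, gamma):
    """Combined score for one partial hypothesis during beam search.

    am_score      -- acoustic model score of the hypothesis
    lm_score      -- log-prob from the main (mixture) LM
    bias_lm_score -- log-prob from the small LM trained on the user's entities
    n_words       -- number of words emitted so far (for the word insertion bonus)
    alpha, beta   -- LM weights; gamma -- word insertion score
    """
    return am_score + alpha * lm_score + beta * bias_lm_score + gamma * n_words
```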

I'll try to implement this approach. Let me know if you have any other ideas, and I'll try them as well.

Thank you

tlikhomanenko commented 4 years ago

One of the options which I am thinking is:

  • you can add all entities to your lexicon with their unigram probability, so now potentially you can infer them
  • their probability should be higher than the n-grams you have in the standard LM, so that the unigram is preferred whenever the acoustic model is predicting them

Not sure here, I don't have experience with this; also, a good thing to do is to look for papers - I am sure people have published some approaches on this.

bharat-patidar commented 4 years ago

Hi @tlikhomanenko, is your suggestion similar to what I found here? (That approach adds custom vocabulary or domain-specific words to a Kaldi model.)

One of the options which I am thinking is:

  • you can add all entities to your lexicon with their unigram probability, so now potentially you can infer them
  • their probability should be higher than the n-grams you have in the standard LM, so that the unigram is preferred whenever the acoustic model is predicting them

Not sure here, I don't have experience with this; also, a good thing to do is to look for papers - I am sure people have published some approaches on this.

tlikhomanenko commented 4 years ago

Yep, sort of; however, we don't have WFSTs, so it should be simpler, and you can operate on the LM model and the lexicon only.
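
Roughly, the entity handling could look like this (just a sketch; the word-piece model path, file names, and entity list are placeholders, and the probability boost itself would then come from the LM you build over this text):

```python
import sentencepiece as spm

# Sketch: given the user's entity list at inference time, (1) add each entity word
# to the lexicon with its word-piece spelling, and (2) dump the entities to a text
# file from which a small biasing LM (or extra unigram mass) can be built.
# "wp.model", "lexicon.txt", "entities_corpus.txt" and the list are placeholders.
entities = ["fiona walker", "milda bradwell", "ezekiel"]

sp = spm.SentencePieceProcessor(model_file="wp.model")

with open("lexicon.txt", "a") as lex:
    for word in sorted({w for e in entities for w in e.split()}):
        lex.write(word + "\t" + " ".join(sp.encode(word, out_type=str)) + "\n")

with open("entities_corpus.txt", "w") as corpus:
    corpus.write("\n".join(entities) + "\n")
```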

abhinavkulkarni commented 4 years ago

Hey @tumusudheer,

About personalizing the LM for every user (LM biasing), you may want to take a look at Google's RNN-T paper (the section titled "Contextual Biasing"). This is an earlier version of the model that runs on Android devices and is able to recognize names from a user's contacts, etc.

I am not sure how easy it would be to incorporate something like this into KenLM or ConvLM; however, I wanted to point this out to you.