Vocinity / context-scorer


Preventing Out Of Vocabulary (OOV) Failures #1

Open isgursoy opened 2 years ago

isgursoy commented 2 years ago

OOV is our second biggest enemy. In the best case it renders the context scorer harmlessly useless; and being useless is not acceptable. Soundex and Double Metaphone matching are robust against OOV, but they provide quite low accuracy compared to phoneme matching. So the phoneme dictionary should be able to grow dynamically, and the size of the similarity maps should not be limited to a distance of 2.
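For reference, here is a minimal sketch of classic American Soundex (my own illustration, not the project's code). It shows how coarse such codes are next to phoneme matching: unrelated names collide on the same code, while identically pronounced names can still receive different codes.

```python
def soundex(word: str) -> str:
    """Classic 4-character American Soundex code (illustrative sketch)."""
    codes = {}
    for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")]:
        for ch in letters:
            codes[ch] = digit
    word = word.upper()
    first = word[0]
    prev = codes.get(first, "")
    digits = []
    for ch in word[1:]:
        d = codes.get(ch, "")
        if d and d != prev:
            digits.append(d)
        if ch not in "HW":          # H/W do not reset the previous code
            prev = d
    return (first + "".join(digits) + "000")[:4]

# Dissimilar names collide, while identically pronounced names differ:
print(soundex("ROBERT"), soundex("RUPERT"))        # R163 R163
print(soundex("CATHERINE"), soundex("KATHERINE"))  # C365 K365
```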

Here are the tasks:

  1. Convert the file-based, constant phoneme dictionary & similarity maps into dynamic databases. [@tahamr83]
  2. Increase the similarity map distance limit to at least size_of_the_longest_word_in_dictionary+1. [@sind4l]
  3. Have an OOV table to report missing words when asked. [@tahamr83]
  4. ~Integrate a Grapheme-To-Phoneme (g2p) DL model into the homonym composer class and optimize its latency as much as possible. In case of OOV input: run the model, [@isgursoy]~
  5. insert the phoneme vector of this new, unknown word into the dictionary, [@sind4l]
  6. calculate cross distances of this new word against all words we already have, and update the similarity map database, [@sind4l]
  7. asynchronously return and quickly use the partial result up to the max_distance requirement of the current query. [@sind4l]
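A rough sketch of how steps 4-7 could fit together (the names `HomonymDictionary` and `handle_oov` and the in-memory dicts are my own, not the repo's API): on an OOV hit, run g2p, store the phoneme vector, and compute only the neighbours needed for the current query's max_distance, leaving the full cross-distance pass to a background job.

```python
def phoneme_distance(a, b):
    """Plain Levenshtein edit distance over phoneme sequences."""
    dp = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, pb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (pa != pb))
    return dp[-1]

class HomonymDictionary:  # hypothetical name, not the repo's class
    def __init__(self, phonemes):
        self.phonemes = dict(phonemes)   # word -> phoneme vector
        self.similarity = {}             # word -> {neighbour: distance}

    def handle_oov(self, word, g2p, max_distance):
        self.phonemes[word] = g2p(word)              # steps 4-5: synthesize & insert
        partial = {}
        for other, phones in self.phonemes.items():  # step 7: partial pass only
            if other == word:
                continue
            d = phoneme_distance(self.phonemes[word], phones)
            if d <= max_distance:
                partial[other] = d
        self.similarity[word] = partial  # step 6 would extend this asynchronously
        return partial
```

For example, `handle_oov("HALO", ...)` against a dictionary containing HELLO and YELLOW returns only the neighbours within the query's max_distance, without waiting for the full similarity map update.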

Here are the details:

  1. The files are there (https://drive.google.com/drive/u/1/folders/1L0eMmb8pBGcSqSN6xYap_dUJ5O_hyIQI). Similarity maps are binary-encoded JSON files; they are created like https://github.com/Vocinity/context-scorer/blob/06502b0e2eb2fe5de31c89bee3206dfd71951f76/src/Homophonic-Alternatives.cpp#L924
  2. Similarity maps are imported from txt files and kept in memory. The reason for the distance=2 limit of the similarity map was RAM. https://github.com/Vocinity/context-scorer/blob/06502b0e2eb2fe5de31c89bee3206dfd71951f76/grpc-server/src/main.cpp#L527
  3. The perplexity computation model (GPT) should also learn those new words.
  4. https://github.com/Vocinity/context-scorer/blob/06502b0e2eb2fe5de31c89bee3206dfd71951f76/src/Homophonic-Alternatives.cpp#L155 Some state-of-the-art pretrained g2p models: https://github.com/CiscoDevNet/g2p_seq2seq_pytorch https://github.com/cmusphinx/g2p-seq2seq https://github.com/hajix/G2P https://github.com/roedoejet/g2p
  5. See the details of item 1.
  6. Because this will take time for the size_of_the_longest_word_in_dictionary+1 neighbours of each dictionary item, don't block the request after you have calculated the needed amount of the distance map for this request. https://github.com/Vocinity/context-scorer/blob/06502b0e2eb2fe5de31c89bee3206dfd71951f76/grpc-server/src/context-scorer.proto#L37
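To make item 1 concrete, here is a toy illustration of what "dynamic database" could mean (sqlite3 is chosen purely for the sketch; the schema and table names are my assumptions, not the project's design). Unlike the constant in-memory map, rows can be inserted at any time, so the distance limit is no longer fixed by RAM at load time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE phonemes (word TEXT PRIMARY KEY, phones TEXT);
CREATE TABLE similarity (
    word TEXT, neighbour TEXT, distance INTEGER,
    PRIMARY KEY (word, neighbour)
);
""")

# New words and new neighbour pairs can be appended incrementally.
conn.executemany("INSERT INTO similarity VALUES (?, ?, ?)", [
    ("HELLO", "YELLOW", 2),
    ("HELLO", "HALO", 1),
])

# A query only pulls the neighbours within the requested distance.
rows = conn.execute(
    "SELECT neighbour FROM similarity WHERE word = ? AND distance <= ?",
    ("HELLO", 1)).fetchall()
print(rows)  # [('HALO',)]
```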
tahamr83 commented 2 years ago

For clarification on item 3, "Have an OOV table to report missing words when asked":

I assume that by vocabulary you mean all the words our system might encounter, more specifically the words the ASR will most likely deal with, e.g. "What's the price", "Hello Valerie", "I want echelon row-s", "I want to get information about rowah". The vocabulary will consist of the set of all the words we have seen in these examples; we can build a candidate vocabulary list by taking the set of words obtained from such utterances.

We maintain a set of all such words in some database, and whenever we encounter new words, we scan the system and report them back to you whenever you ask for them.
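Sketched in code, that flow might look like this (all names illustrative, with a small hand-made vocabulary standing in for the real reference dictionary):

```python
class OOVTable:  # illustrative, not the repo's class
    def __init__(self, known_words):
        self.known = {w.upper() for w in known_words}
        self.missing = set()

    def observe(self, utterance):
        """Record every word the reference vocabulary does not contain."""
        for word in utterance.upper().split():
            if word not in self.known:
                self.missing.add(word)

    def report(self):
        """Return the missing words when asked."""
        return sorted(self.missing)

table = OOVTable(["what's", "the", "price", "hello", "valerie", "i", "want"])
table.observe("I want echelon")
table.observe("hello rowah")
print(table.report())  # ['ECHELON', 'ROWAH']
```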

Is that correct? Pardon me, but it's a bit difficult to follow the tasks you have outlined. If you could elaborate a bit more, it would be really helpful.

isgursoy commented 2 years ago

Totally correct. One addition: that database should use cmudict-0.7b as its reference for this project now.

We maintain a set of all such words in some database and whenever we have new words, we scan the system and report them back to you whenever you ask for them.

zeynepVocinity commented 2 years ago

word_list.txt Hi @isgursoy, I made a word list of 4765 words. I didn't know how to put them into the cmudict-0.7b.txt file. Is it possible to have the same word more than once in the vocabulary list? Should I check whether each word is already in it?

isgursoy commented 2 years ago

I made a word list of 4765 words. I didn't know how to put them into the cmudict-0.7b.txt file. Is it possible to have the same word more than once in the vocabulary list? Should I check whether each word is already in it?

Is it the unique list of words from 615dcbc2f1223f001a9c3c9c.csv?

  1. You should find the words from 615dcbc2f1223f001a9c3c9c.csv which don't exist in cmudict-b.
  2. Then you should synthesize phonemes for each missing word using one of the g2p models listed under step 4. I have not evaluated their accuracy, so I don't know which one is better. I would do that for all known words from the 194K, 235K, and 466K plain English dictionaries in gdrive/Development/models/context-scorer/homonym-generator/backup, definitely without duplicates.
  3. There can be multiple dialects, differentiated using suffixes (1), (2), (3), and so on. Order does not matter for us.
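On the duplicate question above: cmudict marks pronunciation variants with those (1), (2), ... suffixes, so the same base word can legitimately appear several times. A small parsing sketch (the two-space separator and the ;;; comment prefix follow cmudict-0.7b's file format; the function name is my own):

```python
import re

def parse_cmudict(lines):
    """Map each base word to the list of its pronunciation variants."""
    entries = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):   # skip comment lines
            continue
        word, phones = line.split("  ", 1)       # cmudict uses two spaces
        base = re.sub(r"\(\d+\)$", "", word)     # TOMATO(1) -> TOMATO
        entries.setdefault(base, []).append(phones.split())
    return entries

sample = [
    ";;; a comment line",
    "TOMATO  T AH0 M EY1 T OW2",
    "TOMATO(1)  T AH0 M AA1 T OW2",
]
print(len(parse_cmudict(sample)["TOMATO"]))  # 2 pronunciation variants
```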

Please don't upload or modify anything under gdrive/Development/models.

zeynepVocinity commented 2 years ago

Hi @isgursoy

I'm using KNIME to run inference on the dataset.

[screenshot: chart]

There are three word lists:

[screenshot: Screen Shot 2022-01-20 at 16 31 13]

I joined the 3 dictionaries and compared them with our word list. There are 375 missing words in total. I extracted the missing words from the 615dcbc2f1223f001a9c3c9c.csv file.

Missing Words: missing_words.csv

In total, the word list contains 518,363 words. BigDictionaryWithOurCorpus.csv
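The join-and-compare step above can be expressed as plain set operations (KNIME was used for the real run; this is just the equivalent logic, with toy data in place of the actual files):

```python
def find_missing(corpus_words, *dictionaries):
    """Return corpus words that none of the dictionaries contain."""
    known = set()
    for d in dictionaries:
        known |= {w.upper() for w in d}
    return sorted({w.upper() for w in corpus_words} - known)

# Toy stand-ins for 615dcbc2f1223f001a9c3c9c.csv and the joined dictionaries:
corpus = ["hello", "rowah", "echelon", "valerie"]
dict_a = ["HELLO", "ECHELON"]
dict_b = ["VALERIE"]
print(find_missing(corpus, dict_a, dict_b))  # ['ROWAH']
```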


These cover most of the g2p models. They are all more or less the same, so we can just pick the code we like (PyTorch, probably).

ciscodev_missingword_phonem.txt

Also, it is quite easy to intervene and add rules to the code written for the model (g2p.py): https://github.com/mdda/g2p/blob/master/g2p_en/g2p.py

For example:
WORLD => W Y L D (roedoejet/g2p)
WORLD => W ER L D (CiscoDevNet/g2p_seq2seq_pytorch)
WORLD => W ER1 L D (cmudict-0.7b.txt)
WORLD => W ER1 L D (mdda/g2p)
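Note the stress digits in the examples above: cmudict and mdda/g2p emit them (ER1), while the Cisco model does not (ER). When comparing outputs across models, stripping the trailing digit first makes the comparison fair (a small helper of my own, not from any of the listed repos):

```python
import re

def strip_stress(phones):
    """Drop trailing stress digits: ['W','ER1','L','D'] -> ['W','ER','L','D']."""
    return [re.sub(r"\d$", "", p) for p in phones]

cisco = "W ER L D".split()      # no stress markers
cmudict = "W ER1 L D".split()   # stress markers included
print(strip_stress(cmudict) == cisco)  # True
```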

The results for the models are below.

missing_words_model_result.csv

isgursoy commented 2 years ago

I could not follow the first schema, but you are proposing to proceed with the mdda/g2p synthesizer, right? Yes, as I see from https://github.com/Vocinity/context-scorer/files/7905547/missing_words.csv (it's good to use a comma delimiter for csv files), you are right.

zeynepVocinity commented 2 years ago

I had 2 different models for phoneme synthesis and needed to decide which of the two is better.

I couldn't be sure by eyeballing a small dataset, so I prepared a test set and scored the phonemes synthesized by the two models against phonemes prepared by hand.

I took 4029 random words and their phonemes from cmudict-0.7b.txt, which I am sure are correct since that set was prepared by hand. Test set: it contains samples of words from A to Z.

[screenshot: Screen Shot 2022-01-21 at 15 50 51]

Using this list of words, I ran both models and scored their outputs against the actual test word list. Test Set: cmudict-0.7b_testSample.csv
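The scoring step could be as simple as exact-match accuracy against the hand-checked cmudict entries (a sketch of one reasonable metric, with toy data; the actual scoring used for the results below may differ):

```python
def exact_match_accuracy(reference, predicted):
    """Fraction of words whose predicted phonemes equal the reference exactly."""
    correct = sum(1 for word, phones in reference.items()
                  if predicted.get(word) == phones)
    return correct / len(reference)

reference = {"WORLD": ["W", "ER1", "L", "D"],
             "HELLO": ["HH", "AH0", "L", "OW1"]}
predicted = {"WORLD": ["W", "ER1", "L", "D"],
             "HELLO": ["HH", "EH0", "L", "OW1"]}  # one wrong phoneme
print(exact_match_accuracy(reference, predicted))  # 0.5
```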

https://github.com/mdda/g2p

Below you can see the scores comparing the actual word phonemes with the model-synthesized ones.

[screenshot: Screen Shot 2022-01-21 at 15 53 14]

https://github.com/CiscoDevNet/g2p_seq2seq_pytorch

[screenshot: Screen Shot 2022-01-21 at 15 54 52]

Looking at these results, I can say that the mdda model is more accurate. In addition, its phoneme output contains stress numbers such as 0, 1 and 2 on the syllables, as in cmudict-0.7b.txt.

The Cisco model makes mistakes on over a thousand words. The mdda model, on the other hand, got about five hundred words wrong. Validation results:
cmu_mdda_validation_result.csv

cmu_cisco_validation_result.csv

I printed out the correct and misspelled words for both models and will check the wrong ones. In summary, I decided that it would be best to continue with the mdda model.

isgursoy commented 2 years ago
  1. Integrate Grapheme-To-Phoneme (g2p) DL model into homonym composer class and optimize its latency as much as possible. In case of OOV inputs; run model, [@isgursoy]

is done. https://github.com/Vocinity/context-scorer/blob/b0d80edcce07aefcf58c2065b21e0f4ef1b5e7fb/src/Homophonic-Alternatives.hpp#L28