isgursoy opened this issue 2 years ago
For clarification on task 3, "Have an OOV table to report missing words when asked":
I assume that by vocabulary you mean all the words our system might encounter, more specifically the words the ASR will most likely deal with, e.g. "What's the price", "Hello Valerie", "I want echelon row-s", "I want to get information about rowah". The vocabulary will consist of the set of all the words we have seen in these examples. As an example, we can build a possible vocabulary list by looking at the set obtained from these utterances.
We maintain a set of all such words in some database and whenever we have new words, we scan the system and report them back to you whenever you ask for them.
Is that correct? Pardon me, but it's a bit difficult to follow the tasks you have outlined. If you could elaborate a bit more, it would be really helpful.
Totally correct. Just note that the database should now use cmudict-0.7b as the reference for this project.
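A minimal sketch of the OOV table idea confirmed above, assuming cmudict-0.7b in its standard plain-text layout (one `WORD PH1 PH2 ...` entry per line, `;;;` comment lines, latin-1 encoding); the class and function names here are my own, not from the project:

```python
def load_cmudict_vocab(path="cmudict-0.7b"):
    """Return the set of words covered by the cmudict-0.7b reference file."""
    vocab = set()
    with open(path, encoding="latin-1") as f:
        for line in f:
            if line.startswith(";;;"):
                continue  # skip comment lines
            word = line.split()[0]
            # Strip alternate-pronunciation suffixes like "WORD(2)"
            vocab.add(word.split("(")[0])
    return vocab


class OOVTable:
    """Collect words not found in the reference vocabulary and report on demand."""

    def __init__(self, vocab):
        self.vocab = vocab
        self.missing = set()

    def observe(self, text):
        # Naive whitespace tokenization; real ASR output would need proper cleanup.
        for word in text.upper().split():
            if word not in self.vocab:
                self.missing.add(word)

    def report(self):
        return sorted(self.missing)
```

The table only grows; reporting is just a sorted dump of everything the vocabulary never covered.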
word_list.txt Hi @isgursoy, I made a word list of 4,765 words. I didn't know how to put them into the cmudict-0.7b.txt file. Is it possible to have the same word more than once in the vocabulary list? Should I check whether a word is already in it or not?
Is it the unique list of words from 615dcbc2f1223f001a9c3c9c.csv?
Please don't upload or modify anything under gdrive/Development/models.
Hi @isgursoy
I'm using KNIME to make inferences with the dataset.
There are three word lists.
I joined the 3 dictionaries and compared them with our word list. There are 375 missing words in total. I extracted the missing words from the 615dcbc2f1223f001a9c3c9c.csv file.
Missing words: missing_words.csv
In total the word list contains 518,363 words. BigDictionaryWithOurCorpus.csv
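The join-and-compare step above could be sketched like this (the actual work was done in KNIME; file layout and helper names here are assumptions for illustration):

```python
def read_words(path):
    """Read one word per line (first whitespace-separated token), uppercased."""
    with open(path, encoding="utf-8") as f:
        return {line.split()[0].upper() for line in f if line.strip()}


def find_missing(corpus_words, dictionaries):
    """Words in the corpus that none of the joined dictionaries covers."""
    covered = set().union(*dictionaries)
    return sorted(corpus_words - covered)
```

The output of `find_missing` is what would go into a file like missing_words.csv.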
There are many g2p model implementations. They are all more or less the same, so we can just pick the code we like (PyTorch, probably).
ciscodev_missingword_phonem.txt
https://github.com/cmusphinx/g2p-seq2seq => We do not prefer this model because it is implemented in TensorFlow.
https://github.com/hajix/G2P => The trained model is not shared, so we cannot use a ready-made model; we would need to retrain it.
https://github.com/mdda/g2p => This model's output is closest to cmudict, since it includes stress numbers between syllables.
Also, it is quite easy to intervene in the code written for the model (g2p.py) and add rules to it. https://github.com/mdda/g2p/blob/master/g2p_en/g2p.py
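As an illustration of the kind of rule that could be layered on top of the model's output, here is a hypothetical post-processing hook; the rule shown is invented for demonstration and is not part of mdda/g2p:

```python
# Each rule rewrites a phoneme subsequence; these are illustrative only.
RULES = [
    (["W", "Y"], ["W", "ER1"]),  # hypothetical fix for a known model error
]


def apply_rules(phonemes):
    """Apply hand-written replacement rules to a phoneme sequence."""
    out = list(phonemes)
    for pattern, replacement in RULES:
        n = len(pattern)
        i = 0
        while i <= len(out) - n:
            if out[i:i + n] == pattern:
                out[i:i + n] = replacement
                i += len(replacement)  # skip past the replacement
            else:
                i += 1
    return out
```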
For example:
WORLD => W Y L D (roedoejet/g2p)
WORLD => W ER L D (CiscoDevNet/g2p_seq2seq_pytorch)
WORLD => W ER1 L D (cmudict-0.7b.txt)
WORLD => W ER1 L D (mdda/g2p)
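Since the outputs above differ mainly in the stress digits (ER vs ER1), comparing models fairly usually needs a normalization step first. A small sketch, assuming ARPAbet-style symbols where stress is a trailing 0/1/2:

```python
def strip_stress(phonemes):
    """Remove trailing 0/1/2 stress markers: 'ER1' -> 'ER', 'AH0' -> 'AH'."""
    return [p.rstrip("012") for p in phonemes]
```

With this, the cmudict, mdda, and Cisco outputs for WORLD all normalize to `W ER L D`, isolating real pronunciation errors (like `W Y L D`) from mere stress-notation differences.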
The results for the models are below.
I could not follow the first schema, but you are proposing to proceed with the mdda/g2p synthesizer, right? Yes, as I see from https://github.com/Vocinity/context-scorer/files/7905547/missing_words.csv (it would be good to use a comma delimiter for CSV files), you are right.
I had 2 different phoneme-synthesis models and needed to decide which of the two is better.
I couldn't be sure by inspecting a small subset by eye, so I prepared a test set and scored the phonemes synthesized by the two models against phonemes prepared by hand.
I took 4,029 random words and their phonemes from cmudict-0.7b.txt. We are sure of their accuracy because this set was prepared by hand, and it contains sample words from A to Z. Using this word list, I took the results from the two models and scored them against the actual test word list. Test set: cmudict-0.7b_testSample.csv
Below you can see the scores comparing the actual word phonemes with the model-synthesized ones.
https://github.com/CiscoDevNet/g2p_seq2seq_pytorch
Looking at these results, I can say that the mdda model is more accurate. In addition, when I get the phoneme result from this model, it contains stress numbers such as 0, 1 and 2 between syllables, as in cmudict-0.7b.txt.
The Cisco model makes mistakes on over a thousand words, while the mdda model misspelled about five hundred.
Validation Result :
cmu_mdda_validation_result.csv
cmu_cisco_validation_result.csv
I printed out the correct and misspelled words for both models, and I will check the wrong ones. In summary, I decided that it would be more correct to continue with the mdda model.
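The validation described above amounts to exact-match accuracy over the hand-checked test sample. A sketch of that scoring, assuming a word-to-phoneme-list mapping rather than the actual CSV/KNIME layout:

```python
def score(references, predictions):
    """Exact-match accuracy of predicted pronunciations against references.

    references/predictions: dict mapping WORD -> list of phonemes.
    Returns (accuracy, list_of_misspelled_words).
    """
    correct = [w for w in references if predictions.get(w) == references[w]]
    wrong = [w for w in references if w not in correct]
    accuracy = len(correct) / len(references)
    return accuracy, wrong
```

The `wrong` list is what would be dumped into files like cmu_mdda_validation_result.csv for manual review.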
- Integrate a Grapheme-To-Phoneme (g2p) DL model into the homonym composer class and optimize its latency as much as possible. In case of OOV inputs: run model, [@isgursoy]
OOV is our second big enemy. At best, it renders the context scorer harmlessly useless, and even being useless is not good. Soundex and Double Metaphone matching can work against OOV, but they provide quite low accuracy compared to phoneme matching. So the phoneme dictionary should be able to grow dynamically, and the size of the similarity maps should not be limited to an edit distance of 2.
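To make the contrast with Soundex/Double Metaphone concrete, here is a sketch of phoneme-level matching: rank dictionary words by edit distance over phoneme sequences, with no hard cap at distance 2. The function names and lexicon layout are mine, not from the context-scorer codebase:

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]


def closest(target, lexicon):
    """Rank dictionary words by phoneme distance to `target` (a phoneme list)."""
    return sorted(lexicon, key=lambda w: edit_distance(target, lexicon[w]))
```

Unlike a fixed-code scheme such as Soundex, this operates on the full pronunciation, so similarity maps can extend as far as the application needs.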
Here are the tasks:
Here are the details: