darongmean / phonetisaurus

Automatically exported from code.google.com/p/phonetisaurus

Symbol: X not found in input symbols table for symbols that are in symbol tables. #34

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Align with a dictionary that contains a grapheme that gets aligned as part of a two-grapheme cluster mapping to one phoneme, but never on its own. This happened to me in practice with certain accented words in French, but it can also be reproduced with any sequence like "abcd   E F", which will in most cases align as "a|b}E c|d}F". The resulting input symbols will then contain "a|b" and "c|d", NOT "a", "b", "c" or "d", unless those symbols happen to be aligned alone in some other word (see the illustration after these steps). --seq1_max needs to be at its default of 2.
2. Train with estimate-ngram, build the FST, and decode with phonetisaurus-g2p-omega using decoder_type fst_phi on the same dictionary (or on another one containing these symbols).
3. "Symbol: X  not found in input symbols table . Mapping to null..." appears 
and the words are not recognized properly.
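
For illustration, using the example from step 1 and the aligner's default "|" cluster separator and "}" grapheme/phoneme join, the aligned entry and the grapheme symbols it yields look roughly like this (hypothetical, simplified):

    # Aligned entry for "abcd  E F":
    a|b}E c|d}F

    # Grapheme symbols that make it into the model's input symbol table:
    a|b
    c|d
    # "a", "b", "c" and "d" are absent unless some other word aligns them alone,
    # which is why the decoder later reports them as not found.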

What is the expected output? What do you see instead?

I would expect that they would be recognized properly since the symbols have 
been seen, just not independently.

What version of the product are you using? On what operating system?

Version phonetisaurus-0.8a
CentOS 5 / 7

Please provide any additional information below.

The alignment appeared correct during my debugging, and everything in the alignment and training phases seemed fine. It seemed normal that the symbols did not appear by themselves if they were never seen alone in a different word. The problem seems to show up when the input tokens are verified during decoding.

The problem might be in the method tokenize_entry from util.cpp, which looks up symbols in the input symbol table one at a time; in the case above, however, they are only there together, in the form "a|b". Giving it a chance, i.e. not mapping to null immediately but also looking at the next symbol and accepting the "a|b" form, works correctly.
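
For illustration, here is a minimal sketch of that look-ahead idea. It is not the actual tokenize_entry code: the function name is made up, and it assumes an OpenFst-style SymbolTable (where Find() returns -1 for unknown symbols), the "|" cluster separator, and a word already split into graphemes.

    // Hypothetical sketch of the look-ahead tokenization idea (not the real
    // phonetisaurus tokenize_entry). Assumes an OpenFst SymbolTable whose
    // Find() returns -1 for unknown symbols, and "|" as the cluster separator.
    #include <fst/symbol-table.h>
    #include <iostream>
    #include <string>
    #include <vector>

    std::vector<std::string> tokenize_with_lookahead(
        const std::vector<std::string>& graphemes,
        const fst::SymbolTable& isyms,
        const std::string& sep = "|") {
      std::vector<std::string> tokens;
      for (size_t i = 0; i < graphemes.size(); ++i) {
        // Known standalone grapheme: keep it as-is.
        if (isyms.Find(graphemes[i]) != -1) {
          tokens.push_back(graphemes[i]);
          continue;
        }
        // Unknown on its own: before mapping to null, try joining it with the
        // next grapheme, e.g. "û" + "t" -> "û|t", which may exist as a cluster.
        if (i + 1 < graphemes.size()) {
          const std::string cluster = graphemes[i] + sep + graphemes[i + 1];
          if (isyms.Find(cluster) != -1) {
            tokens.push_back(cluster);
            ++i;  // The look-ahead grapheme is consumed as part of the cluster.
            continue;
          }
        }
        // Fall back to the original behavior: warn and map to null (skip).
        std::cerr << "Symbol: " << graphemes[i]
                  << " not found in input symbols table. Mapping to null..."
                  << std::endl;
      }
      return tokens;
    }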

Original issue reported on code.google.com by sorin.io...@sovo-tech.com on 13 Feb 2015 at 3:18

GoogleCodeExporter commented 9 years ago
Hi, thanks for this report.  I will take a closer look at this this week and 
try to fix it.  As a hack to get something into the model in the meantime, you 
could try adding a standalone entry to your aligned lexicon that provides an 
example of the correspondence you are looking for.  Out of curiosity, roughly 
what is the size of the lexicon you are using?

Original comment by Josef.Ro...@gmail.com on 16 Feb 2015 at 1:17

GoogleCodeExporter commented 9 years ago
Hello, and thanks for the quick reply!

Indeed, adding a standalone entry will fix the problem.

When I actually debugged this problem I had a very small training dictionary of only 500 words, which is why the problem happened (I don't have much valid data for now).

I believe I have also seen it happen before with larger dictionaries of up to 160,000 words that I tested, but I did not debug the problem when it happened; I just noticed it was strange that it reported a symbol not found when the training and test dictionaries are the same, since they should have the same input symbols.

I think it might be a "French" issue, where some of the rarer accented graphemes, such as "Ö", "ú" and "û", do not end up being aligned to phonemes by themselves even in larger dictionaries.

Original comment by sorin.io...@sovo-tech.com on 16 Feb 2015 at 3:36

GoogleCodeExporter commented 9 years ago
Ok, good to know.  Another thing you could try would be to dump the n-best alignment lattices [at least for the larger dictionary you mention].  Probably n=2 is fine.  You can use the output of this as direct input to the Google ngram library tools, in combination with Witten-Bell smoothing [which supports the fractional counts in the lattices].  You could also dump the raw n-best alignments and use these to train the model [basically your training corpus would then consist of the top-2 alignments for each entry].  I think you can threshold the n-best [--pthresh maybe?] in the aligner too.  Unfortunately, when I experimented with these variants in the past, the quality was always degraded a bit compared to Kneser-Ney or maxent.
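
For reference, a rough sketch of that pipeline is below. The OpenGrm NGram commands (ngramsymbols, farcompilestrings, ngramcount, ngrammake --method=witten_bell) are the standard ones; the aligner flags for dumping n-best alignments are assumptions based on the comment above, so check phonetisaurus-align --help for the exact names in your version.

    # Dump (roughly) the 2-best alignments per entry; the --nbest flag name is an
    # assumption, verify it against phonetisaurus-align --help.
    phonetisaurus-align --input=lexicon.dict --ofile=lexicon.corpus --seq1_max=2 --nbest=2

    # Train a Witten-Bell smoothed n-gram model with the OpenGrm NGram tools.
    ngramsymbols lexicon.corpus > lexicon.syms
    farcompilestrings --symbols=lexicon.syms --keep_symbols=1 lexicon.corpus > lexicon.far
    ngramcount --order=7 lexicon.far > lexicon.cnts
    ngrammake --method=witten_bell lexicon.cnts > lexicon.mod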

Original comment by Josef.Ro...@gmail.com on 16 Feb 2015 at 3:44

GoogleCodeExporter commented 9 years ago
Thank you again for the other great suggestion; I will give 2-best alignment a try, and when I do I will let you know how the results compare.

For now I took a quicker path to get around the problem: a small modification to the method I mentioned above (tokenize_entry), making it less strict so that it looks for two symbols at a time when validating the input symbols, if a single one is not found.

i.e.
When validating "août  a u" it will accept "û" as long as the next symbol is "t" and "û|t" is in the symbol table. If I gave it "aoûr" and it had not seen this in training, it would still reject the "û". The decoder correctly decodes my word to "a u" after that.

Original comment by sorin.io...@sovo-tech.com on 16 Feb 2015 at 4:06