GoogleCodeExporter opened 9 years ago
Hi, thanks for this report. I will take a closer look this week and try to
fix it. As a hack to get something into the model in the meantime, you could
try adding a standalone entry to your aligned lexicon that provides an
example of the correspondence you are looking for. Out of curiosity, roughly
what is the size of the lexicon you are using?
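To illustrate the standalone-entry idea: assuming your lexicon is in the
usual whitespace-separated word / pronunciation format, a hypothetical
one-line entry such as

    û  u

[i.e. the problem grapheme on its own, paired with the phoneme you want it
aligned to] should be enough to get that correspondence into the alignment.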
Original comment by Josef.Ro...@gmail.com on 16 Feb 2015 at 1:17
Hello, and thanks for the quick reply!
Indeed, adding a standalone entry fixes the problem.
When I debugged this problem I was using a very small training dictionary of
only 500 words, which is why it happened (I don't have much valid data for
now).
I believe I have also seen it happen before with the larger dictionaries I
tested, of up to 160,000 words, but I did not debug it then; I just found it
strange that it reported a symbol not found when the train and test
dictionaries are the same, since they should have the same input symbols.
I think it might be a "French" issue, where some of the rarer accented
graphemes, such as "Ö", "ú" and "û", do not end up aligned to phonemes by
themselves, even in larger dictionaries.
Original comment by sorin.io...@sovo-tech.com on 16 Feb 2015 at 3:36
OK, good to know. Another thing you could try would be to dump the n-best
alignment lattices [at least for the larger dictionary you mention]. Probably
n=2 is fine. You can use the output of this as direct input to the Google
ngram library tools, in combination with Witten-Bell smoothing [which supports
the fractional counts in the lattices]. You could also dump the raw n-best
alignments and use those to train the model [basically your training corpus
would then consist of the top-2 alignments for each entry]. I think you can
also threshold the n-best [--pthresh maybe?] in the aligner. Unfortunately,
when I experimented with these variants in the past, the quality was always
degraded a bit compared to Kneser-Ney or MaxEnt.
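Roughly, the raw n-best variant would look something like the sketch below;
the tool and flag names are from memory [so double-check them against each
tool's --help], and it assumes the OpenGrm NGram tools are installed:

    # Dump the top-2 alignments per entry as a plain-text corpus
    # (aligner flag names from memory; --nbest / --pthresh may differ)
    phonetisaurus-align --input=train.dic --ofile=train.corpus --nbest=2

    # Build a symbol table, compile the corpus to a FAR archive,
    # count n-grams, and estimate a Witten-Bell smoothed model
    ngramsymbols < train.corpus > train.syms
    farcompilestrings --symbols=train.syms --keep_symbols=1 train.corpus > train.far
    ngramcount --order=7 train.far > train.cnts
    ngrammake --method=witten_bell train.cnts > train.mod

For the lattice variant the idea is the same, except the FAR would be built
from the dumped alignment lattices rather than from plain strings.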
Original comment by Josef.Ro...@gmail.com on 16 Feb 2015 at 3:44
Thank you again for the other great suggestion. I will give 2-best alignment a
try, and when I do I will let you know how the results compare.
For now I took a quicker path around the problem: a small modification to the
method I mentioned above (tokenize_entry) so that it is less strict and, when
a symbol is not found, looks at two symbols at a time while validating the
input symbols.
For example, when validating "août a u" it will accept "û" as long as the
next symbol is "t" and "û|t" is in the symbol table. If I gave it "aoûr" and
that had not been seen in training, it would still reject the "û". After that
change the decoder correctly decodes my word to "a u".
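In code, the change is roughly along the lines of the sketch below. This is
not the actual patch, just a self-contained illustration of the extra lookup;
the function name ValidateTokens is made up, and it assumes the symbols are
checked against an OpenFst SymbolTable (where Find() returns -1 for unknown
symbols) with "|" as the join separator, as in the example above:

    #include <string>
    #include <vector>

    #include <fst/symbol-table.h>

    // Relaxed validation: when a single grapheme is not in the model's input
    // symbol table, also try it joined to the next grapheme with "|" (the
    // aligner's multi-grapheme form) before dropping it.
    std::vector<std::string> ValidateTokens(const std::vector<std::string>& toks,
                                            const fst::SymbolTable& syms) {
      std::vector<std::string> kept;
      for (size_t i = 0; i < toks.size(); ++i) {
        if (syms.Find(toks[i]) != -1) {
          kept.push_back(toks[i]);            // known single symbol, e.g. "t"
        } else if (i + 1 < toks.size() &&
                   syms.Find(toks[i] + "|" + toks[i + 1]) != -1) {
          kept.push_back(toks[i]);            // "û" kept because "û|t" is known
        }
        // otherwise the unknown symbol is dropped, as before
      }
      return kept;
    }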
Original comment by sorin.io...@sovo-tech.com on 16 Feb 2015 at 4:06
Original issue reported on code.google.com by sorin.io...@sovo-tech.com on 13 Feb 2015 at 3:18