AdolfVonKleist / Phonetisaurus

Phonetisaurus G2P
BSD 3-Clause "New" or "Revised" License

Kneser-Ney Smoothing on Expected Counts: Alignment and Joint n-gram models #24

Open AdolfVonKleist opened 7 years ago

AdolfVonKleist commented 7 years ago

The topic of LM training came up again recently.

The aligner produces weighted alignment lattices. There is some evidence that augmenting the Maximization step of the EM alignment process with the sort of expected-count KN smoothing described in the paper referenced in the title should improve the overall quality of the G2P aligner.

The same approach may be used to directly train the target joint n-gram model from the resulting alignment lattices. I previously tried the latter using the Witten-Bell fractional-count implementation in the OpenGrm NGram library, but it seemed to have little impact. The Zhang paper notes a similar outcome, and that EC-KN appears to perform considerably better, even compared to the fractional KN implementation employed in Sequitur.
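For concreteness, here is a rough sketch of what "expected counts" means in this setting, in my own notation rather than anything taken from this codebase or verbatim from the paper: the n-gram count becomes a random variable under the alignment-lattice posterior, its expectation replaces the integer count, and the KN discounts are estimated from expected count-of-counts rather than integer ones:

    % c(u): count of joint n-gram u, a random variable under the alignment posterior p(a | x)
    \tilde{c}(u) = E[c(u)] = \sum_{a} p(a \mid x) \, c_a(u)
    % expected count-of-counts, replacing the integer N_r used by standard KN
    E[N_r] = \sum_{u} P(c(u) = r)
    % the usual absolute-discount estimate, computed from expected count-of-counts
    D \approx E[N_1] / (E[N_1] + 2 E[N_2])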

If I'm going to include some form of LM training after all, maybe this represents the most appropriate choice. There is also a reference implementation available as a GIZA++ add-on.

smilenrhyme commented 4 years ago

@AdolfVonKleist thanks a lot for such a wonderful library 👍 I have the following questions; please share your thoughts.

[Is the usage the same as in the HMM defined in Ref. 1, where the emission probabilities come from the alignment module and the transition probabilities come from an LM trained on the phoneme sequences passed in training as word pairs <Grapheme \t Phoneme>? But it looks like you are trying to improve the alignment module itself using KN smoothing.]

Note: I got an overview of this work from these references:

  1. https://www.aclweb.org/anthology/N07-1047.pdf (M2M EM -> HMM)
  2. Improving WFST-based G2P Conversion with Alignment Constraints and RNNLM N-best Rescoring

Thanks a lot!

AdolfVonKleist commented 4 years ago

You should be able to use KenLM to perform the ARPA training directly; just use the command-line utilities instead of the Python wrappers.

# Align the lexicon:
$ phonetisaurus-align --input=cmudict.formatted.dict \
  --ofile=cmudict.formatted.corpus --seq1_del=false
# Train an n-gram model (5s-10s):
$ estimate-ngram -o 8 -t cmudict.formatted.corpus \
  -wl cmudict.o8.arpa
# Convert to OpenFst format (10s-20s):
$ phonetisaurus-arpa2wfst --lm=cmudict.o8.arpa --ofile=cmudict.o8.fst

Just replace the estimate-ngram call with an equivalent KenLM command. You'll need to output the model in ARPA text format though, so that you can still transform it into a WFST for inference.
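For example, a minimal KenLM substitute for the estimate-ngram step might look like the following (a sketch, assuming KenLM's lmplz is installed; the --discount_fallback flag is sometimes needed because joint-token corpora can trip lmplz's Kneser-Ney discount estimation):

$ lmplz -o 8 --discount_fallback < cmudict.formatted.corpus > cmudict.o8.arpa
# Convert to OpenFst format exactly as before:
$ phonetisaurus-arpa2wfst --lm=cmudict.o8.arpa --ofile=cmudict.o8.fst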

The mitlm call is trained on the output of the alignment step: it just treats the aligned and segmented joint-token sequences as a 'normal' text corpus.
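For reference, each line of that corpus is just a whitespace-separated sequence of joint grapheme}phoneme tokens. A hypothetical line for "test" -> "T EH S T" would look something like the following (illustrative only, not actual aligner output; "}" is the default grapheme/phoneme separator and "|" joins multi-character clusters, e.g. x}K|S):

    t}T e}EH s}S t}T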

smilenrhyme commented 4 years ago

@AdolfVonKleist Thanks for the quick response 👍

Does the RNNLM work in the current master code? Is it used in the same way as mitlm, or is there any benefit to the RNNLM over mitlm?

And just to clarify, is this what you mean by the last line? [Steps from the paper referenced above]

[image: steps from the referenced paper, https://user-images.githubusercontent.com/45142420/78534235-b5e4db00-7807-11ea-9463-60374d6df83e.png]

Thanks :)

AdolfVonKleist commented 4 years ago

Hi,

Yes, it should work; however, the RNNLM code has not been updated since that earliest release and is effectively the same as the original Mikolov code from that time. The only novel contribution there is the joint-token implementation of the decoder.

I did not find it to yield any significant improvement over mitlm as a pure alternative, and both the training and decoding times were significantly slower. The only place where it yielded a modest boost was when used in an ensemble with mitlm, as described in the paper [but again there is a time penalty]. Whether or not that is sufficient reason to use the combined system in a real-world or production setting, as opposed to just the normal joint n-gram models, probably depends on how heavily you prioritize speed versus absolute accuracy.

Best, Joe

smilenrhyme commented 4 years ago

Thanks a lot for the detailed perspective 👍