Closed Gastron closed 4 years ago
Already looks pretty good. Do you have some toy example data for easy testing?
I don't have toy data, but I think I can easily write a small lexicon_disambig example by hand. I intend to build a very small FST first and render it as an image to check that it looks OK. Then I'll try some larger data to check that it compiles without problems.
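For illustration, a toy check along these lines could look like the sketch below. It is only a sketch under assumptions, not the script from this PR: the function name `lexicon_to_fst_text`, the two lexicon entries, and the `<space>` placeholder phone are all made up here, and optional silence and pronunciation probabilities are left out. It emits arcs in OpenFst text format (source state, destination state, input label, output label), which could then be compiled and drawn with the usual OpenFst tools.

```python
from typing import List, Sequence, Tuple


def lexicon_to_fst_text(lexicon: Sequence[Tuple[str, Sequence[str]]]) -> List[str]:
    """Turn (unit, phones) pairs into OpenFst text-format lines.

    Every entry becomes a loop from state 0 back to state 0. The unit is
    emitted as the output label on the first arc; later arcs output <eps>.
    """
    lines = []
    next_state = 1
    for unit, phones in lexicon:
        src = 0
        for i, phone in enumerate(phones):
            olabel = unit if i == 0 else "<eps>"
            if i == len(phones) - 1:
                dst = 0  # last phone closes the loop
            else:
                dst = next_state
                next_state += 1
            lines.append(f"{src}\t{dst}\t{phone}\t{olabel}")
            src = dst
    lines.append("0")  # state 0 is the only final state
    return lines


# Hypothetical hand-written entries, including a disambiguation symbol.
toy_lexicon = [
    ("\u2581a", ["<space>", "a"]),
    ("b", ["b", "#1"]),
]

if __name__ == "__main__":
    print("\n".join(lexicon_to_fst_text(toy_lexicon)))
```

With only two entries, the resulting graph is small enough to inspect by eye before moving on to larger data.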
Ah, a slight annoyance. The other lfst scripts simply use the pronunciation from lexiconp_disambig.txt. However, sentencepiece can place the space character anywhere (technically even inside subwords), and the g2p mapper cannot handle the space character natively.
My current idea to solve this is a separate script that calls the g2p mapper on each subword part with the space characters stripped, and then inserts special placeholder phones where the space characters were. A little ugly, but I think it should work in practice. I think I'll include that script here, as it's pretty much necessary for using this script, if that's OK?
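Roughly, the idea above could be sketched as follows. This is a minimal sketch under assumptions, not the final script: `pronounce_piece`, `toy_g2p`, and the `<space>` placeholder phone are hypothetical names, and the real g2p mapper would be plugged in as the `g2p` callable. It assumes the space is represented by sentencepiece's default whitespace marker `▁` (U+2581).

```python
from typing import Callable, List

SPACE_MARK = "\u2581"    # sentencepiece's default whitespace marker
SPACE_PHONE = "<space>"  # hypothetical placeholder phone for the space


def pronounce_piece(
    piece: str,
    g2p: Callable[[str], List[str]],
    space_mark: str = SPACE_MARK,
    space_phone: str = SPACE_PHONE,
) -> List[str]:
    """Pronounce one subword piece, keeping the space positions.

    The piece is split at every space marker, g2p is called only on the
    marker-free chunks, and a placeholder phone is inserted wherever a
    marker was, including at the start, middle, or end of the piece.
    """
    phones: List[str] = []
    for i, chunk in enumerate(piece.split(space_mark)):
        if i > 0:
            phones.append(space_phone)  # one placeholder per marker
        if chunk:
            phones.extend(g2p(chunk))   # g2p never sees the marker
    return phones


def toy_g2p(text: str) -> List[str]:
    """Stand-in for the real g2p mapper: one 'phone' per letter."""
    return list(text.lower())


if __name__ == "__main__":
    for piece in ["\u2581hello", "of\u2581the", "ing"]:
        print(piece, " ".join(pronounce_piece(piece, toy_g2p)))
```

Splitting on the marker rather than just stripping it keeps the position of each space, so a marker inside a subword still ends up between the right phones.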
Sure, feel free to include it.
This is ready to merge, though we should update the documentation. But I think the documentation needs a more general review (there is some legacy code in the examples), so maybe that belongs in a separate PR.
Looks good indeed! One other gap in the documentation: you should also fix "align_lexicon.xxx" in the phones directory, otherwise phone alignment is not possible.
This PR implements proper handling for sentencepiece-style gluing.
STATUS: First implementation done, not yet tested.