WIP: Support sentencepiece

aalto-speech / subword-kaldi

Properly handle position-dependent phones in a subword lexicon FST

MIT License


WIP: Support sentencepiece #1

Closed: Gastron closed this 4 years ago

Gastron commented 4 years ago

This PR implements proper handling for sentencepiece-style gluing.

STATUS: First implementation done, not yet tested.

psmit commented 4 years ago

Already looks pretty good. Do you have some toy example data for easy testing?

Gastron commented 4 years ago

I don't have toy data, but I think I can easily whip up a hand-written lexicon_disambig example. I intend to create a very small FST first and render it as an image to check that it looks OK. Then I'll try some larger data to check that it compiles without problems.

Gastron commented 4 years ago

Ah, a slight annoyance. With the other lfst scripts, you simply use the pronunciation from lexiconp_disambig.txt. However, sentencepiece can place the space character anywhere (technically even inside subwords), and the g2p mapper cannot handle the space character natively.

My current idea for solving this is a separate script that calls the g2p mapper on each subword part with the space characters stripped, and then inserts special placeholder phones where the space characters were. A little ugly, but I think it should work in practice. I'd like to include that script here, since it's pretty much necessary for using this one, if that's OK?
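The idea above could look roughly like the following. This is only a minimal sketch, not the script from this PR: the function name `pronounce_subword`, the placeholder phone `<space>`, and the letter-per-phone `toy_g2p` are all hypothetical, and the real g2p mapper is assumed to be any callable that maps a space-free string to a list of phones. Sentencepiece marks the space with "▁" (U+2581), which is used as the default here.

```python
from typing import Callable, List

SPACE_CHAR = "\u2581"    # sentencepiece's space marker
SPACE_PHONE = "<space>"  # hypothetical placeholder phone, not from the PR


def pronounce_subword(
    subword: str,
    g2p: Callable[[str], List[str]],
    space_char: str = SPACE_CHAR,
    space_phone: str = SPACE_PHONE,
) -> List[str]:
    """Run g2p on each space-free part and mark every space with a placeholder."""
    phones: List[str] = []
    parts = subword.split(space_char)
    for i, part in enumerate(parts):
        if i > 0:
            # A space character sat between parts[i - 1] and parts[i].
            phones.append(space_phone)
        if part:
            # g2p only ever sees text without the space character.
            phones.extend(g2p(part))
    return phones


def toy_g2p(text: str) -> List[str]:
    """Stand-in g2p for illustration only: one phone per letter."""
    return list(text)


if __name__ == "__main__":
    for sw in ["\u2581talo", "talo", "a\u2581b", "\u2581"]:
        print(sw, " ".join(pronounce_subword(sw, toy_g2p)))
```

Because the space is handled outside the mapper, a space at the start, at the end, or in the middle of a subword all go through the same code path.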

psmit commented 4 years ago

Sure, feel free to include it.

Gastron commented 4 years ago

Even with just one subword for each of the 4 possible "types", the resulting lexicon picture is quite large. Well, I'll try to decode this later.

Gastron commented 4 years ago

This is ready to merge, though we should update the documentation. I think the documentation needs a more general review anyway (there is some legacy code in the examples), so that is probably best done in a separate PR.

psmit commented 4 years ago

Looks good indeed! One other gap in the documentation is that you should also fix "align_lexicon.xxx" in the phones directory, otherwise it is not possible to do phone alignment.
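For reference, a rough sketch of what the text form of that file usually looks like. The thread does not say which fix is needed for a subword lexicon, so this only illustrates the stock layout as I understand it from Kaldi's `prepare_lang.sh`: each line of `align_lexicon.txt` repeats the word in the first two fields and then lists its phones. The helper name `to_align_lexicon` and the sample entries are hypothetical.

```python
from typing import Iterable, List


def to_align_lexicon(lexiconp_lines: Iterable[str]) -> List[str]:
    """Turn 'word prob phone1 phone2 ...' lines into 'word word phone1 phone2 ...'."""
    out: List[str] = []
    for line in lexiconp_lines:
        fields = line.split()
        if len(fields) < 3:
            # Need a word, a probability, and at least one phone.
            continue
        word, phones = fields[0], fields[2:]
        out.append(" ".join([word, word] + phones))
    return out


if __name__ == "__main__":
    # Made-up entries with position-dependent phone suffixes.
    sample = ["\u2581he 1.0 h_B e_E", "llo 1.0 l_B l_I o_E"]
    print("\n".join(to_align_lexicon(sample)))
```

The integer version (`align_lexicon.int`) would then come from mapping the first two fields through the word symbol table and the rest through the phone symbol table.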