Adds escaping to rewriter.

kylebgorman commented 3 years ago

To clarify, [ and ] are special characters in Thrax and Pynini strings. If you want them to be interpreted literally as [ and ], you have to put a backslash escape before them (both in rules and strings in general). pynini.escape automatically does this for us.
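To make the escaping concrete, here is a minimal pure-Python sketch of the behavior described above. It assumes that the characters needing a backslash are `[`, `]`, and the backslash itself; `escape_brackets` is a hypothetical stand-in written for illustration, not the library function, so in the grammar code itself pynini.escape is still the thing to call.

```python
def escape_brackets(text: str) -> str:
    """Backslash-escape '[', ']' and the backslash itself.

    Hypothetical stand-in illustrating what pynini.escape is described
    as doing; it is not the library implementation.
    """
    special = "[]\\"
    return "".join("\\" + ch if ch in special else ch for ch in text)


# Brackets in running text are now treated literally rather than as
# delimiters of a special symbol.
print(escape_brackets("arma [sic] virumque"))
```

Strings without any special characters pass through unchanged, so it is safe to apply this kind of escaping to every input line.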

FYI, don't test your g2p grammar on the Aeneid text: test it on the output of the normalization grammar, and I think that will make the problem go away. At a later date we can combine the different grammars into one.

jillianchang commented 3 years ago

> To clarify, [ and ] are special characters in Thrax and Pynini strings. If you want them to be interpreted literally as [ and ], you have to put a backslash escape before them (both in rules and strings in general). pynini.escape automatically does this for us.
>
> FYI, don't test your g2p grammar on the Aeneid text: test it on the output of the normalization grammar, and I think that will make the problem go away. At a later date we can combine the different grammars into one.

So would I use the rewriter tool to produce the output of the normalization grammar, paste that into a separate txt file, then test the g2p grammar with the rewriter tool on that txt file?

kylebgorman commented 3 years ago

I set up the rewriter so that it works well with UNIX-style pipes, so you don't even have to create those intermediate files. This might look something like (not tested):

cat Aeneid01.txt | ./rewriter.py --far normalize.far --rules NORMALIZE | ./rewriter.py --far pronounce.far --rules PRONOUNCE

This is just a temporary hack, though. Later we can either put all the grammar rules into a single FAR (just by importing the rules we want and then re-exporting them) and apply them as a cascade (./rewriter.py --far everything.far --rules NORMALIZE PRONOUNCE ...), or combine them into a single rule with composition (./rewriter.py --far everything.far --rules EVERYTHING ...). Thrax gives us the modularity to defer these decisions, and I'm trying to resist the urge to over-design...
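The difference between the cascade and the composed rule can be sketched in plain Python. This is only an analogy under stated assumptions: `normalize` and `pronounce` below are toy string functions invented for illustration (the real NORMALIZE and PRONOUNCE rules are FSTs), and `cascade` and `compose` are hypothetical helpers, not part of rewriter.py.

```python
from functools import reduce
from typing import Callable, Sequence

Rule = Callable[[str], str]


def normalize(line: str) -> str:
    # Toy stand-in for the NORMALIZE rule: trim and lowercase.
    return line.strip().lower()


def pronounce(line: str) -> str:
    # Toy stand-in for the PRONOUNCE rule: wrap the string in slashes.
    return "/" + line + "/"


def cascade(line: str, rules: Sequence[Rule]) -> str:
    # Apply each rule in order, feeding the output of one into the next,
    # much like chaining the rewriter with pipes.
    for rule in rules:
        line = rule(line)
    return line


def compose(rules: Sequence[Rule]) -> Rule:
    # Build a single rule, once, that has the same effect as the cascade.
    return lambda line: reduce(lambda acc, rule: rule(acc), rules, line)


everything = compose([normalize, pronounce])
print(cascade("  Arma virumque cano  ", [normalize, pronounce]))
print(everything("  Arma virumque cano  "))
```

Both calls give the same result; the choice is about where the combination happens (per input, at application time, or once, when the combined rule is built), which is why it can safely be postponed.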

Thinking ahead a bit, we may need slightly different "flavors" of the various rules for different data sources. For instance, while Pharr uses j and v, maybe we want to make a webapp that can handle text where even glides are written with i and u. Or maybe we want to support text without macrons someday.

CUNY-CL / latin_scansion

Adds escaping to rewriter. #6