Open HughP opened 6 years ago
Handling in the sense of tokenization?
Yes, in the sense of tokenization, and in the creation of an orthography profile.
The basic orthography profile creation function will list the beginning and end characters (e.g. <"> and <->) as separate graphemes, each on its own line (as it will for any digraphs you might have).
For tokenization, then, you will want to create an orthography rules file in which a regular expression captures the non-adjacent dependencies, e.g.:
https://github.com/cldf/segments/blob/master/src/segments/tokenizer.py#L173
https://github.com/cldf/segments/blob/master/tests/test_tokenizer.py#L141
You can then run the tokenizer with your corrected orthography profile and the rules file to tokenize the input data.
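To illustrate what such a rule does, here is a minimal sketch in plain Python `re` (not the segments rules-file syntax itself; the mark characters and the rewrite target are assumptions based on the example below): a regex captures a stem wrapped by the non-adjacent marks <"> and <-> and folds the pair into a single unit adjacent to the stem, so a profile can then treat it as one grapheme.

```python
import re

# Hypothetical rule mimicking one line of an orthography rules file:
# capture a stem wrapped by the non-adjacent marks <"> ... <->
# and rewrite the pair as a single toneme marker after the stem.
RULE = (re.compile(r'"([a-z]+)-'), r'\1 "-')

def apply_rules(text, rules=(RULE,)):
    """Apply each (pattern, replacement) pair in order, as a rules file would."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text

print(apply_rules('"bob-'))  # the circumfix is folded into one unit: 'bob "-'
```

The real rules file would express the same pattern/replacement pair in the tab-separated format the tokenizer expects (see the linked source and test above).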
Would be great to see what you end up with -- this is an awesome use case!
Do you have any suggestions for handling non-concatenative or non-adjacent characters in an orthography that work together to mark a single phoneme? I have a use case where an orthography's tone marking is done partially before the stem and partially after the stem. Example:
< "bob- >
is a High-Low marking around the stem; the pattern is < "[:alpha:]*- >. In this particular orthography there are 5 items that can occur before the segmental portion of the word/lexical string, and 4 after it. Together these marks create a variety of tonemes/word melodies.
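A rough sketch of that 5 × 4 circumfixal system (the actual mark characters are not given here, so the inventories below are placeholders): two capture groups around the segmental stem pull out the pre-stem and post-stem marks, and the concatenated pair identifies one of the possible tonemes.

```python
import re

# Placeholder inventories: 5 pre-stem marks and 4 post-stem marks
# (the real characters of the orthography are assumptions).
PRE = ['"', "'", '^', '`', '~']
POST = ['-', '.', ':', ';']

# Regex with capture groups around the segmental stem.
pattern = re.compile(r'([{0}])([a-z]+)([{1}])'.format(
    re.escape(''.join(PRE)), re.escape(''.join(POST))))

def toneme(word):
    """Return (stem, pre+post mark pair) for a circumfixally marked word."""
    m = pattern.fullmatch(word)
    if m is None:
        return None
    pre, stem, post = m.groups()
    return stem, pre + post  # the mark pair picks one of 5 x 4 tonemes

print(toneme('"bob-'))  # ('bob', '"-')
```

A lookup table keyed on the mark pair (e.g. `'"-'` → High-Low) would then map each of the 20 combinations to its toneme/word melody.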