cldf / segments

Unicode Standard tokenization routines and orthography profile segmentation
Apache License 2.0

Non-concatenative orthography #37

Open HughP opened 6 years ago

HughP commented 6 years ago

Do you have any suggestions for handling non-concatenative or non-adjacent characters in an orthography that work together to mark a single phoneme? I have a use case where an orthography's tone marking is written partly before the stem and partly after it. Example: < "bob- > is a High-Low marking around the stem; the pattern is < "[:alpha:]*- >. In this particular orthography there are 5 marks that can occur before the segmental portion of the word/lexical string, and 4 after it. Together these marks encode a variety of tonemes/word melodies.

bambooforest commented 6 years ago

Handling in the sense of tokenization?

HughP commented 6 years ago

Yes, in the sense of tokenization, and of creating an orthography profile.

bambooforest commented 6 years ago

The basic orthography profile creation function will list the beginning and end characters (e.g. <"> and <->) as separate graphemes, each on their own line (as it will for any digraphs you might have). You will want to hand-correct these in your orthography profile.
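For illustration, a hand-corrected profile fragment might look like the following (tab-separated, with a Grapheme column plus a mapping column, per the orthography profile convention; the mapping values here are hypothetical, and the discontinuous tone marks still appear as separate graphemes at this stage):

```
Grapheme	IPA
"	"
-	-
b	b
o	o
```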

For tokenization, then, you will want to create an orthography rules file, whose regular expressions capture the non-adjacent dependencies, e.g.

https://github.com/cldf/segments/blob/master/src/segments/tokenizer.py#L173
https://github.com/cldf/segments/blob/master/tests/test_tokenizer.py#L141

which you can then run, together with your corrected orthography profile, to tokenize the input data.
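To make the rewrite idea concrete, here is a minimal sketch using plain Python `re`, not the segments rules machinery itself. The toneme label `/HL/` and the `[a-z]+` stem pattern are assumptions for illustration; the point is that a single substitution can turn the two non-adjacent marks into one adjacent unit that a profile-based tokenizer can then handle:

```python
import re

# Hypothetical circumfixal tone marking: <"stem-> carries a High-Low melody.
# The rule rewrites the two discontinuous marks into a single toneme symbol
# placed after the stem, so downstream tokenization only sees adjacent units.
HIGH_LOW = re.compile(r'^"([a-z]+)-$')

def apply_rule(word):
    # Replace the discontinuous <"..."-> pattern with <stem /HL/>;
    # words that do not match are returned unchanged.
    return HIGH_LOW.sub(r'\1 /HL/', word)

print(apply_rule('"bob-'))  # bob /HL/
print(apply_rule('bob'))    # bob (no circumfixal marking, left as-is)
```

In an actual rules file each such match/replacement pair would be one line, with one rule per toneme combination (here, up to 5 × 4 pre/post mark pairings).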

Would be great to see what you end up with -- this is an awesome use case!