cldf / segments

Unicode Standard tokenization routines and orthography profile segmentation
Apache License 2.0

Update to official orthography profile spec #30

Closed xrotwang closed 6 years ago

xrotwang commented 6 years ago

The spec for orthography profiles will change to incorporate metadata via CSVW. The segments package should support this enhancement and also use the metadata file to link to and describe additional files, namely rules and replacements (a set of replacements, possibly specified as regular expressions, to be run before tokenization).

LinguList commented 6 years ago

With the lingpy tutorial in mind, it is important to guarantee backwards compatibility (or to change the tutorial accordingly, once we know when the spec changes). I'm just mentioning this here in case you haven't already thought about it.

xrotwang commented 6 years ago

@LinguList I didn't think about this explicitly, so thanks for the heads-up. But the idea is to make the current behaviour the fallback in case there's no metadata describing the profile. AFAIK the lingpy tutorial doesn't use a rules file, right? I'm asking because this may be the one place where I'd like to introduce a backwards-incompatible change.
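To make the fallback idea concrete, here is a minimal sketch. It is not the segments implementation: `read_profile` is a hypothetical helper, the `<profile>-metadata.json` name follows the usual CSVW convention for locating metadata next to a file, and the tab-separated default is an assumption about the legacy profile format.

```python
import csv
import json
from pathlib import Path


def read_profile(path):
    """Return (rows, metadata) for an orthography profile.

    Hypothetical helper: if a CSVW metadata file named
    "<profile>-metadata.json" sits next to the profile, its dialect
    settings are used; otherwise the profile is read as a plain
    tab-separated file (assumed to be the legacy behaviour).
    """
    path = Path(path)
    md_path = path.with_name(path.name + "-metadata.json")
    metadata = None
    delimiter = "\t"  # assumed legacy default
    if md_path.exists():
        metadata = json.loads(md_path.read_text(encoding="utf-8"))
        # Only the delimiter is honoured in this sketch.
        delimiter = metadata.get("dialect", {}).get("delimiter", delimiter)
    with path.open(encoding="utf-8", newline="") as f:
        rows = list(csv.DictReader(f, delimiter=delimiter))
    return rows, metadata
```

A caller can check whether `metadata` is `None` to tell which path was taken; existing profiles without a metadata file would keep working unchanged.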

LinguList commented 6 years ago

Yes, no rules file! I don't like them, as they are too idiosyncratic. Our policy should be: instead of writing lazy rules that are difficult to handle (we rarely need the full power of a regex), spell things out: ab -> aa should be written as ab -> a, a -> a, abb -> a, etc. (and we'll barely have == n times, but can say: abb -> a, abbb -> a, and finito).
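As an illustration of the "spell it out" policy, here is a minimal sketch of segmentation against an explicit mapping instead of a regex rule. It is an assumption-laden toy, not the segments API: `segment` is a hypothetical function, the greedy longest-match strategy and the U+FFFD marker for unmatched characters are choices made for this example, and the mapping reuses the entries from the comment above.

```python
def segment(word, profile):
    """Greedy longest-match segmentation against an explicit profile.

    `profile` maps each spelled-out grapheme sequence to its replacement.
    Characters that match no entry are marked with U+FFFD (an assumption
    of this sketch).
    """
    longest = max(map(len, profile))
    out, i = [], 0
    while i < len(word):
        # Try the longest candidate first, then shrink.
        for size in range(min(longest, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in profile:
                out.append(profile[chunk])
                i += size
                break
        else:
            out.append("\ufffd")  # no entry covers this character
            i += 1
    return out


# Every variant is listed explicitly; no pattern matching is involved.
profile = {"a": "a", "ab": "a", "abb": "a", "abbb": "a"}
print(segment("abba", profile))    # ['a', 'a']
print(segment("abbbab", profile))  # ['a', 'a']
```

The cost is a longer profile; the gain is that every mapping is visible and can be checked line by line.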