cldf / segments

Unicode Standard tokenization routines and orthography profile segmentation
Apache License 2.0

Use information provided in BIPA of CLTS to provide a state-of-the-art IPA segmentizer #29

Closed: LinguList closed this issue 6 years ago

LinguList commented 6 years ago

Our code for BIPA lists more than 6000 different potentially valid ways to segment unsegmented IPA. This could easily be used to provide a first reliable segmentation script for "standard" IPA. It may also be useful for initial orthography profile creation, provided that we list the different aliases in the data as well as the normalizations, all of which could be added to the orthography profile.
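To make the idea concrete, here is a minimal sketch of profile-driven segmentation with aliases and normalizations. It is not the segments or CLTS API: the `PROFILE` rows, the column names `Grapheme` and `IPA`, and the `segment` function are all hypothetical and chosen only for illustration, and the matching strategy shown (greedy longest match) is an assumption about how such a profile would be applied.

```python
import unicodedata

# Hypothetical mini-profile. Each row maps a grapheme (or an alias of it)
# to a normalized IPA segment. A real BIPA-derived profile would be far larger.
PROFILE = [
    {"Grapheme": "tʃ", "IPA": "tʃ"},
    {"Grapheme": "t", "IPA": "t"},
    {"Grapheme": "ʃ", "IPA": "ʃ"},
    {"Grapheme": "a", "IPA": "a"},
    {"Grapheme": "aː", "IPA": "aː"},
    {"Grapheme": "a:", "IPA": "aː"},    # alias: ASCII colon instead of the length mark
    {"Grapheme": "g", "IPA": "\u0261"},  # alias: ASCII g normalized to IPA ɡ (U+0261)
]


def segment(text, profile, missing="\ufffd"):
    """Greedy longest-match segmentation of `text` against `profile`.

    Returns a list of normalized segments. Characters not covered by the
    profile are replaced by `missing` (U+FFFD by default).
    """
    # Compare everything in NFD so that precomposed and decomposed input match.
    mapping = {
        unicodedata.normalize("NFD", row["Grapheme"]): row["IPA"]
        for row in profile
    }
    longest = max(len(grapheme) for grapheme in mapping)
    text = unicodedata.normalize("NFD", text)

    segments = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in mapping:
                segments.append(mapping[chunk])
                i += size
                break
        else:
            # Nothing in the profile matches at this position.
            segments.append(missing)
            i += 1
    return segments


print(segment("tʃa:ga", PROFILE))  # ['tʃ', 'aː', 'ɡ', 'a']
```

Because the longest match wins, "tʃ" is kept as one affricate even though "t" and "ʃ" are also listed, and the alias rows fold "a:" and ASCII "g" into their normalized forms.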

xrotwang commented 6 years ago

If I understand correctly, what you want is rather a state-of-the-art IPA orthography profile, no?

LinguList commented 6 years ago

Yes, something along these lines (if it's feasible in the end...)

xrotwang commented 6 years ago

Ok, and this profile could live in the CLTS package, right?

LinguList commented 6 years ago

Why not? If you think it is consistent to have profiles there, all the better. We could even handle more complex "orthographies" that are similar to SAMPA in the sense of not being easy to parse via segmented CLTS, etc. Actually a good idea! In this case, both profiles, IPA-plain and X-SAMPA, should be there.

xrotwang commented 6 years ago

Yes, I think conceptually it's simpler to have segments as the package implementing the orthography profile spec, so it changes whenever the spec changes, while CLTS is the package with knowledge about actual transcriptions and will change as more knowledge comes in. The only "transcription" segments knows about is Unicode.
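As a rough illustration of what "knowing only Unicode" means, the sketch below groups a string into base characters plus their combining marks, with no knowledge of any transcription system. It is a simplification written for this thread, not the segments implementation and not a full implementation of the Unicode grapheme cluster rules; the function name `unicode_clusters` is hypothetical.

```python
import unicodedata


def unicode_clusters(text):
    """Split `text` into base characters with their combining marks attached.

    Simplified: only characters in the general categories Mn, Mc and Me
    are attached to the preceding character.
    """
    clusters = []
    for char in unicodedata.normalize("NFD", text):
        if clusters and unicodedata.category(char) in ("Mn", "Mc", "Me"):
            clusters[-1] += char  # combining mark: attach to the previous base
        else:
            clusters.append(char)  # anything else starts a new cluster
    return clusters


# The nasalized vowel stays together, but the affricate is split into
# "t" and "ʃ": that grouping is transcription knowledge, not Unicode knowledge.
print(unicode_clusters("tʃã"))
```

Deciding that "tʃ" is a single sound is exactly the kind of knowledge that would live in a profile shipped with CLTS rather than in the segmentation code itself.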