Support tokenization with concatenated orthography profiles

cldf / segments

Unicode Standard tokenization routines and orthography profile segmentation

Closed xrotwang closed 6 years ago

xrotwang commented 6 years ago

Somewhat difficult to implement within the tokenizer, while rather straightforward to do on the caller side:

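from segments import Tokenizer

def xsampa2tokens(s):
    t1 = Tokenizer(profile='xsampa')  # first pass, using the X-SAMPA profile
    t2 = Tokenizer(profile='ipa')     # second pass, using the IPA profile
    # drop the separators produced by the first pass before re-tokenizing
    return t2(''.join(t1(s).split()))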

so won't implement.
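For anyone who wants to try the caller-side approach without installing anything, here is a minimal, self-contained sketch of the same chaining idea. It does not use the `segments` API: `make_tokenizer`, `chain`, and the two toy profiles are hypothetical stand-ins (a plain greedy longest-match segmenter over a dict), meant only to illustrate how the output of one profile can be fed into the next.

```python
# Hypothetical stand-in for a profile-based tokenizer; NOT the segments API.
def make_tokenizer(profile):
    """Build a tokenizer that segments by greedy longest match against the
    keys of `profile` and replaces each match with the mapped value."""
    keys = sorted(profile, key=len, reverse=True)  # longest keys first

    def tokenize(s):
        out = []
        i = 0
        while i < len(s):
            for k in keys:
                if s.startswith(k, i):
                    out.append(profile[k])
                    i += len(k)
                    break
            else:
                # no profile entry matches: pass the character through
                out.append(s[i])
                i += 1
        return ' '.join(out)

    return tokenize


def chain(first, *rest):
    """Compose tokenizers: the separators from each pass are removed
    before the string is handed to the next tokenizer."""
    def run(s):
        out = first(s)
        for t in rest:
            out = t(''.join(out.split()))
        return out
    return run


# Toy profiles, invented for this example only.
XSAMPA = {'t_h': 'tʰ', 'a': 'a', 'N': 'ŋ'}
IPA = {'tʰ': 'tʰ', 'aŋ': 'aŋ'}

xsampa_to_ipa_tokens = chain(make_tokenizer(XSAMPA), make_tokenizer(IPA))
print(xsampa_to_ipa_tokens('t_haN'))
```

With these toy profiles the first pass yields `tʰ a ŋ`, and the second pass regroups the joined string according to the second profile. The same `chain` helper would accept any callables that take and return a string, so it should in principle also work with real tokenizer objects, though that is untested here.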