cldf / segments

Unicode Standard tokenization routines and orthography profile segmentation
Apache License 2.0

Support tokenization with concatenated orthography profiles #36

Closed xrotwang closed 6 years ago

xrotwang commented 6 years ago

This would enable functionality as described in https://github.com/cldf/segments/issues/28#issuecomment-381619859

xrotwang commented 6 years ago

This is somewhat difficult to implement within the tokenizer, but rather straightforward to do on the caller side:

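from segments import Tokenizer

def xsampa2tokens(s):
    # Segment with the first profile, then re-segment its joined output with the second.
    t1 = Tokenizer(profile='xsampa')
    t2 = Tokenizer(profile='ipa')
    return t2(''.join(t1(s).split()))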

So I won't implement this in the tokenizer.
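
For anyone who wants to chain more than two profiles, the same caller-side idea might be generalized roughly as follows. This is only a self-contained sketch: `make_segmenter` and `chain` are hypothetical helpers, the toy greedy longest-match segmenter stands in for `Tokenizer`, and the mappings are made up for illustration rather than taken from real profiles.

```python
def make_segmenter(mapping):
    """Return a toy greedy longest-match segmenter (hypothetical stand-in for a profile)."""
    # Try longer graphemes first so 't_h' wins over any shorter prefix.
    keys = sorted(mapping, key=len, reverse=True)

    def segment(s):
        out, i = [], 0
        while i < len(s):
            for k in keys:
                if s.startswith(k, i):
                    out.append(mapping[k])
                    i += len(k)
                    break
            else:
                # Unknown character: pass it through unchanged.
                out.append(s[i])
                i += 1
        return ' '.join(out)

    return segment


def chain(*segmenters):
    """Apply one or more segmenters left to right, re-joining each intermediate result."""
    def run(s):
        *first, last = segmenters
        for seg in first:
            # Drop the separators so the next pass sees one continuous string.
            s = ''.join(seg(s).split())
        return last(s)
    return run


# Made-up mappings for illustration only, not real profile data.
xsampa_like = make_segmenter({'t_h': 'tʰ', 'S': 'ʃ', 'a': 'a'})
ipa_like = make_segmenter({'tʰ': 'tʰ', 'ʃ': 'ʃ', 'a': 'a'})
xsampa2tokens = chain(xsampa_like, ipa_like)
print(xsampa2tokens('t_haSa'))
```

With real profiles, each callable passed to `chain` could presumably be a `Tokenizer` instance, since the snippet above already calls tokenizers as `t1(s)`.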