cldf / segments

Unicode Standard tokenization routines and orthography profile segmentation
Apache License 2.0

Allow specification of lexeme boundary marker #14

Closed xrotwang closed 7 years ago

xrotwang commented 7 years ago

Currently, lexeme boundaries are marked using `#`. It seems that `_` is also quite common. Thus, it would be convenient if `Tokenizer.__call__` gained another keyword argument, `boundary_marker='#'`.
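A minimal sketch of what the proposed keyword might look like. This is an illustrative toy, not the actual `segments` implementation: the class body, the naive per-character splitting, and the normalization to `#` are all assumptions made for the example.

```python
class Tokenizer:
    """Toy stand-in for segments.Tokenizer, illustrating the proposal."""

    def __call__(self, text, boundary_marker='#'):
        # Treat whitespace-separated chunks as words; the chosen marker
        # is recognized as a lexeme boundary and normalized to '#'.
        tokens = []
        for chunk in text.split():
            if chunk == boundary_marker:
                tokens.append('#')
            else:
                tokens.extend(chunk)  # naive one-character "grapheme" split
        return ' '.join(tokens)


tokenizer = Tokenizer()
tokenizer('ab # cd')                        # default marker '#'
tokenizer('ab _ cd', boundary_marker='_')   # '_' recognized as boundary
```

Both calls would yield the same tokenization, since the marker is only a surface convention of the input orthography.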

LinguList commented 7 years ago

Ah, I see what you mean: the boundary marker should be specified before instantiating tokenizer objects. Right now, the `separator` keyword can be chosen to replace the boundary marker in the output of `tokenizer.transform()`, but in that case this mechanism would become superfluous.

xrotwang commented 7 years ago

The way things are set up now, the main entry point for tokenization is `Tokenizer.__call__`, which delegates to `Tokenizer.transform` in case a profile was specified. So I guess `transform` should be renamed `_transform` so as not to have too many public interfaces - or `__call__` must provide access to all arguments of the methods it delegates to.
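The delegation pattern described above can be sketched as follows. Again, this is a hypothetical illustration rather than the real `segments` code: the `profile` attribute, the fallback behavior, and `_transform`'s body are placeholders, the point being that `__call__` forwards its keyword arguments so only one public entry point is needed.

```python
class Tokenizer:
    """Toy sketch: a single public entry point delegating to a private step."""

    def __init__(self, profile=None):
        self.profile = profile  # placeholder for an orthography profile

    def __call__(self, text, **kwargs):
        # Delegate to the (now private) transform step when a profile exists;
        # **kwargs exposes all of _transform's arguments through __call__.
        if self.profile is not None:
            return self._transform(text, **kwargs)
        # Fallback without a profile: naive per-character tokenization.
        return ' '.join(text)

    def _transform(self, text, boundary_marker='#'):
        # Normalize the boundary marker and split, purely for illustration.
        return ' '.join('#' if ch == boundary_marker else ch
                        for ch in text if not ch.isspace())


tok = Tokenizer(profile=object())
tok('ab_cd', boundary_marker='_')  # boundary_marker reaches _transform
```

This keeps the public surface small while still letting callers control every option of the delegated method.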

LinguList commented 7 years ago

Ah, I see, sorry for not getting this earlier. Since `transform` was still working, I didn't notice the change in my applications, as I was still using `transform` as usual.