Closed xrotwang closed 7 years ago
Ah, I see what you mean: the boundary marker should be specified before instantializing tokenizer objects. Right now, the "separator" keyword can be chosen to replace the boundary-marker in output of the command tokenizer.transform()
, but in that case, this command would then be superfluous.
The way things are setup now, the main entry for tokenization is Tokenizer.__call__
, which delegates to Tokenizer.transform
in case a profile was specified. So I guess, transform
should be renamed _transform
to not have too many public interfaces - or __call__
must provide access to all arguments of the methods it delegates to.
A, I see, sorry for not getting this earlier. Since "transform" was still working, I didn't see the change in my applications, as I was still using "transform" as usually.
Currently, lexemes boundaries are marked using
#
. It seems that_
is also quite common. Thus, it would be convenient ifTokenizer.__call__
would gain another keyword argumentboundary_marker='#'
.