Closed tranHieuDev23 closed 4 years ago
I agree. `keep_puncts` and `dont_push_puncts` are not entirely equivalent: `dont_push_puncts` can be used to remove punctuation when tokenizing for transformation, while `keep_puncts` can be used to allow punctuation in all cases. My suggestion would be to merge these two arguments into a single `keep_puncts`, which would default to `false` when `for_transformation = false` and to `true` otherwise.
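A minimal sketch of the suggested default, written in Python for brevity and not taken from the project's code. `resolve_keep_puncts` is a hypothetical helper, and `None` is an assumed stand-in for "the caller did not pass the argument":

```python
from typing import Optional


def resolve_keep_puncts(for_transformation: bool,
                        keep_puncts: Optional[bool] = None) -> bool:
    """Hypothetical helper: pick the effective keep_puncts value."""
    if keep_puncts is not None:
        # An explicit choice from the caller always wins.
        return keep_puncts
    # Suggested default: false when for_transformation is false, true otherwise.
    return for_transformation
```

With a single argument resolved this way, a caller can still override the default in either direction, which is what the two separate arguments were being used for.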
My bad, I tested it with the `--k` and `--t` options and it still ran. I should have paid more attention to that.
We need punctuation in the tokenization result to handle special tokens such as `.`, `@` or emoji.
Since `dont_push_puncts` in the `run_tokenizer()` function is always set to `false`, and its name collides with the new option `keep_puncts`, it was removed.
In the console application, `keep_puncts` can be set to `true` with the option `--k`. Tokenization for transformation can be turned on with the option `--t`.
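To illustrate how the two switches relate to the two settings, here is a rough Python sketch using `argparse`. It is not the console application's actual argument handling: `parse_options` and the attribute names `keep_puncts` and `for_transformation` on the parsed result are assumptions made for this example.

```python
import argparse


def parse_options(argv):
    """Hypothetical parser mirroring the two console switches described above."""
    parser = argparse.ArgumentParser(description="illustrative tokenizer options")
    # --k: keep punctuation in the tokenization result (keep_puncts = true).
    parser.add_argument("--k", dest="keep_puncts", action="store_true")
    # --t: turn on tokenization for transformation.
    parser.add_argument("--t", dest="for_transformation", action="store_true")
    return parser.parse_args(argv)
```

Both switches are independent booleans that default to off, so passing `--k` alone keeps punctuation without switching to tokenization for transformation.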