coccoc / coccoc-tokenizer

high performance tokenizer for Vietnamese language
GNU Lesser General Public License v3.0

Added keep_puncts option, which allows keeping punctuations in result #9

Closed: tranHieuDev23 closed this 4 years ago

tranHieuDev23 commented 4 years ago

We need punctuation in the tokenization result in order to handle special tokens such as ".", "@", or emoji.

Since dont_push_puncts in the run_tokenizer() function is always set to false, and its name conflicts with the new keep_puncts option, I removed it.

In the console application, keep_puncts can be set to true with the --k option. Tokenization for transformation can be turned on with the --t option.

tranHieuDev23 commented 4 years ago

I agree. keep_puncts and dont_push_puncts are not entirely equivalent: dont_push_puncts can be used to remove punctuation when tokenizing for transformation, while keep_puncts can be used to keep punctuation in all cases. My suggestion is to merge these two arguments into a single keep_puncts, which would default to false when for_transformation = false and to true otherwise.
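To make the proposed default concrete, here is a minimal sketch of how a merged keep_puncts argument could be resolved. The names KeepPuncts and resolve_keep_puncts are hypothetical and are not part of the tokenizer's current API; the only behavior taken from the suggestion above is that an unspecified keep_puncts follows for_transformation.

```cpp
#include <cassert>

// Hypothetical tri-state for illustration only (not in coccoc-tokenizer):
// the caller either leaves the choice open or states it explicitly.
enum class KeepPuncts { Default, Yes, No };

// Hypothetical helper: resolves the effective keep_puncts value.
// An explicit request always wins; otherwise the default is
// false when for_transformation is false and true when it is true.
inline bool resolve_keep_puncts(bool for_transformation,
                                KeepPuncts requested = KeepPuncts::Default)
{
    switch (requested) {
        case KeepPuncts::Yes:     return true;
        case KeepPuncts::No:      return false;
        case KeepPuncts::Default: break;
    }
    return for_transformation;
}

// Sanity check covering the default and the explicit overrides.
inline void check_resolve_keep_puncts()
{
    assert(!resolve_keep_puncts(false));                  // plain tokenization drops punctuation
    assert(resolve_keep_puncts(true));                    // transformation keeps it by default
    assert(resolve_keep_puncts(false, KeepPuncts::Yes));  // explicit keep
    assert(!resolve_keep_puncts(true, KeepPuncts::No));   // explicit removal, the old dont_push_puncts case
}
```

With a single argument resolved this way, the old dont_push_puncts use case would still be reachable by passing an explicit "No" while tokenizing for transformation.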

tranHieuDev23 commented 4 years ago

My bad, I tested it with the --k and --t options and it still ran. I should have paid more attention to that.