coccoc / coccoc-tokenizer

high performance tokenizer for Vietnamese language
GNU Lesser General Public License v3.0

Missed tokenizing entity names #13

Closed Luvata closed 4 years ago

Luvata commented 4 years ago

Thank you for open-sourcing one of the best and blazingly fast Vietnamese tokenizers :100:

Today, while playing around with the CocCocTokenizer Python binding, I found that it sometimes fails to tokenize entity names correctly.

For example:

>>> T.word_tokenize("Những lần Lam Trường - Đan Trường tái ngộ chung khung hình ở U50")
['Những', 'lần', 'Lam', 'Trường', '-', 'Đan', 'Trường', 'tái_ngộ', 'chung', 'khung_hình', 'ở', 'U50']
# Expected result : ['Những', 'lần', 'Lam_Trường', '-', 'Đan_Trường', 'tái_ngộ', 'chung', 'khung_hình', 'ở', 'U50']

What can I do to help the tokenizer perform better in these cases?

bachan commented 4 years ago

We don't really have any reasonably good NER model for people's names at Cốc Cốc that could easily be included in the package. If you need NER, you can either consider implementing a model yourself on top of it or try the one from the underthesea package.

If you have any good model with reasonable performance, feel free to submit a PR. :)

Luvata commented 4 years ago

Thank you for your suggestion. I'm thinking of using another tokenizer on top of CocCocTokenizer, but I really love CocCocTokenizer's speed, so I wonder if there is a quick (dirty) way to fix my problem. I haven't dug into the code yet, but could I change some configuration, such as some terms in the dictionary, so that it passes my earlier test case on entity names? Would that break other test cases? :) I know this would just be a temporary fix on my side. I'm totally a newbie in this field, so please don't mind me if I say something stupid.

anhducle98 commented 4 years ago

Internally, CocCocTokenizer normalizes all characters to lowercase in advance. If you want a quick fix, I think ad hoc post-processing is the way to go.

For your specific case, you could try merging consecutive tokens whose words all start with an uppercase letter (for example, checked with a regex). That takes one linear pass through the token list.
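A minimal sketch of that linear merge pass (the function name and the `isupper` check are my own choices, not part of the library; it deliberately leaves a lone capitalized token, such as a sentence-initial word, untouched):

```python
def merge_capitalized(tokens):
    """One linear pass: join runs of consecutive capitalized tokens with '_'.

    A token counts as capitalized when every underscore-separated word in it
    starts with an uppercase letter, so 'Lam' and 'Hà_Nội' qualify, while
    'tái_ngộ' and '-' do not.
    """
    merged, run = [], []

    def flush():
        if len(run) > 1:
            merged.append("_".join(run))  # e.g. ['Lam', 'Trường'] -> 'Lam_Trường'
        else:
            merged.extend(run)            # a lone capitalized token stays as-is
        run.clear()

    for tok in tokens:
        if all(word[:1].isupper() for word in tok.split("_")):
            run.append(tok)               # extend the current capitalized run
        else:
            flush()                       # a lowercase token ends the run
            merged.append(tok)
    flush()                               # don't forget a run at the end
    return merged


tokens = ['Những', 'lần', 'Lam', 'Trường', '-', 'Đan', 'Trường',
          'tái_ngộ', 'chung', 'khung_hình', 'ở', 'U50']
print(merge_capitalized(tokens))
# ['Những', 'lần', 'Lam_Trường', '-', 'Đan_Trường', 'tái_ngộ',
#  'chung', 'khung_hình', 'ở', 'U50']
```

Note that this only merges; it cannot split a token the tokenizer has already joined incorrectly.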

It doesn't work in all cases, though. For example: "Lam Trường học ở Hà Nội" => "Lam Trường_học ở Hà_Nội". So you may also want to split tokens, which means you'll need a mini-tokenizer :)

Luvata commented 4 years ago

Wow, it sounds much more complicated than I thought. Once again, thank you for your advice; ad hoc post-processing seems like a reasonable choice for me.