Luvata closed this issue 4 years ago
We didn't really have any reasonably good NER model for people's names in Cốc Cốc that could easily be included in the package. If you need NER, you can either implement a model yourself on top of the tokenizer or try the one from the underthesea package.
If you have any good model with reasonable performance, feel free to submit a PR. :)
Thank you for your suggestion. I'm thinking of using another tokenizer on top of CocCocTokenizer, but I really love the speed of CocCocTokenizer, so I wonder if there is a quick (and dirty) way to fix my problem? I haven't dug into the code yet, but could I change some configs, such as some terms in the dictionary, so that it passes my previous test case on these entity names? Would that break other test cases too? :) I know this is just a temporary fix on my side. I'm a total newbie in this field, so please don't mind me if I say something stupid.
Internally, CocCocTokenizer normalizes all characters to lowercase in advance. If you want a quick fix, I think ad hoc post-processing is the way to go.
For your specific case, you may try to merge consecutive tokens whose words start with uppercase letters (for example, use a regex to check). It takes one linear pass through the token list.
It doesn't work in all cases though, for example: "Lam Trường học ở Hà Nội" => "Lam Trường_học ở Hà_Nội". So you may also want to split tokens, which means you'll need a mini-tokenizer :)
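A rough sketch of that merging pass, in case it helps. It assumes you have the tokens as a plain Python list of strings with their original casing and with `_` joining the syllables of a multi-syllable token; the function names are made up for illustration and are not part of CocCocTokenizer. It uses `str.isupper()` instead of a regex because that handles Vietnamese letters without extra setup.

```python
def is_capitalized(token, sep="_"):
    # True only if every syllable of the token starts with an uppercase letter.
    return all(part[:1].isupper() for part in token.split(sep))


def merge_capitalized(tokens, sep="_"):
    # One linear pass: glue each capitalized token onto the previous one
    # when the previous token was capitalized too.
    merged = []
    prev_cap = False
    for tok in tokens:
        cap = is_capitalized(tok, sep)
        if cap and prev_cap:
            merged[-1] = merged[-1] + sep + tok
        else:
            merged.append(tok)
        prev_cap = cap
    return merged


# Hypothetical token lists, written by hand for illustration.
print(merge_capitalized(["Nguyễn", "Văn", "A", "sống", "ở", "Hà_Nội"]))
print(merge_capitalized(["Lam", "Trường_học", "ở", "Hà_Nội"]))
```

The second call leaves the list unchanged, because "Trường_học" is already one token and only its first syllable is capitalized. That is the case where splitting would be needed. A capitalized word at the start of a sentence would also get merged with a name that follows it, so treat this as a heuristic only.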
Wow, it sounds much more complicated than I thought. Once again thank you for your advice, ad hoc post processing seems to be a reasonable choice for me
Thank you for open-sourcing one of the best and most blazing-fast Vietnamese tokenizers :100:
Today, while playing around with the CocCocTokenizer Python binding, I found that it sometimes fails to tokenize entity names correctly.
For example:
What can I do to help the tokenizer perform better in these cases?