explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

support for zh models #4695

Closed XiepengLi closed 4 years ago

XiepengLi commented 4 years ago

Feature description

Add zh models trained on OntoNotes v5 Chinese.

Could the feature be a custom component or spaCy plugin?

I will release the converted dataset for spaCy.

XiepengLi commented 4 years ago

Hi, @adrianeboyd Could you please provide your Email address for me to share the dataset?

adrianeboyd commented 4 years ago

Hi, you can use adriane AT explosion.ai.

I tried training models with the OntoNotes data that you provided a while ago, but due to differences in the word segmentation between jieba and OntoNotes, the results were not particularly good, especially for the dependency parser. I'd be interested to hear the details about your training configuration and results!

XiepengLi commented 4 years ago

Hi, here are my results with the zh_core_web_lg model trained with default parameter settings:

{
    "tags_acc":94.9494027794,
    "token_acc":100.0,
    "uas":81.6803151127,
    "las":76.1250576884,
    "ents_p":74.4781676633,
    "ents_r":72.5384615385,
    "ents_f":73.4955185659
}

I will also check the segmentation problem that causes the low accuracy. If there's any progress, I'll let you know as soon as possible.

adrianeboyd commented 4 years ago

Are you using -G? My results were similar with -G, but this isn't a realistic scenario.

XiepengLi commented 4 years ago

> Are you using -G? My results were similar with -G, but this isn't a realistic scenario.

Yes.
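For context, -G is the train CLI's --gold-preproc flag, i.e. training and evaluating on the gold-standard segmentation. The kind of v2.x invocation being discussed probably looks roughly like this (the paths are placeholders):

python -m spacy train zh /path/to/output /path/to/train.json /path/to/dev.json --pipeline tagger,parser,ner --gold-preproc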

adrianeboyd commented 4 years ago

What tools would you normally use to do word segmentation? I understand that it's a very difficult problem, but do you know if there's an existing tool with relatively good performance for OntoNotes tokens?

I looked at some newer tools that might be feasible for use with spacy (not too large/slow, no GPU required), and one promising tool was pkuseg. jieba's tokenization is at about 80% f-score for OntoNotes and pkuseg is closer to 90% f-score, but it's still so much slower than jieba that we would be hesitant to use it. (My timing tests showed that it was about 5000x slower than jieba. It would also require some additional work because it has some internal normalizations and whitespace handling that would cause problems for spacy.)
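If anyone wants to reproduce that kind of comparison, here is a minimal sketch of measuring segmentation f-score over character-offset boundaries (jieba and pkuseg used via their standard APIs; the gold segmentation below is a single made-up example, not OntoNotes data):

import jieba
import pkuseg

def boundaries(tokens):
    # Turn a token list into a set of (start, end) character spans.
    spans, start = set(), 0
    for tok in tokens:
        spans.add((start, start + len(tok)))
        start += len(tok)
    return spans

def f_score(pred_tokens, gold_tokens):
    pred, gold = boundaries(pred_tokens), boundaries(gold_tokens)
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

text = "我是中国人。"
gold = ["我", "是", "中国", "人", "。"]  # made-up gold segmentation
print("jieba ", f_score(jieba.lcut(text), gold))
print("pkuseg", f_score(pkuseg.pkuseg().cut(text), gold))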

The other option that Matt has had in mind is having tokenization split the text into individual characters and then have the parser learn how to merge characters back into words by introducing internal subtok dependency relations. There's an option for this in the train CLI (-T), but it's buggy or something isn't configured well yet because the performance is still really bad (way worse than with jieba).
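To make the -T idea a bit more concrete: --learn-tokens has the parser predict "subtok" arcs between characters that belong to the same word, and spaCy ships a merge_subtokens pipeline function that merges those spans back into words after parsing. A rough sketch of how the pieces would fit at inference time, assuming a hypothetical model trained on character-split input (the model name here is a placeholder):

import spacy
from spacy.pipeline import merge_subtokens

nlp = spacy.load("zh_char_model")  # placeholder: a model trained with --learn-tokens
nlp.add_pipe(merge_subtokens, after="parser")  # merge tokens joined by "subtok" deps

doc = nlp("我是中国人。")
print([t.text for t in doc])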

Just for comparison, my results using jieba without -T:

"uas":51.41222817,
"las":47.0903442495,
"tags_acc":83.6531198066,
"token_acc":84.4090917715,
"ents_p":53.3749830232,
"ents_r":43.1868131868,
"ents_f":47.7434246492,

And jieba with -T:

"uas":53.8835665245,
"las":49.3694223569,
"tags_acc":83.7026001349,
"token_acc":89.3474203841,
"ents_p":60.4875873346,
"ents_r":44.7142857143,
"ents_f":51.4184621217,

XiepengLi commented 4 years ago

We normally use jieba with a user_dict. thulac may be another candidate, or we could train a new tokenizer on OntoNotes, or a zh char model with additional parameters such as -T and more layers.

adrianeboyd commented 4 years ago

I think the next simplest step is to try using a custom jieba dictionary that's been extended with the OntoNotes train tokens. I'm worried it won't generalize very well, but I guess we can see that to some extent on the dev data. (I think that the OntoNotes dev data is still way more similar to the OntoNotes train data than a lot of data people will want to use with spacy.)
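Something along these lines, assuming the OntoNotes train tokens are dumped one per line into a text file (the file name is hypothetical; jieba.add_word and jieba.load_userdict are the standard extension hooks):

import jieba

# Hypothetical export of the OntoNotes train split, one token per line.
with open("ontonotes_train_tokens.txt", encoding="utf8") as f:
    for word in {line.strip() for line in f if line.strip()}:
        jieba.add_word(word)

# Alternatively, the same thing via a user dictionary file:
# jieba.load_userdict("ontonotes_user_dict.txt")

print(jieba.lcut("我是中国人。"))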

thulac looks interesting, thanks for the information! I'm not 100% sure about the licensing details (the code has an MIT license, but maybe the model is restricted to research use because of the data used for training?), but I could try it out on OntoNotes just to see.

dcsan commented 4 years ago

Is there a beta version of the Chinese model somewhere? I'm looking to build a learn-Chinese bot, so even something with just basic support would be helpful. And I'd be happy to test it. Mainly I'm looking for POS tagging.

adrianeboyd commented 4 years ago

@dcsan: We're still working on incorporating a word segmenter that corresponds more closely to OntoNotes so the overall performance is a bit better.

dcsan commented 4 years ago

for word segmentation, jieba does an OK job. At least it's consistent... is that a part of the pipeline that you could let people just pass in already segmented text?

re jieba, you probably know already, but I discovered that for python the jieba module does a fair bit of POS tagging too: https://github.com/fxsjy/jieba#%E4%BD%BF%E7%94%A8%E7%A4%BA%E4%BE%8B

adrianeboyd commented 4 years ago

@dcsan: The problem is that jieba produces a different segmentation than is used in OntoNotes, so it's hard to train good models when the gold annotations don't line up with the automatically-segmented tokens. You can see in the results above that the tagger drops from 95% to 84% when you switch from perfect segmentation to automatic segmentation with jieba, which is a pretty big drop. The parser and NER drop even more, and we'd like the model performance to be a bit better than this before releasing anything.

XiepengLi commented 4 years ago

> for word segmentation, jieba does an OK job. At least it's consistent... is that a part of the pipeline that you could let people just pass in already segmented text?
>
> re jieba, you probably know already, but I discovered that for python the jieba module does a fair bit of POS tagging too: https://github.com/fxsjy/jieba#%E4%BD%BF%E7%94%A8%E7%A4%BA%E4%BE%8B

@dcsan Hi, here is one possible way you could try.

from spacy.lang.zh import Chinese
from spacy.tokens import Doc


class JiebaChinese(Chinese):
    '''
    >>> import jieba.posseg as pseg
    >>> nlp = JiebaChinese(pseg)
    >>> for token in nlp('我是中国人。'):
    ...     print(token.text, token.tag_)
    我 r
    是 v
    中国 ns
    人 n
    。 x
    '''

    def __init__(self, pseg):
        super(JiebaChinese, self).__init__()
        self.pseg = pseg  # jieba.posseg module, used for joint segmentation + POS tagging

    def custom_spacy_pipe(self, vocab, words, flags):
        # Build a Doc directly from jieba's words and copy its POS flags onto token.tag_.
        doc = Doc(vocab, words=words, spaces=[False] * len(words))
        for ix, token in enumerate(doc):
            token.tag_ = flags[ix]
        doc.is_tagged = True
        return doc

    def jieba_tokenize(self, text):
        text = text.strip().replace(' ', '')
        words, flags = zip(*self.pseg.lcut(text))
        return self.custom_spacy_pipe(self.vocab, words, flags)

    def make_doc(self, text):
        return self.jieba_tokenize(text)

dcsan commented 4 years ago

@phiedulxp thanks so much for writing this up. The main step I wanted was to produce dependency graphs of Chinese sentences, to compare with the English translations, to see if the focus changes at all. Would the patch above enable token.dep_?

btw the jieba tags are in a different format from normal spaCy POS tags (e.g. ns rather than nsubj). Possibly this can be fixed with a simple remapping (a rough sketch is below); not sure if that messes with some internal code that finds and walks noun_chunks in a doc.

ref https://spacy.io/usage/linguistic-features
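A rough illustration of what such a remapping could look like (the mapping below is a tiny, made-up subset covering the example sentence above, not an official jieba-to-spaCy tag map):

import jieba.posseg as pseg

# Hypothetical, incomplete mapping from jieba/ICTCLAS-style flags to Universal POS tags.
JIEBA_TO_UPOS = {
    "n": "NOUN",    # common noun
    "ns": "PROPN",  # place name
    "nr": "PROPN",  # person name
    "v": "VERB",
    "r": "PRON",
    "x": "PUNCT",
}

for word, flag in pseg.lcut("我是中国人。"):
    print(word, flag, JIEBA_TO_UPOS.get(flag, "X"))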


@adrianeboyd So you're saying the mismatch of segmentation boundaries is causing the problem? Could you rejoin all the text and then just use jieba to segment it again? Or are you using some other values (e.g. POS markup) from the OntoNotes corpus that need to match the segments?

I also saw the Stanford segmenter can be used like this from Python (I always thought it required a running JVM; maybe it's different from CoreNLP):

 from nltk.tokenize.stanford_segmenter import StanfordSegmenter

link

lingvisa commented 4 years ago

@adrianeboyd Just curious: for training with annotated corpora, why is a segmenter necessary, since the tokens are already segmented? Also, in spaCy, can a segmenter model be trained separately using OntoNotes, and then a parser trained on top of that segmenter? Also, I would like to help with the Chinese model development if you have an initial model. I am currently using jieba, but would really like to have a full spaCy model for Chinese.

adrianeboyd commented 4 years ago

@lingvisa You can train a model with gold segmentation, but then it's only useful for data that's already been segmented in the same way. You don't get a clear picture of how well it works in real life with incorrect segmentation. You can see from the results above that when going from gold segmentation to jieba, the parser UAS drops from 0.80 to 0.50.

You can see how jieba is used instead of a spacy-internal tokenizer here: https://github.com/explosion/spaCy/blob/4890db63399d24f088ff6978aa157a0e4672e2eb/spacy/lang/zh/__init__.py

Japanese and Korean have similar setups using mecab.
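The core of that setup, paraphrased and simplified here rather than quoted from the linked file, is roughly: the Chinese subclass overrides make_doc so that jieba does the segmentation and the Doc is built directly from the resulting word list:

from spacy.language import Language
from spacy.tokens import Doc

class Chinese(Language):
    lang = "zh"

    def make_doc(self, text):
        # Segment with jieba instead of spaCy's rule-based tokenizer.
        import jieba
        words = [w for w in jieba.cut(text, cut_all=False) if w]
        return Doc(self.vocab, words=words, spaces=[False] * len(words))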

We decided earlier that pkuseg was too slow, but I think we should try it again, because a slow model would be better than no model.

lingvisa commented 4 years ago

@adrianeboyd This link gives an example of how to train a POS tagger: https://spacy.io/usage/training. Can a segmenter and POS tagger be jointly trained by spaCy with Chinese OntoNotes? If the dep parser needs a specific segmenter, can we just train the segmenter with the OntoNotes data?

lingvisa commented 4 years ago

@adrianeboyd A follow-up: In spaCy, is there a tool to train segmentation? It seems not, at least for English, because segmentation there is rule-based. If not, then in order to use the current OntoNotes or UD-Chinese-GSD corpus to train the dep parser, the best way would be to use an external tool to train a segmenter on those two corpora. Is that right?

Also, in spaCy, is the POS tagger trained together with the dep parser? Please confirm. I hope to get a native Chinese model implementation to replace jieba. Thank you.

adrianeboyd commented 4 years ago

@lingvisa No, there is not a built-in statistical model for word segmentation. The built-in tokenizers are all rule-based.

We will probably train only on OntoNotes because we can provide models based on OntoNotes with an MIT license.

The tagger and the parser are completely separate and not trained together.

lingvisa commented 4 years ago

@adrianeboyd I will try to train a segmenter externally with OntoNotes first and see how it goes.
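If it helps anyone, pkuseg exposes a train() helper (at least in the versions around 0.0.22), so one rough way to do this, with made-up file names, is to export the OntoNotes splits as whitespace-segmented sentences and train on those:

import pkuseg

# Train/dev files: one sentence per line, gold words separated by spaces
# (hypothetical paths exported from the OntoNotes train/dev splits).
pkuseg.train("ontonotes_train_seg.txt", "ontonotes_dev_seg.txt", "./zh_seg_model")

# Load the freshly trained model and segment some text with it.
seg = pkuseg.pkuseg(model_name="./zh_seg_model")
print(seg.cut("我是中国人。"))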

lingvisa commented 4 years ago

@adrianeboyd Please let me know if I can be of any help in getting this done.

howl-anderson commented 4 years ago

@lixiepeng @adrianeboyd @dcsan @lingvisa @jarib [non-official] A Chinese model for spaCy 2.2.x has already been released at https://github.com/howl-anderson/Chinese_models_for_SpaCy. Please feel free to try it.

adrianeboyd commented 4 years ago

If anyone would like to test the upcoming Chinese models, the initial models have been published and can be tested with spacy v2.3.0.dev1:

pip install spacy==2.3.0.dev1 pkuseg==0.0.22 jieba
pip install https://github.com/explosion/spacy-models/releases/download/zh_core_web_sm-2.3.0/zh_core_web_sm-2.3.0.tar.gz --no-deps

Replace sm with md or lg for models with vectors.
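Once installed, the model loads like any other spaCy model; a quick smoke test:

import spacy

nlp = spacy.load("zh_core_web_sm")
doc = nlp("我是中国人。")
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)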

svlandeg commented 4 years ago

Closing this as we now have pretrained statistical models for Chinese. Thanks all for your contributions!

github-actions[bot] commented 2 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.