explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Chinese word segmentation model for spaCy #12923

Closed · PythonCancer closed this issue 1 year ago

PythonCancer commented 1 year ago

The Chinese word segmentation model zh_core_web_sm-3.5.0 in spaCy has two files. One is weights.npz, which contains the dimensions and model weight values; that part I understand. The other file is features.msgpack. What is this file for? Is it for features? I want to train my own word segmentation model and embed it into spaCy, so could you explain it?

rmitsch commented 1 year ago

Hi @PythonCancer,

...word segmentation model zh_core_web_sm-3.5.0 in spaCy...

zh_core_web_sm-3.5.0 is a pre-trained spaCy pipeline, not just a word segmentation model. For word segmentation of Chinese text in spaCy, see https://spacy.io/usage/models#chinese - we support character segmentation and the two third-party word segmenters jieba and pkuseg (see the sketch below).
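
For reference, the linked docs show how to select a segmenter when constructing a Chinese pipeline. A minimal sketch (pkuseg requires the spacy-pkuseg package, and `"mixed"` is one of the models distributed for it):

```python
from spacy.lang.zh import Chinese

# Character segmentation (the default)
nlp = Chinese()

# Jieba word segmentation
cfg = {"segmenter": "jieba"}
nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})

# PKUSeg word segmentation, initialized with pkuseg's "mixed" model
cfg = {"segmenter": "pkuseg"}
nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
nlp.tokenizer.initialize(pkuseg_model="mixed")
```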

The other file is features.msgpack; what is this file for? Is it for features?

Yes - these are the features used by pkuseg to determine how to perform word segmentation.
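
If you want to inspect that file, spaCy serializes msgpack data with srsly, so something like the following should work (the path below is hypothetical; adjust it to wherever the pipeline package is installed on your system):

```python
import srsly

# Hypothetical path inside an installed zh_core_web_sm package;
# locate the actual features.msgpack under your site-packages.
path = "zh_core_web_sm/zh_core_web_sm-3.5.0/tokenizer/features.msgpack"
features = srsly.read_msgpack(path)
print(type(features))
```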

Because I want to train my own word segmentation model and embed it into spaCy, can you explain it?

spaCy itself doesn't provide specialized components for word segmentation (as it does for tokenization, lemmatization, dependency parsing, etc.). If you train your own word segmentation model and it outperforms the ones integrated in spaCy w.r.t. accuracy or speed, we're happy to consider integrating it.
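
In the meantime, you can plug a custom segmenter into a pipeline yourself via spaCy's tokenizer registry. A minimal sketch, where `CustomSegmenter` and the whitespace `split()` are stand-ins for your actual model:

```python
import spacy
from spacy.tokens import Doc

class CustomSegmenter:
    """Stand-in tokenizer; replace the split() call with your own model."""
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split()  # your word segmentation model goes here
        # Chinese text has no spaces between tokens, hence spaces=False
        return Doc(self.vocab, words=words, spaces=[False] * len(words))

@spacy.registry.tokenizers("custom_zh_segmenter")
def create_custom_segmenter():
    def create_tokenizer(nlp):
        return CustomSegmenter(nlp.vocab)
    return create_tokenizer

# Build a blank Chinese pipeline that uses the custom tokenizer
nlp = spacy.blank(
    "zh",
    config={"nlp": {"tokenizer": {"@tokenizers": "custom_zh_segmenter"}}},
)
doc = nlp("这是 一个 测试")
print([t.text for t in doc])
```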