explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Chinese word segmentation model for spaCy #12923

Closed · PythonCancer closed this issue 1 year ago

PythonCancer commented 1 year ago

The Chinese word segmentation model zh_core_web_sm-3.5.0 in spaCy has two files. One is weights.npz, which contains the dimensions and model weight values; that part I understand. The other file is features.msgpack. What is this file for? Is it for features? I want to train my own word segmentation model and embed it into spaCy, so could you explain it?

rmitsch commented 1 year ago

Hi @PythonCancer,

...word segmentation model zh_core_web_sm-3.5.0 in spaCy...

zh_core_web_sm-3.5.0 is a pre-trained spaCy pipeline, not just a word segmentation model. For word segmentation of Chinese text in spaCy, see https://spacy.io/usage/models#chinese - we support character segmentation and the two third-party word segmenters jieba and pkuseg (see the sketch below).
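
For reference, the linked docs show how to select a segmenter when constructing a Chinese pipeline. A minimal sketch (pkuseg requires the spacy-pkuseg package, and `"mixed"` is one of the models distributed for it):

```python
from spacy.lang.zh import Chinese

# Character segmentation (the default)
nlp = Chinese()

# Jieba word segmentation
cfg = {"segmenter": "jieba"}
nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})

# PKUSeg word segmentation, initialized with pkuseg's "mixed" model
cfg = {"segmenter": "pkuseg"}
nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
nlp.tokenizer.initialize(pkuseg_model="mixed")
```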

The other file is features.msgpack; what is this file for? Is it for features?

Yes - these are the features used by pkuseg to determine how to perform word segmentation.
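
If you want to inspect that file, spaCy serializes msgpack data with srsly, so something like the following should work (the path below is hypothetical; adjust it to wherever the pipeline package is installed on your system):

```python
import srsly

# Hypothetical path inside an installed zh_core_web_sm package;
# locate the actual features.msgpack under your site-packages.
path = "zh_core_web_sm/zh_core_web_sm-3.5.0/tokenizer/features.msgpack"
features = srsly.read_msgpack(path)
print(type(features))
```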

Because I want to train my own word segmentation model and embed it into spaCy, can you explain it?

spaCy itself doesn't provide specialized components for word segmentation (as it does for tokenization, lemmatization, dependency parsing, etc.). If you train your own word segmentation model and it outperforms the ones integrated in spaCy w.r.t. accuracy or speed, we're happy to consider integrating it.
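
In the meantime, you can plug a custom segmenter into a pipeline yourself via spaCy's tokenizer registry. A minimal sketch, where `CustomSegmenter` and the whitespace `split()` are stand-ins for your actual model:

```python
import spacy
from spacy.tokens import Doc

class CustomSegmenter:
    """Stand-in tokenizer; replace the split() call with your own model."""
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split()  # your word segmentation model goes here
        # Chinese text has no spaces between tokens, hence spaces=False
        return Doc(self.vocab, words=words, spaces=[False] * len(words))

@spacy.registry.tokenizers("custom_zh_segmenter")
def create_custom_segmenter():
    def create_tokenizer(nlp):
        return CustomSegmenter(nlp.vocab)
    return create_tokenizer

# Build a blank Chinese pipeline that uses the custom tokenizer
nlp = spacy.blank(
    "zh",
    config={"nlp": {"tokenizer": {"@tokenizers": "custom_zh_segmenter"}}},
)
doc = nlp("这是 一个 测试")
print([t.text for t in doc])
```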