Closed: PythonCancer closed this issue 1 year ago
Hi @PythonCancer,
> ...word segmentation model zh_core_web_sm-3.5.0 in spaCy...
zh_core_web_sm-3.5.0 is a pretrained spaCy pipeline, not just a word segmentation model. For word segmentation of Chinese text in spaCy, see https://spacy.io/usage/models#chinese: we support character segmentation and the two third-party word segmenters jieba and pkuseg.
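To make the three options concrete, here is a minimal stand-alone sketch of what the default `char` segmenter does (every character becomes its own token); spaCy's real implementation lives in `spacy.lang.zh`, and the config line for switching to jieba is shown as a comment (it assumes spaCy and jieba are installed):

```python
def char_segment(text: str) -> list[str]:
    # spaCy's default "char" segmenter for Chinese, sketched in plain
    # Python: skip whitespace, emit each remaining character as a token.
    return [ch for ch in text if not ch.isspace()]

print(char_segment("我爱北京"))  # ['我', '爱', '北', '京']

# To use a word segmenter instead (requires the jieba or spacy-pkuseg
# package), pass the segmenter name in the tokenizer config:
# import spacy
# nlp = spacy.blank("zh", config={"nlp": {"tokenizer": {"segmenter": "jieba"}}})
```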
> The other file is features.msgpack; what is this file for? Is it for features?
Yes, these are features used by pkuseg to determine how to perform word segmentation.
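For intuition, segmenters like pkuseg typically score each character using features drawn from a small window of surrounding characters. The sketch below is a toy illustration of that idea only; it is NOT pkuseg's actual feature templates or the contents of features.msgpack:

```python
def char_features(text: str, i: int) -> dict:
    # Toy character-window features for position i, loosely in the spirit
    # of the features a CRF-style segmenter uses (hypothetical names).
    return {
        "c0": text[i],                                        # current char
        "c-1": text[i - 1] if i > 0 else "<s>",               # previous char
        "c+1": text[i + 1] if i < len(text) - 1 else "</s>",  # next char
        "bi-1": (text[i - 1] + text[i]) if i > 0 else "<s>" + text[i],
    }

print(char_features("北京大学", 1))
```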
> Because I want to train my own word segmentation model and embed it into spaCy, can you explain it?
spaCy itself doesn't provide specialized components for training word segmentation models (unlike for tagging, lemmatization, dependency parsing, etc.). If you want to train your own word segmentation model and it outperforms the ones integrated in spaCy in terms of accuracy or speed, we're happy to consider integrating it.
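spaCy doesn't prescribe how a custom segmentation model must work internally, but a common design is to tag each character with B/I/E/S (Begin/Inside/End/Single) and decode the tag sequence into words; a custom tokenizer built this way can then be registered with spaCy via `@spacy.registry.tokenizers`. A minimal decoding sketch (the tagging model itself is assumed, not shown):

```python
def decode_bies(text: str, tags: list[str]) -> list[str]:
    # Decode per-character BIES tags into words: a word ends at every
    # "E" (end of multi-char word) or "S" (single-char word) tag.
    words, cur = [], ""
    for ch, tag in zip(text, tags):
        cur += ch
        if tag in ("E", "S"):
            words.append(cur)
            cur = ""
    if cur:  # tolerate a truncated/ill-formed tag sequence
        words.append(cur)
    return words

print(decode_bies("北京大学", ["B", "E", "B", "E"]))  # ['北京', '大学']
```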
Original question: The Chinese word segmentation model zh_core_web_sm-3.5.0 in spaCy has two files. One is weights.npz, which contains dimensions and model weight values, and I can understand that. The other file is features.msgpack; what is this file for? Is it for features? Because I want to train my own word segmentation model and embed it into spaCy, can you explain it?