Embedding / Chinese-Word-Vectors

100+ Chinese Word Vectors: more than a hundred pre-trained Chinese word vectors
Apache License 2.0

How to define a baseline for "good word2vec" #44

Open zhouxincheng opened 5 years ago

zhouxincheng commented 5 years ago

I used the toolkit to evaluate my vectors and got a result. However, could you tell us what range of values indicates good vectors?

shenshen-hungry commented 5 years ago

That's a good question.

iris2hu commented 5 years ago

The evaluation is a standard word analogy task. For example, given the words "man", "king" and "woman", we use the word vectors to compute (king - man + woman). If the word most similar to the result is "queen", the question is answered correctly. The evaluation set contains 17,813 analogy questions in total.
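To make the (king - man + woman) step concrete, here is a minimal sketch with NumPy. It is not the toolkit's actual code: the function name `solve_analogy`, the toy 3-dimensional vectors, the length normalization, and the choice to exclude the three question words from the candidates are all assumptions made for illustration.

```python
import numpy as np

def solve_analogy(vectors, a, b, c):
    """Return the word most cosine-similar to (b - a + c).

    `vectors` maps word -> list of floats. The question words a, b, c
    are excluded from the candidates (an assumption of this sketch).
    """
    words = list(vectors)
    index = {w: i for i, w in enumerate(words)}
    matrix = np.array([vectors[w] for w in words], dtype=float)
    # Normalize rows so that a dot product ranks words by cosine similarity.
    matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)

    target = matrix[index[b]] - matrix[index[a]] + matrix[index[c]]
    scores = matrix @ target
    for w in (a, b, c):
        scores[index[w]] = -np.inf  # never answer with a question word
    return words[int(np.argmax(scores))]

# Toy vectors, invented for this example only.
toy = {
    "man": [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.2, -1.0],
}

answer = solve_analogy(toy, "man", "king", "woman")
print(answer)
```

With real embeddings you would load the vectors from a file instead of the `toy` dictionary; the ranking logic stays the same.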

Analogy evaluation measures the extent to which word vectors capture linguistic relations, so the higher the accuracy, the better.
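As a rough illustration of what "accuracy" means here, the sketch below counts the share of analogy questions answered correctly. The function `analogy_accuracy`, the two sample questions, and the lookup-table predictor are hypothetical; the toolkit may compute its score differently (for example, in how it treats out-of-vocabulary words).

```python
def analogy_accuracy(questions, predict):
    """Fraction of questions for which predict(a, b, c) equals the expected word.

    `questions` is a list of (a, b, c, expected) tuples and `predict` is any
    callable returning a word (or None). Both are assumptions of this sketch.
    """
    if not questions:
        return 0.0
    correct = sum(
        1 for a, b, c, expected in questions if predict(a, b, c) == expected
    )
    return correct / len(questions)

# Hypothetical predictions standing in for a real embedding model.
predictions = {
    ("man", "king", "woman"): "queen",      # correct
    ("Paris", "France", "Beijing"): "Japan",  # wrong, expected "China"
}
questions = [
    ("man", "king", "woman", "queen"),
    ("Paris", "France", "Beijing", "China"),
]

accuracy = analogy_accuracy(questions, lambda a, b, c: predictions.get((a, b, c)))
print(accuracy)  # 0.5
```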

For more information about the analogy evaluation, see the paper: Shen Li et al., "Analogical Reasoning on Chinese Morphological and Semantic Relations", ACL 2018.

If you are interested in selecting a good embedding resource for downstream tasks, e.g. text classification and named entity recognition, the conclusions of this paper may be useful: Yuanyuan Qiu et al., "Revisiting Correlations between Intrinsic and Extrinsic Evaluations of Word Embeddings", CCL 2018.