chenditc / semanticSimilarity

EECS 499 project.
Apache License 2.0

Add word2vec package similarity measurement #9

Open chenditc opened 10 years ago

chenditc commented 10 years ago

Baseline:

chenditc commented 10 years ago

on word2sense data:
  word_word approach: Pearson's correlation 0.272020, Spearman's rho 0.223456
  word_description: Pearson's correlation 0.305383, Spearman's rho 0.315322
  description_description: Pearson's correlation 0.078168, Spearman's rho 0.025718

on phrase2word data:
  phrase_description: Pearson's correlation 0.209805, Spearman's rho 0.199060
  description_description: Pearson's correlation 0.152415, Spearman's rho 0.204833

on sentence2phrase data:
  sentence_phrase: Pearson's correlation 0.315163, Spearman's rho 0.318721

paragraph_sentence: Pearson's correlation 0.500769 Spearman's rho 0.466644
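
For readers unfamiliar with the two scores reported above, here is a minimal sketch of how Pearson's correlation and Spearman's rho can be computed between gold similarity ratings and system scores. The function names and the toy numbers are made up for illustration; this is not necessarily how the project computes them.

```python
import math

def pearson(xs, ys):
    """Pearson's correlation between two equal-length score lists.

    Assumes neither list is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with values[order[i]].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman's rho: Pearson's correlation applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical gold ratings and system scores for four text pairs.
gold = [4.0, 3.0, 1.0, 0.0]
predicted = [0.9, 0.4, 0.5, 0.1]
print(pearson(gold, predicted), spearman(gold, predicted))
```

Pearson's correlation rewards scores that are linearly related to the gold ratings, while Spearman's rho only looks at whether the pairs are ranked in the same order.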

chenditc commented 10 years ago

Use word2vec to identify phrases

chenditc commented 10 years ago

vector addition approach for long text comparison

word2sense:
  word to description: Pearson's correlation 0.352023, Spearman's rho 0.374028
  description to description: Pearson's correlation 0.220919, Spearman's rho 0.263663

phrase2word:
  phrase to word: Pearson's correlation 0.440284, Spearman's rho 0.463677
  phrase to description: Pearson's correlation 0.290263, Spearman's rho 0.275450
  description to description: Pearson's correlation 0.063978, Spearman's rho 0.092464

on sentence2phrase data:
  sentence_phrase: Pearson's correlation 0.407504, Spearman's rho 0.401474

paragraph_sentence: Pearson's correlation 0.515978 Spearman's rho 0.553896
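
A minimal sketch of what a vector addition approach can look like: sum the word vectors of each text and compare the two sums. Cosine similarity as the comparison, whitespace tokenization, skipping out-of-vocabulary words, and the toy 3-dimensional embeddings are all assumptions for illustration; the actual code in this repository may differ, and real word2vec vectors have far more dimensions.

```python
import math

def add_vectors(vectors):
    """Sum word vectors component-wise to get one vector for the whole text."""
    return [sum(components) for components in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def text_similarity(text_a, text_b, embeddings):
    """Compare two texts by the cosine of their summed word vectors.

    Words missing from `embeddings` are skipped (an assumption); if a text
    has no known words at all, the similarity is defined as 0.0.
    """
    vecs_a = [embeddings[w] for w in text_a.lower().split() if w in embeddings]
    vecs_b = [embeddings[w] for w in text_b.lower().split() if w in embeddings]
    if not vecs_a or not vecs_b:
        return 0.0
    return cosine(add_vectors(vecs_a), add_vectors(vecs_b))

# Toy embeddings, made up for illustration only.
toy_embeddings = {
    "cat": [1.0, 0.0, 0.0],
    "kitten": [0.9, 0.1, 0.0],
    "sat": [0.0, 1.0, 0.0],
    "car": [0.0, 0.0, 1.0],
}

print(text_similarity("the cat sat", "a kitten sat", toy_embeddings))
print(text_similarity("the cat sat", "car", toy_embeddings))
```

Because addition ignores word order and syntax, every word contributes equally to the sum, which is one plausible reason long texts are harder to compare this way.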

chenditc commented 10 years ago

Here are some thoughts:

  1. word2vec vectors represent word senses fairly accurately, so comparing short texts works well.
  2. At the word2sense level, the word-to-description approach works better than the other approaches, because extracting the sense's description narrows down its meaning.
  3. For phrase2word, sentence2phrase and paragraph2sentence, extracting the description doesn't seem to help much, since the meaning is already clear from the context. Once we extract the description, we not only lose the context information but also introduce noise from the description.
  4. As a next step, I think we could improve long text comparison by using a parser or some other technique better suited to analyzing long text.