Do you have a word frequency list for this corpus?

Hello Chris,

I downloaded the word2vec_matlab package and I believe there is a small inconsistency in the final results. The output below is from the function measure_accuracy. I think the last line, the overall accuracy, is slightly off. If you sum the test cases over all categories, i.e. 506 + 4524 + ... + 1332 + 870, the total is 19544. The number of test cases in the file test_analogies.txt is also 19544: the file has 19558 lines, and subtracting the 14 lines that start with ':' (the category names) gives 19544 again. The sum of correct cases has the same issue: 425 + 3741 + 336 + ... + 1212 + 594 gives 14852. That makes the overall accuracy (14852 / 19544) * 100 = 75.99%, not the reported 76.35%. The reported totals, 14258 and 18674, are exactly what you get when the last category (gram9-plural-verbs, 594 / 870) is left out of both sums.

I hope I'm not missing something here. I also checked whether any of the words in the test analogies are missing from the word2vec model, but all of them appear to be present. Could you please check this detail in the code?
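For anyone who wants to repeat the line count and the vocabulary check, here is a minimal sketch in Python. It assumes the file layout described above (a line starting with ':' names a category, every other non-empty line is one test case of space-separated words). The function name `summarize_analogies` and the `vocab` set are hypothetical and not part of the package, and the sketch does no case folding, which the real code may or may not do.

```python
def summarize_analogies(lines, vocab):
    """Count category headers and test cases; collect words not found in vocab.

    `vocab` is a hypothetical set of the model's words, however it was loaded.
    """
    headers = 0
    cases = 0
    missing = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(":"):
            headers += 1  # category name, not a test case
            continue
        cases += 1
        missing.update(w for w in line.split() if w not in vocab)
    return headers, cases, missing


# Tiny made-up sample, only to show the shape of the input.
sample = [": family", "boy girl brother sister", "king queen man woman"]
vocab = {"boy", "girl", "brother", "sister", "king", "queen", "man"}

headers, cases, missing = summarize_analogies(sample, vocab)
print(headers, cases, sorted(missing))
```

On the real file one would pass the lines of test_analogies.txt and the model's vocabulary, and expect 14 headers, 19544 cases, and an empty `missing` set if the counts above are right.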

Thanks!

Measuring model accuracy on completing analogies...
capital-common-countries (425 / 506) 83.99%
capital-world (3741 / 4524) 82.69%
currency (336 / 866) 38.80%
city-in-state (1898 / 2467) 76.94%
family (432 / 506) 85.38%
gram1-adjective-to-adverb (306 / 992) 30.85%
gram2-opposite (364 / 812) 44.83%
gram3-comparative (1223 / 1332) 91.82%
gram4-superlative (1005 / 1122) 89.57%
gram5-present-participle (830 / 1056) 78.60%
gram6-nationality-adjective (1440 / 1599) 90.06%
gram7-past-tense (1046 / 1560) 67.05%
gram8-plural (1212 / 1332) 90.99%
gram9-plural-verbs (594 / 870) 68.28%
Overall accuracy: (14258 / 18674) 76.35%
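The sums can be double-checked with a few lines of Python. This is only a sketch of the arithmetic, using the per-category counts copied from the output above; it is not taken from the package. The second computation drops the last category to show that this reproduces the reported overall line, which would be consistent with the final category never being added to the running totals. That is a guess about the cause, not something confirmed from the code.

```python
# (correct, total) per category, copied from the measure_accuracy output.
results = {
    "capital-common-countries": (425, 506),
    "capital-world": (3741, 4524),
    "currency": (336, 866),
    "city-in-state": (1898, 2467),
    "family": (432, 506),
    "gram1-adjective-to-adverb": (306, 992),
    "gram2-opposite": (364, 812),
    "gram3-comparative": (1223, 1332),
    "gram4-superlative": (1005, 1122),
    "gram5-present-participle": (830, 1056),
    "gram6-nationality-adjective": (1440, 1599),
    "gram7-past-tense": (1046, 1560),
    "gram8-plural": (1212, 1332),
    "gram9-plural-verbs": (594, 870),
}


def overall(counts):
    """Return (correct, total, accuracy in percent) summed over the categories."""
    correct = sum(c for c, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return correct, total, 100.0 * correct / total


# All 14 categories.
correct, total, pct = overall(results)
print(f"All categories:        ({correct} / {total}) {pct:.2f}%")

# Same computation with the last category left out (dicts keep insertion order).
without_last = dict(list(results.items())[:-1])
c2, t2, p2 = overall(without_last)
print(f"Without last category: ({c2} / {t2}) {p2:.2f}%")
```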

chrisjmccormick / word2vec_matlab
