facebookresearch / StarSpace

Learning embeddings for classification, retrieval and ranking.
MIT License
3.94k stars, 531 forks

The role of basedoc in docspace tests #207

thisisandreeeee closed this issue 5 years ago

thisisandreeeee commented 5 years ago

Related to #75

I couldn't find much documentation describing the role of the basedoc file when running ./starspace test for docspace recommendations (-trainMode 1 -fileFormat labelDoc), and I'd like to confirm whether the following assumptions are correct.

Given test.txt of the following format:

roger federer loses <tab> venus williams wins <tab> world series ended
i love cats <tab> funny lolcat links <tab> how to be a petsitter

StarSpace uses the first (n-1) documents to predict the nth document (e.g. it uses i love cats<tab>funny lolcat links to predict how to be a petsitter). My question is, how does it actually predict the nth document? Is the mechanism similar to that in ./query_predict, where it tries to rank the documents in basedoc?

I'm skeptical this is the case, because according to the arXiv paper, starspace "predict[s] the n’th article by ranking it against 10,000 other unrelated articles". Am I correct to say that the basedoc contains these 10,000 unrelated articles?

I would really like to understand how basedoc is used for docspace testing, and what the expected file format is, e.g. is each line one document or one user?

ledw commented 5 years ago

@thisisandreeeee The basedoc provides the other candidate articles to compare against. Your interpretation is correct: it ranks the documents in the basedoc together with the correct document (the test.txt example you gave is correct). The expected file format for basedoc is one line per document.
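To make the ranking step concrete, here is a minimal sketch of the idea, not StarSpace's actual code. It assumes the input (the first n-1 documents) and every candidate have already been embedded as plain vectors and that similarity is a dot product. The function names `rank_of_true_doc` and `hits_at_k` are hypothetical, and the real implementation may differ in similarity function and tie handling.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot-product similarity between two embedding vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_of_true_doc(
    input_vec: Sequence[float],
    true_vec: Sequence[float],
    basedoc_vecs: List[Sequence[float]],
) -> int:
    """1-based rank of the true document among itself plus all basedoc candidates.

    Assumption: a candidate only outranks the true document if its score is
    strictly higher (ties are resolved in favor of the true document).
    """
    true_score = dot(input_vec, true_vec)
    better = sum(1 for cand in basedoc_vecs if dot(input_vec, cand) > true_score)
    return better + 1


def hits_at_k(ranks: List[int], k: int) -> float:
    """Fraction of test examples whose true document is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Toy data: one test example and three basedoc candidates (made-up vectors).
input_vec = [1.0, 0.0]
true_vec = [0.9, 0.1]
basedoc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

rank = rank_of_true_doc(input_vec, true_vec, basedoc_vecs)
print(rank, hits_at_k([rank], 1), hits_at_k([rank], 2))
```

In this toy case exactly one basedoc candidate scores higher than the true document, so the true document lands at rank 2: it misses hits@1 but counts for hits@2.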

thisisandreeeee commented 5 years ago

Thanks for the clarification! Wouldn't this also mean that one should always interpret hits@k in conjunction with the size of basedoc, since hits@20 would mean very different things with a sample size of 50 versus 10,000?

ledw commented 5 years ago

@thisisandreeeee Yes, that is exactly how hits@k should be interpreted.
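A quick way to see the dependence on basedoc size is to compute the chance-level hits@k, i.e. what a ranker that orders candidates uniformly at random would score. The sketch below assumes the true document is ranked against all basedoc lines plus itself (so N + 1 candidates in total) and that the true document is not itself a line in basedoc. `chance_hits_at_k` is a hypothetical helper, not part of StarSpace.

```python
def chance_hits_at_k(k: int, basedoc_size: int) -> float:
    """Expected hits@k for a uniformly random ranking.

    The true document is equally likely to land at any of the
    basedoc_size + 1 positions, so the chance of a top-k placement
    is k / (basedoc_size + 1), capped at 1.
    """
    total = basedoc_size + 1
    return min(k, total) / total


for size in (50, 10_000):
    print(f"basedoc size {size}: chance hits@20 = {chance_hits_at_k(20, size):.4f}")
```

With 50 candidates a random ranker already reaches a hits@20 of about 0.39, while with 10,000 candidates it reaches only about 0.002, so the same hits@20 value reflects very different model quality in the two settings.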