NTMC-Community / MatchZoo

Facilitating the design, comparison and sharing of deep text matching models.
Apache License 2.0
3.83k stars 899 forks

Rank Hinge Loss with number of negatives Question #748

Closed datistiquo closed 5 years ago

datistiquo commented 5 years ago

Hey,

I have a question for my understanding. How do I need to structure my data to use a pairwise or listwise approach? If I have just relevant/irrelevant documents (1 or 0), then I assume that the number of negatives for the loss is just the number of negative examples (label 0). https://github.com/NTMC-Community/MatchZoo/blob/2.2-dev/matchzoo/losses/rank_hinge_loss.py And I assume that the order needs to be: first the relevant sample and afterwards all the negative ones (for pairwise num_neg=2 (?) and for listwise num_neg>2?)

I have about 1000 documents in the pool and just 1 relevant one for each query. Up to now, my precision is very low, with a very high number of false positives, so I want to try a listwise approach. Would I order my training data so that, for each query, I list the one relevant doc first and afterwards all examples with the same query but the remaining 999 docs? And would I continue like that for each training query?

faneshion commented 5 years ago

Yes, the pairwise data structure is what you described. Typically, we assign only one negative document to each positive document, and the loss is calculated as max(0, 1 - s(q, d_pos) + s(q, d_neg)). In fact, we can also assign several negative documents (e.g., a subset of sampled negative documents) to each positive document. In this way, we can calculate the cross-entropy loss, i.e., -\log \frac{\exp(s(q, d_pos))}{\sum_d \exp(s(q, d))}, where the sum runs over the positive document and its assigned negatives.
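
To make the layout and the two losses concrete, here is a minimal pure-Python sketch. It assumes the scores arrive as one flat list in which each positive is followed directly by its num_neg negatives, as discussed above. The function names (split_groups, pairwise_hinge_loss, softmax_cross_entropy_loss) are hypothetical and written only for illustration. This is not MatchZoo's implementation, and averaging the hinge term over every (positive, negative) pair is an assumption of this sketch.

```python
import math


def split_groups(scores, num_neg):
    """Split a flat list laid out as [pos, neg_1, ..., neg_k, pos, ...]
    into (positive_score, [negative_scores]) groups."""
    if num_neg < 1:
        raise ValueError("num_neg must be at least 1")
    size = num_neg + 1
    if len(scores) == 0 or len(scores) % size != 0:
        raise ValueError("len(scores) must be a positive multiple of num_neg + 1")
    return [(scores[i], scores[i + 1:i + size])
            for i in range(0, len(scores), size)]


def pairwise_hinge_loss(scores, num_neg=1, margin=1.0):
    """Mean of max(0, margin - s_pos + s_neg) over all (pos, neg) pairs.
    Averaging over pairs is a choice made for this sketch."""
    losses = [max(0.0, margin - pos + neg)
              for pos, negs in split_groups(scores, num_neg)
              for neg in negs]
    return sum(losses) / len(losses)


def softmax_cross_entropy_loss(scores, num_neg):
    """Mean of -log(exp(s_pos) / sum_d exp(s_d)) over the groups, where d
    runs over the positive and its negatives."""
    losses = []
    for pos, negs in split_groups(scores, num_neg):
        group = [pos] + list(negs)
        top = max(group)  # subtract the max for numerical stability
        log_denominator = top + math.log(sum(math.exp(s - top) for s in group))
        losses.append(log_denominator - pos)
    return sum(losses) / len(losses)


# Two queries, each with one positive followed by two sampled negatives.
example_scores = [2.0, 0.5, 1.5,   # query 1: pos, neg, neg
                  0.0, 1.0, -1.0]  # query 2: pos, neg, neg
print(pairwise_hinge_loss(example_scores, num_neg=2))
print(softmax_cross_entropy_loss(example_scores, num_neg=2))
```

With roughly 1000 candidate documents per query, sampling a subset of negatives per positive (as suggested above) keeps each group small. The same functions work for any num_neg as long as the flat ordering is preserved.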

uduse commented 5 years ago

I hope things are working well for you now. I’ll go ahead and close this issue, but I’m happy to continue further discussion whenever needed.