facebookresearch / EmpatheticDialogues

Dialogue model that produces empathetic responses when trained on the EmpatheticDialogues dataset.

P100@1 is too low in retrieval model #13

Closed Sjzwind closed 4 years ago

Sjzwind commented 5 years ago

Hi, I constructed my own train/valid/test splits for a multi-turn response-selection task. For a session a, b, c, d, e, f, I build examples as follows (reactonly; for each positive response I randomly sample 99 negatives):

- context a → response b
- context a, b, c → response d
- context a, b, c, d, e → response f

I ran an experiment with an interaction-based BERT model (easy to set up): I concatenate the context and the candidate response and score the pair. I find this task is too easy, with P@1,100 and MAP around 0.9. That is a huge gap from your result of around 0.5, and I don't think the difference between bi-encoder-style and concatenation-style models should be that large, but I can't explain it. Could you analyze the reason? Thanks.
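To be concrete, here is a minimal sketch of the construction I mean (the function name `build_reactonly_examples` and the choice to sample negatives from the pool of all dataset responses are just for illustration, not code from this repo):

```python
import random

def build_reactonly_examples(session, all_responses, num_negatives=99, seed=0):
    """Build (context, positive, negatives) retrieval examples from one session.

    `session` is the list of utterances [a, b, c, d, e, f]; the even turns
    (b, d, f) are the listener responses to retrieve ("reactonly").
    """
    rng = random.Random(seed)
    examples = []
    for i in range(1, len(session), 2):      # response positions: b, d, f
        context = session[:i]                # a | a,b,c | a,b,c,d,e
        positive = session[i]
        pool = [r for r in all_responses if r != positive]
        negatives = rng.sample(pool, num_negatives)
        examples.append({"context": context,
                         "positive": positive,
                         "negatives": negatives})
    return examples
```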

EricMichaelSmith commented 5 years ago

How are you computing your P@1,100 and MAP to get ~90%? Our P@1,100 is the probability that the most highly ranked retrieval candidate (out of 100 candidates) is the correct one. Yeah, I'm sure that concatenating context and response and training on that would increase your P@1,100 by a few percentage points, but a 40-point jump seems quite high to me.

Sjzwind commented 5 years ago

In addition to the bi-encoder vs. concatenation difference, I think another difference is that I treat this task as a binary-classification problem instead of batch-size-way classification, so I generate more training data. I can't explain the result apart from that; it's so strange...
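(By "batch-size-way classification" I mean the usual in-batch-negatives objective for bi-encoder retrieval models; the following is just my sketch of that idea, not code from this repo:)

```python
import torch
import torch.nn.functional as F

def in_batch_negative_loss(context_emb, response_emb):
    """Batch-size-way classification: every context in the batch is scored
    against every response in the batch, and the matching response (the
    diagonal of the score matrix) is treated as the correct class."""
    scores = context_emb @ response_emb.t()                   # (batch, batch)
    targets = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, targets)
```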

Sjzwind commented 5 years ago

I compute P@1,100 the same way you do, but I score each candidate (out of 100) independently: I compute the probability that a candidate is the positive response, so I run the model 100 times, sort the probabilities in descending order, and check whether the most highly ranked candidate is the positive one.
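In code, roughly (here `score_fn` stands in for my interaction BERT model's positive-class probability; the names are just illustrative):

```python
def p_at_1_of_100(score_fn, context, candidates, positive_index):
    """P@1,100 for one example: score all 100 candidates independently,
    rank them by score, and check whether the top-ranked one is the positive."""
    scores = [score_fn(context, cand) for cand in candidates]   # 100 forward passes
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return 1.0 if best == positive_index else 0.0

# Averaged over the test set:
# p_at_1 = sum(p_at_1_of_100(score_fn, c, cands, pos)
#              for c, cands, pos in test_examples) / len(test_examples)
```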

EricMichaelSmith commented 5 years ago

Hmm - can you explain how you are setting up the binary-classification problem? Yeah, maybe that's affecting the results...

Sjzwind commented 5 years ago

For the true examples (context a → response b, context a, b, c → response d, context a, b, c, d, e → response f), I treat them as positive examples, and I construct nine negative examples for every positive one, with the negative responses sampled from all responses in the dataset. So I get training data like this:

| context | response | label |
|---------|----------|-------|
| a | b | 1 |
| a | neg1 | 0 |
| a | neg2 | 0 |
| ... | ... | ... |
| a | neg9 | 0 |

Each label is 1 or 0, so I get a binary-classification task.
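A minimal sketch of this expansion (the function name and the sampling details are just illustrative):

```python
import random

def make_binary_rows(context, positive, response_pool, num_negatives=9, seed=0):
    """Expand one (context, positive response) pair into 10 labeled rows:
    the true response gets label 1 and 9 responses sampled from the rest
    of the dataset get label 0."""
    rng = random.Random(seed)
    rows = [{"context": context, "response": positive, "label": 1}]
    negatives = rng.sample([r for r in response_pool if r != positive], num_negatives)
    rows += [{"context": context, "response": neg, "label": 0} for neg in negatives]
    return rows
```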

EricMichaelSmith commented 5 years ago

Hmm, interesting - my intuition would be that sampling 9 negative candidates wouldn't do as well as sampling more, but maybe that's not true in this case. Do you batch examples into groups of 10, so that the 9 negative examples for each positive example are the 9 other examples in the batch? If you do that and you don't shuffle when training, I suppose it's possible that it might allow the model to better distinguish among different utterances within a single dialogue, which might actually improve the performance. That's just a random hunch, though.

Sjzwind commented 5 years ago

Sorry for the late reply. I just treat it as a binary-classification task with a batch size of 64, and the data are shuffled since I use a RandomSampler in PyTorch. Maybe I need more experiments, because I found that this kind of method got bad results on the PersonaChat corpus, which confuses me.
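For reference, the shuffling setup I mean is just the standard PyTorch one (the tensors below are placeholders for my encoded context-response pairs, not my actual pipeline):

```python
import torch
from torch.utils.data import DataLoader, RandomSampler, TensorDataset

# Placeholder tensors standing in for encoded (context, response) pairs and 0/1 labels.
features = torch.randn(1000, 768)
labels = torch.randint(0, 2, (1000,))
dataset = TensorDataset(features, labels)

# RandomSampler reshuffles the examples every epoch, so the 9 negatives built for a
# positive are not guaranteed to land in the same batch of 64 as that positive.
loader = DataLoader(dataset, sampler=RandomSampler(dataset), batch_size=64)
```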

EricMichaelSmith commented 5 years ago

Interesting - yeah, I don't have an answer for this offhand. I think your idea of trying this kind of method on PersonaChat is a good one, because the domain is not so different and so I'd expect that you'd see similar effects if the model is working properly.