UKPLab / MMT-Retrieval


Negative samples for CE training #7

Open Fantabulous-J opened 2 years ago

Fantabulous-J commented 2 years ago

Hi, thanks for sharing such amazing work.

From what I can infer from the paper, you just randomly sample negatives from the whole training dataset, is that correct?

I wonder why you didn't use hard negatives sampled from the top-k distribution of a BE model, as some works in information retrieval show this could result in a better reranker.

You mentioned that using such hard negatives causes the CE model to overfit. Do you think any other factors contribute to this overfitting? For example, false negatives (i.e., positives left unlabelled by annotators) may appear in the top-k predictions of a BE model, so sampling hard negatives from this distribution could potentially mislead the reranker.

gregor-ge commented 2 years ago

Hi,

for the CE, we sampled negatives from all examples, correct.

We experimented with sampling (a portion of) the negative examples from a top-k BE distribution but found that it did not work well at all. The main problem was that the BE was fixed, so the negatives were drawn from the same top-k examples throughout training, which likely made it easy for the CE model to overfit on artifacts of those BE examples.
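
For concreteness, this is roughly what the two sampling strategies look like; a minimal NumPy sketch with made-up names, not the actual code in this repo:

```python
import numpy as np

rng = np.random.default_rng(0)


def random_negatives(num_items: int, positive_idx: int, k: int) -> np.ndarray:
    """Draw k negatives uniformly from the whole training set (what we used)."""
    candidates = np.setdiff1d(np.arange(num_items), [positive_idx])
    return rng.choice(candidates, size=k, replace=False)


def topk_be_negatives(be_scores: np.ndarray, positive_idx: int, k: int, pool: int = 100) -> np.ndarray:
    """Draw k negatives from the top-`pool` candidates of a *fixed* BE model.

    `be_scores` holds the BE similarity of every item against the query.
    Because the BE is never updated, the same top-`pool` items come up again and
    again during training, which is what made it easy for the CE to overfit.
    """
    ranked = np.argsort(-be_scores)
    top = ranked[:pool]
    top = top[top != positive_idx]  # drop the labelled positive
    return rng.choice(top, size=k, replace=False)


# Toy usage: 10k items, one positive, 7 negatives per pair
scores = rng.normal(size=10_000)
print(random_negatives(10_000, positive_idx=42, k=7))
print(topk_be_negatives(scores, positive_idx=42, k=7))
```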

A better approach would probably be to also adapt the BE model during training, e.g. by training both at once and regularly updating the negative examples (similar to https://arxiv.org/abs/2007.00808), but that was too expensive compute-wise for us.
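
The ANCE-style refresh would look something like the following; again only a toy sketch with hypothetical encoders and names, not something we implemented here:

```python
# Keep training the BE and periodically rebuild the hard-negative pool from its
# *current* top-k (the idea of https://arxiv.org/abs/2007.00808). Toy encoders and data.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_pairs, dim, refresh_every, k = 512, 32, 50, 8

# Toy "image" and "caption" encoders: one learnable embedding per item.
img_enc = torch.nn.Embedding(num_pairs, dim)
txt_enc = torch.nn.Embedding(num_pairs, dim)
optim = torch.optim.Adam(list(img_enc.parameters()) + list(txt_enc.parameters()), lr=1e-3)


@torch.no_grad()
def refresh_hard_negatives() -> torch.Tensor:
    """Re-encode the whole corpus with the current BE and take top-k captions per image."""
    sims = F.normalize(img_enc.weight, dim=-1) @ F.normalize(txt_enc.weight, dim=-1).T
    sims.fill_diagonal_(float("-inf"))          # never pick the labelled positive
    return sims.topk(k, dim=-1).indices         # (num_pairs, k) hard-negative caption ids


hard_negs = refresh_hard_negatives()
for step in range(200):
    if step % refresh_every == 0:
        hard_negs = refresh_hard_negatives()    # the pool follows the improving BE

    idx = torch.randint(num_pairs, (32,))       # a mini-batch of image ids
    img = F.normalize(img_enc(idx), dim=-1)
    pos = F.normalize(txt_enc(idx), dim=-1)
    neg = F.normalize(txt_enc(hard_negs[idx]), dim=-1)          # (32, k, dim)

    # Contrastive loss: positive caption vs. its k current hard negatives.
    logits = torch.cat([(img * pos).sum(-1, keepdim=True),
                        torch.einsum("bd,bkd->bk", img, neg)], dim=-1)
    loss = F.cross_entropy(logits / 0.05, torch.zeros(len(idx), dtype=torch.long))

    optim.zero_grad()
    loss.backward()
    optim.step()
```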

Your idea about false negatives probably also played a role there. It is a known problem in sparsely-labelled retrieval (e.g. https://arxiv.org/abs/2010.08191). In addition, I did an error analysis on Flickr30k and noticed there, too, that a lot of "mistakes" are actually correct matches that are simply not labelled as such. For MSCOCO, there is the Crisscrossed extension (https://arxiv.org/abs/2004.15020), which tries to reduce those false negatives.
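
If one wanted to guard against such unlabelled positives in the hard-negative pool, one option (roughly in the spirit of the denoising step in RocketQA) would be to drop candidates that score almost as high as the labelled positive. A rough sketch with made-up names and a toy threshold:

```python
import numpy as np


def filter_likely_false_negatives(candidate_ids: np.ndarray,
                                  candidate_scores: np.ndarray,
                                  positive_score: float,
                                  margin: float = 0.1) -> np.ndarray:
    """Keep only candidates that score clearly below the labelled positive.

    Candidates whose (BE or denoising-model) score is within `margin` of the
    positive's score are treated as likely unlabelled positives and removed
    from the hard-negative pool.
    """
    keep = candidate_scores < positive_score - margin
    return candidate_ids[keep]


# Toy usage: 5 top-k candidates, labelled positive scored 0.82
ids = np.array([17, 230, 5, 961, 42])
scores = np.array([0.81, 0.60, 0.78, 0.55, 0.30])
print(filter_likely_false_negatives(ids, scores, positive_score=0.82))
# Candidates 17 and 5 score suspiciously close to the positive and are dropped.
```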

Fantabulous-J commented 2 years ago

Thanks for your reply.

I am quite curious about the performance of the CE trained by sampling from the top-k predictions of a BE model. What does "not work well at all" mean? Is adding the CE model even worse than solely using a BE model for retrieval?

gregor-ge commented 2 years ago

I don't have raw numbers available since it has been some time since the experiments, but the CE model trained with top-k BE negatives significantly underperformed, with R@1 lower by maybe 10 points or more compared to the CE trained with random negatives.

Note that this is the performance when using solely the CE model for retrieval. I don't know how it would have worked in a retrieve-rerank setup in combination with the BE model. That is actually an interesting question that we could have investigated. It's possible that the CE model trained with top-k BE negatives might have been better at "fixing" mistakes of the BE model than a CE model trained with random negatives. Or it might still be worse - hard to tell without trying it out.
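
For reference, by retrieve-rerank I mean the usual two-stage pipeline: the BE narrows the whole corpus down to a top-k candidate set, and the CE only rescores those candidates. A rough sketch with placeholder models, not code from this repo:

```python
import numpy as np

rng = np.random.default_rng(0)
num_items, emb_dim, top_k = 1_000, 64, 20

# Stand-ins for the real models: a precomputed BE index and a CE scoring function.
be_query_emb = rng.normal(size=emb_dim)
be_corpus_emb = rng.normal(size=(num_items, emb_dim))


def ce_score(query_id: int, candidate_ids: np.ndarray) -> np.ndarray:
    """Placeholder for the (expensive) cross-encoder forward pass over query/candidate pairs."""
    return rng.normal(size=len(candidate_ids))


# Step 1: cheap BE retrieval over the whole corpus.
be_scores = be_corpus_emb @ be_query_emb
candidates = np.argsort(-be_scores)[:top_k]

# Step 2: expensive CE reranking, but only over the BE's top-k candidates.
reranked = candidates[np.argsort(-ce_score(0, candidates))]
print(reranked[:5])  # final ranking returned to the user
```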