UKPLab / gpl

Powerful unsupervised domain adaptation method for dense retrieval. Requires only an unlabeled corpus and yields massive improvements: "GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval" https://arxiv.org/abs/2112.07577
Apache License 2.0

base checkpoint selection #10

Open alsbhn opened 2 years ago

alsbhn commented 2 years ago

I see in the code that two models (distilbert-base-uncased, msmarco-distilbert-margin-mse) are recommended as initial checkpoints. I tried other Sentence-Transformers models like all-mpnet-base-v2, but it didn't work. Is there a difference in the architecture of these models or in how they are implemented here? Which models can be used as initial checkpoints?

kwang2049 commented 2 years ago

Hi @alsbhn, could you please tell me what you mean by "didn't work"? Do you mean the code was not runnable with this setting or something about the performance?

alsbhn commented 2 years ago

The code runs without errors; the issue is with the performance. When I use "distilbert-base-uncased" or "msmarco-distilbert-margin-mse" as the base checkpoint, the performance increases after a few tens of thousands of steps, as expected. But with other models like all-mpnet-base-v2 and all-MiniLM-L6-v2, the model does not perform well on my dataset, and the performance even decreases as I train for more steps.

kwang2049 commented 2 years ago

Thanks for pointing out this issue. I need some time to check what the exact reason could be. As far as I can imagine, there are four potential reasons:

1. The base checkpoint might already be stronger than the teacher cross-encoder.
2. The training steps might be too few: for some target datasets, I found performance could degrade at the beginning, but the final performance improved after longer training (e.g. 100K steps).
3. The negative miner might be too weak. For this, we can try setting base_ckpt and retrievers to the same checkpoint, e.g. sentence-transformers/all-mpnet-base-v2 (see the sketch below). From my experience, this is very important when we use TAS-B as the base checkpoint.
4. It might be due to a mismatch between the similarity functions, i.e. dot product vs. cosine similarity. @nreimers recently found that MarginMSE results in poor in-domain performance when using cosine similarity (compared with a simple CrossEntropy loss). I am not sure whether the same holds in the domain-adaptation scenario. Note that both all-mpnet-base-v2 and all-MiniLM-L6-v2 were trained with cosine similarity.
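To make point 3 concrete, here is a minimal sketch of such a configuration using the `gpl.train` entry point from this repository. The dataset paths, step count, and the generator/cross-encoder choices are illustrative placeholders (taken from the README-style defaults), not a prescription; the key idea is that `base_ckpt` and `retrievers` point to the same checkpoint:

```python
import gpl

# Hedged sketch: use the same Sentence-Transformers checkpoint as both the
# base checkpoint and the hard-negative retriever (point 3 above), and use
# cosine similarity for the retriever score function to match how
# all-mpnet-base-v2 was trained. Paths and step counts are placeholders.
gpl.train(
    path_to_generated_data="generated/my-dataset",          # placeholder path
    base_ckpt="sentence-transformers/all-mpnet-base-v2",
    gpl_score_function="dot",                                # one could also try "cos_sim", cf. point 4
    batch_size_gpl=32,
    gpl_steps=100000,                                        # longer training, cf. point 2
    output_dir="output/my-dataset",                          # placeholder path
    generator="BeIR/query-gen-msmarco-t5-base-v1",
    retrievers=["sentence-transformers/all-mpnet-base-v2"],  # same checkpoint as base_ckpt
    retriever_score_functions=["cos_sim"],
    cross_encoder="cross-encoder/ms-marco-MiniLM-L-6-v2",
    qgen_prefix="qgen",
)
```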