To reproduce baseline scores

memray commented 2 years ago

Hello @gizacard @GitHub30 ,

I wonder if you can share some details about how to reproduce the unsupervised baseline scores, such as the scores in Table 9. Do you just take existing checkpoints and evaluate them on BEIR or do you pretrain them on your own (using same data/settings as training contriever)? I found that I cannot reproduce the same SimCSE scores with original released checkpoint (https://huggingface.co/princeton-nlp/unsup-simcse-roberta-large).

Also for fine-tuning with MSMARCO, is it similar to the supervised SimCSE training?

Thanks again for sharing the resources! Rui

memray commented 2 years ago

@gizacard Also may I know what learning rate scheduling you use in pretraining? Was there any warmup applied? Thanks!

gizacard commented 2 years ago

Hi,

We use the model joined with the SimCSE release. The differences likely originates from a discrepancy between the two snapshots of code presented in the SimCSE repo: one using HuggingFace model loading and the other being based on their own code. This leads to the default truncation length and the representation extracted not being similar between the two code snapshots. We used the one using HuggingFace, the default text truncation length was 512 and the representation used was the one after the "pooler", while SimCSE implementation uses 128 as default truncation length and extract the representation before the pooler object for unsupervised models. The latest results reported in our paper on arxiv uses 512, the maximum number of tokens allowed by BERT as truncation length, and the representation before the pooler.
I don't know what is the fine-tuning recipe used on MSMARCO by SimCSE. We use in-batch negative examples, similarly to SimCSE, which is very standard to train bi-encoder on supervised data.
We use 5e-5 for pre-training, with warmup for 20k gradient steps, we have just released training code with hyper-parameters.

I hope this helps, Gautier

memray commented 2 years ago

Hi @gizacard ,

I appreciate your help. I'm trying to reproduce the unsupervised results. May I ask some questions about the experiment setting?

Do you clip the gradient norm during training?
According to documents of 256 tokens and span sizes sampled between 5% and 50% of the document length, does it mean the min/max length of Q/D is 12/128 tokens?
Do all runs in Sec 6 (Ablation studies) follow the same setting? What is the size of queue used in the subsection Training data?
I wonder what causes the score gap between two unsupervised runs, 50/50% in Table 8 (avg 34.7) vs Table 11 (36.0)? Is there any difference besides the training steps? Does longer training help much?
Do you observe any performance variance of the unsupervised runs? I found changing random seed in data pipeline can significantly affect the ending scores.

Thank you!!! Rui

facebookresearch / contriever

To reproduce baseline scores #7