Closed a94763075 closed 4 years ago
Hi @a94763075,
It looks like your setup differs from our implementation and from what we reported in quite a few ways, which may explain the model's low performance. To name a few:
Using the [CLS] representation works fine; see the paper and/or our implementation in this repository for details. Can you try using the implementations in this repository directly? Then, if you're still having problems, it will be easier to figure out what's going wrong. Without starting from the same place, it'll be very difficult to pin down what exactly is wrong.
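For context, scoring from the [CLS] representation amounts to a linear projection of that vector. A minimal stand-in sketch in plain Python (a real model would get `cls_vec` from BERT; `w`, `b`, and the numbers here are illustrative only, not the repo's code):

```python
# Minimal stand-in for [CLS]-based scoring: the relevance score is a
# linear projection of the [CLS] vector. In the real model, cls_vec is
# BERT's output for the [CLS] token; w and b are learned weights.
def cls_score(cls_vec, w, b):
    # dot(w, cls_vec) + b -> scalar relevance score
    return sum(wi * xi for wi, xi in zip(w, cls_vec)) + b

score = cls_score([0.5, -0.2, 0.1], [1.0, 2.0, -1.0], 0.3)
# 0.5*1.0 + (-0.2)*2.0 + 0.1*(-1.0) + 0.3 = 0.3
```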
Thanks, sean
I ran the implementations in this repository directly. Dev NDCG@20: 0.4066, P@20: 0.4701. It works fine.
In my reproduction, following your suggestion:
- gradient accumulation
- optimizer: I copied this code from the repository as the optimizer:

```python
params = [(k, v) for k, v in model.named_parameters() if v.requires_grad]
non_bert_params = {'params': [v for k, v in params if not k.startswith('bert.')]}
bert_params = {'params': [v for k, v in params if k.startswith('bert.')], 'lr': BERT_LR}
optimizer = torch.optim.Adam([non_bert_params, bert_params], lr=LR)
```

- document handling: I just take the first 520 tokens, i.e. 520 - [CLS] - 2*[SEP] - qlen document tokens
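The token budget in that document-handling step can be made explicit. A small sketch of the arithmetic (the 520 cap comes from the comment above; the function names are mine):

```python
# Document token budget for a BERT input capped at 520 tokens:
# [CLS] query [SEP] document [SEP]
# so the document gets 520 - 1 ([CLS]) - 2 ([SEP]s) - qlen tokens.
def doc_token_budget(qlen, max_len=520):
    budget = max_len - 1 - 2 - qlen
    return max(budget, 0)   # never negative for very long queries

def truncate_doc(doc_toks, qlen, max_len=520):
    # Keep only the first `budget` document tokens.
    return doc_toks[:doc_token_budget(qlen, max_len)]
```

Note that simple head-truncation like this discards everything past the budget, which is one reason the repository's longer-document handling can matter.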
But the results are still a little different: Dev NDCG@20: 0.37536, P@20: 0.43176
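The gradient accumulation mentioned in the list above follows a generic pattern: sum (mean-scaled) gradients over several micro-batches, then take one optimizer step. A self-contained toy sketch (GRAD_ACC, LR, and the 1-D quadratic loss are my own illustrative choices, not the repo's exact loop):

```python
# Generic gradient-accumulation pattern with a toy 1-D model:
# per-example loss (w - x)^2, micro-batches of size 1.
GRAD_ACC = 4
LR = 0.1

def grad(w, x):
    return 2.0 * (w - x)   # d/dw of (w - x)^2

def train_step(w, batch):
    acc = 0.0
    for x in batch:                   # one micro-batch per example
        acc += grad(w, x) / GRAD_ACC  # scale so the sum equals the batch mean
    return w - LR * acc               # single update per accumulated batch

w = 0.0
w = train_step(w, [1.0, 2.0, 3.0, 4.0])
# mean gradient at w=0 is 2*(0 - 2.5) = -5, so w becomes 0.5
```

The point of the `/ GRAD_ACC` scaling is that the accumulated update matches what a single large batch would produce, so batch size 1 plus accumulation can stand in for a larger effective batch.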
It sounds like you are trying to debug your implementation? I can't really help without seeing the code you are using. I'd recommend continuing to replace components of your implementation with those found in this repository. In particular, handling longer documents is probably important.
I reproduced VanillaBERT but only got NDCG@20: 0.3889, P@20: 0.3180 (optimizer: AdamW, batch size: 1, lr = 1e-5). I trained with hinge loss, randomly choosing pos/neg pairs from the code's provided `f.train.pairs` list.
That is far from the paper's NDCG@20: 0.4541, P@20: 0.4042.
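For reference, pairwise hinge training over sampled pos/neg pairs looks like the following. A minimal sketch (the margin of 1.0 and the pair format are assumptions on my part, not necessarily what the repository uses):

```python
import random

# Pairwise hinge loss: penalize when the positive document does not
# outscore the negative one by at least `margin` (1.0 is illustrative).
def hinge_loss(pos_score, neg_score, margin=1.0):
    return max(0.0, margin - (pos_score - neg_score))

def sample_pair(pairs):
    # pairs: list of (pos_doc, neg_doc) candidates, e.g. built from
    # the training-pairs file mentioned above.
    return random.choice(pairs)
```

One subtlety worth checking when a reproduction underperforms: with hinge loss the gradient is zero once the pair is separated by the margin, so the sampling distribution of pos/neg pairs can noticeably affect results.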