UKPLab / sentence-transformers

Multilingual Sentence & Image Embeddings with BERT
https://www.SBERT.net
Apache License 2.0

Errors in inference cross-encoders #568

Closed. iknoorjobs closed this issue 3 years ago

iknoorjobs commented 3 years ago

Hi

I fine-tuned a cross-encoder using one of the Hugging Face models (link) on the STS dataset with your training script. When I load the model with the command below, it shows the following warning:

model = CrossEncoder('lordtt13/COVID-SciBERT', num_labels=1)

Some weights of the model checkpoint at lordtt13/COVID-SciBERT were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at lordtt13/COVID-SciBERT and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Now, when I use the model after training: 1) it is comparatively slow at inference compared to the cross-encoder models provided by sentence-transformers, and 2) it gives the following error for some longer input pairs: RuntimeError: The size of tensor a (535) must match the size of tensor b (512) at non-singleton dimension 1

Could you please tell me why this is happening, or whether I am missing something?

Many thanks Iknoor

nreimers commented 3 years ago

Hi @iknoorjobs

1) The inference speed mainly depends on the number of layers. I sadly don't know which base model they used (DistilBERT, bert-base, or bert-large), but this has the biggest impact on speed. If you compare models of the same type (distil, base, large), you should get roughly the same inference time.

2) This indicates that the input is not being truncated. Some older config files on Hugging Face do not specify the maximum input length. In that case, a 535 word-piece text is passed to the model, but the model only supports inputs of up to 512 word pieces.

When you install sentence-transformers from source, you can load the model like this:

model = CrossEncoder('model_name', max_length=512)

The max_length parameter will be part of the next release (0.3.9).
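
For reference, a rough sketch of what the truncation does under the hood (the model name is a placeholder, and this uses the plain Hugging Face tokenizer rather than the CrossEncoder internals):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('model_name')
# Without truncation, a long query/passage pair can exceed BERT's 512 word-piece limit
features = tokenizer('a very long query ...', 'a very long passage ...',
                     truncation=True, max_length=512, return_tensors='pt')
print(features['input_ids'].shape)  # second dimension is at most 512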

iknoorjobs commented 3 years ago

Hi @nreimers

Many thanks for your response.

The previous error is gone, but if I now try to load an nboost model for training a cross-encoder, it shows the following error when evaluating the model on the dev set.

Code for loading model:

model = CrossEncoder('nboost/pt-biobert-base-msmarco', max_length=512)

Error during training when evaluating on dev set:

2020-11-17 19:16:55 - CECorrelationEvaluator: Evaluating the model on sts-dev dataset after epoch 0:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-a97b57ce0904> in <module>
     13 
     14 # Train the model
---> 15 model.fit(train_dataloader=train_dataloader,
     16           evaluator=evaluator,
     17           epochs=num_epochs,

~/sentence-transformers/sentence_transformers/cross_encoder/CrossEncoder.py in fit(self, train_dataloader, evaluator, epochs, loss_fct, acitvation_fct, scheduler, warmup_steps, optimizer_class, optimizer_params, weight_decay, evaluation_steps, output_path, save_best_model, max_grad_norm, use_amp, callback)
    206 
    207             if evaluator is not None:
--> 208                 self._eval_during_training(evaluator, output_path, save_best_model, epoch, -1, callback)
    209 
    210 

~/sentence-transformers/sentence_transformers/cross_encoder/CrossEncoder.py in _eval_during_training(self, evaluator, output_path, save_best_model, epoch, steps, callback)
    278         """Runs evaluation during the training"""
    279         if evaluator is not None:
--> 280             score = evaluator(self, output_path=output_path, epoch=epoch, steps=steps)
    281             if callback is not None:
    282                 callback(score, epoch, steps)

~/sentence-transformers/sentence_transformers/cross_encoder/evaluation/CECorrelationEvaluator.py in __call__(self, model, output_path, epoch, steps)
     43 
     44 
---> 45         eval_pearson, _ = pearsonr(self.scores, pred_scores)
     46         eval_spearman, _ = spearmanr(self.scores, pred_scores)
     47 

~/anaconda3/envs/sbert/lib/python3.8/site-packages/scipy/stats/stats.py in pearsonr(x, y)
   3854         return dtype(np.sign(x[1] - x[0])*np.sign(y[1] - y[0])), 1.0
   3855 
-> 3856     xmean = x.mean(dtype=dtype)
   3857     ymean = y.mean(dtype=dtype)
   3858 

~/anaconda3/envs/sbert/lib/python3.8/site-packages/numpy/core/_methods.py in _mean(a, axis, dtype, out, keepdims)
    149             is_float16_result = True
    150 
--> 151     ret = umr_sum(arr, axis, dtype, out, keepdims)
    152     if isinstance(ret, mu.ndarray):
    153         ret = um.true_divide(

TypeError: No loop matching the specified signature and casting was found for ufunc add

nreimers commented 3 years ago

Hi @iknoorjobs

nboost/pt-biobert-base-msmarco outputs two scores: the first for "not_relevant", the second for "relevant". Sadly this is not compatible with CECorrelationEvaluator. CECorrelationEvaluator expects the model to output only a single score and compares that single score with the gold score using Spearman rank correlation.
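
Roughly, the evaluator does something like the sketch below (placeholder model name and toy data), which is why a (num_pairs, 2) output breaks it:

from scipy.stats import pearsonr, spearmanr
from sentence_transformers import CrossEncoder

# Toy dev data, just for illustration
sentence_pairs = [('a query', 'a matching passage'), ('a query', 'an unrelated passage')]
gold_scores = [1.0, 0.0]

model = CrossEncoder('some-single-output-cross-encoder', num_labels=1, max_length=512)
pred_scores = model.predict(sentence_pairs)          # 1-D array, one score per pair
eval_pearson, _ = pearsonr(gold_scores, pred_scores)
eval_spearman, _ = spearmanr(gold_scores, pred_scores)
# An nboost model instead returns an array of shape (num_pairs, 2) here,
# which is what makes pearsonr fail with the TypeError above.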

iknoorjobs commented 3 years ago

Hi @nreimers Ah, OK, thanks. I also checked the output of the model to see if it could be converted to a single output, but the two scores are quite different and don't add up to one.
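
(For what it is worth, the two numbers are raw logits, so they will not sum to one by themselves; a softmax over them would give a probability for the "relevant" class. Just a sketch, nothing the evaluator does for you:)

import torch

logits = torch.tensor([[-1.3, 2.7]])        # example [not_relevant, relevant] logits
probs = torch.softmax(logits, dim=1)        # the two values now sum to one
relevant_score = probs[:, 1]                # single score in [0, 1]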

Closing this now. Thanks.

iknoorjobs commented 3 years ago

Hi @nreimers

Is it possible to train the nboost or other passage reranking models (which give two scores, "not_relevant" and "relevant") using the latest cross-encoder training scripts with CERerankingEvaluator? I have a dataset in the format given below (score from 0 to 1) and I want to fine-tune these passage reranking models on it.

["Query", "passage", score]

Many thanks.

nreimers commented 3 years ago

Hi @iknoorjobs Yes, you can find an example here: https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/ms_marco/train_cross-encoder.py

It is based on the MS Marco dataset, where you have relevant passages [query, passage, 1] and irrelevant passages [query, passage, 0].

The score can also be somewhere between 0 and 1.
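
Building the training samples for that format then looks roughly like the sketch below (model name, data, and hyperparameters are placeholders):

from torch.utils.data import DataLoader
from sentence_transformers import CrossEncoder, InputExample

# ["Query", "passage", score] rows, with score anywhere in [0, 1]
train_samples = [
    InputExample(texts=['a query', 'a relevant passage'], label=1.0),
    InputExample(texts=['a query', 'an irrelevant passage'], label=0.0),
    InputExample(texts=['a query', 'a partially relevant passage'], label=0.5),
]

model = CrossEncoder('bert-base-uncased', num_labels=1, max_length=512)
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=16)
model.fit(train_dataloader=train_dataloader, epochs=1, warmup_steps=100)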

iknoorjobs commented 3 years ago

Hi @nreimers

Thank you for your response. However, when I try to load the nboost model, it shows the following error:

model = CrossEncoder("nboost/pt-bert-base-uncased-msmarco", num_labels=1, max_length=512)

RuntimeError: Error(s) in loading state_dict for BertForSequenceClassification:
    size mismatch for classifier.weight: copying a param with shape torch.Size([2, 768]) from checkpoint, the shape in current model is torch.Size([1, 768]).
    size mismatch for classifier.bias: copying a param with shape torch.Size([2]) from checkpoint, the shape in current model is torch.Size([1]).

When I load this model with num_labels changed to 2, it works. But after training one epoch on the above data format ["Query", "passage", score], it shows an error during evaluation because we only have a single label in the data. Does it make sense to convert the data to the format ["Query", "passage", 1-score, score], since we would then have two labels for "not_relevant" and "relevant"?

Thanks

nreimers commented 3 years ago

Hi @iknoorjobs The nboost models have the issue that they use 2 labels (relevant and not relevant). If you want to use such a model as the base, you need binary labels, i.e. int(0) and int(1).
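
In practice that means keeping the 2-label head and feeding integer class labels, roughly like the sketch below (toy data; with num_labels > 1 the CrossEncoder defaults to a cross-entropy loss, so graded float scores will not work there):

from sentence_transformers import CrossEncoder, InputExample

model = CrossEncoder('nboost/pt-bert-base-uncased-msmarco', num_labels=2, max_length=512)
train_samples = [
    InputExample(texts=['a query', 'a relevant passage'], label=1),     # int(1) = relevant
    InputExample(texts=['a query', 'an irrelevant passage'], label=0),  # int(0) = not_relevant
]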

I will release improved cross-encoder models for MS Marco today that: 1) are faster than the nboost models, 2) achieve better performance on the MS Marco & TREC DL 2019 datasets, and 3) use only a single output to indicate whether query and passage are relevant.

iknoorjobs commented 3 years ago

I will release today improved cross-encoder models for MS Marco that

@nreimers Fantastic news! Very much looking forward to the models today.

If you want to use this model as base, you need binary labels, i.e. int(0) and int(1) as labels.

Can your cross-encoder training scripts be used to train the model if I have a dataset with binary labels?

Many thanks

nreimers commented 3 years ago

@iknoorjobs The models are now online: https://github.com/UKPLab/sentence-transformers/tree/master/examples/applications/information-retrieval

Can your cross-encoder training scripts be used to train the model if I have a dataset with binary labels?

Yes. The MS Marco dataset has only binary labels (relevant or not relevant), which were encoded as 1 and 0.

It will also work if you have more fine-grained labels, like 0, 0.5, 0.8, and 1.

iknoorjobs commented 3 years ago

@nreimers Thanks a lot. And I must say, your work is great. Also, looking forward to the paper.