UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Multilingual NLI / Textual entailment using teacher-student models #1163

Closed · lighteternal closed this issue 3 years ago

lighteternal commented 3 years ago

Hi,

I am trying to create a model for Greek textual entailment. Since there is no Greek NLI dataset at the moment, I have to either: 1) use multilingual pretrained models, which in most cases have poor out-of-the-box performance, or 2) translate the Greek sentences to English before feeding them to the pretrained models.

I was wondering if there's a way to train such an NLI model for Greek, following the student-teacher example for STS that leverages parallel sentences (which I already have from previous NMT tasks). Is it possible to do this by altering the make_multilingual.py script, for example? If yes, which teacher and student models should I use? I also assume that a relevant evaluator must be added.

Many thanks for your time! :)

Edit: I noticed that there is a pretrained textual entailment model (nli-deberta-base) available. So I tried to load it into the make_multilingual.py script using the following teacher-student setup:

teacher_model_name = 'cross-encoder/nli-deberta-base'   
student_model_name = 'xlm-roberta-base'    

but, as expected, since the trained xlm-roberta student does not have the correct classification head, it fails at inference:

---------------------------------------------------------------------------
AxisError                                 Traceback (most recent call last)
<ipython-input-2-7d851b07dd4f> in <module>
      6 #Convert scores to labels
      7 label_mapping = ['contradiction', 'entailment', 'neutral']
----> 8 labels = [label_mapping[score_max] for score_max in scores.argmax(axis=1)]
      9 
     10 scores

AxisError: axis 1 is out of bounds for array of dimension 1
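
(For context: the snippet that maps scores to labels expects the 2-D (n_pairs, 3) logit matrix that a three-way NLI cross-encoder returns; a model without that classification head gives back a flat 1-D array, so argmax(axis=1) has no second axis. Below is a minimal sketch of the intended usage with the pretrained cross-encoder; the sentence pairs are made up for illustration.)

```python
from sentence_transformers import CrossEncoder

# Hypothetical sentence pairs, purely for illustration.
sentence_pairs = [
    ("A man is eating food.", "A man is eating something."),
    ("A man is eating food.", "The man is sleeping."),
]

# A 3-way NLI cross-encoder returns one row of logits per pair,
# so `scores` has shape (n_pairs, 3) and argmax(axis=1) works.
model = CrossEncoder("cross-encoder/nli-deberta-base")
scores = model.predict(sentence_pairs)

label_mapping = ["contradiction", "entailment", "neutral"]
labels = [label_mapping[score_max] for score_max in scores.argmax(axis=1)]
print(labels)
```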

Any hint on an appropriate teacher-student setup is much appreciated!

nreimers commented 3 years ago

The teacher-student setup from multilingual knowledge distillation only makes sense for embedding models, not really for NLI.

Take a multilingual model (like xlm-roberta-base), translate your NLI train data to Greek, then train on English+Greek NLI data using a cross encoder: https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/cross-encoder/training_nli.py
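
For reference, here is a minimal sketch of such a cross-encoder training setup, loosely following the linked training_nli.py script; the toy in-memory samples, warmup steps, and output path are placeholders rather than the script's exact values.

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample
from sentence_transformers.cross_encoder import CrossEncoder
from sentence_transformers.cross_encoder.evaluation import CESoftmaxAccuracyEvaluator

label2int = {"contradiction": 0, "entailment": 1, "neutral": 2}

# Hypothetical in-memory data; in practice this would be the combined
# English + translated-Greek AllNLI pairs loaded from file.
train_samples = [
    InputExample(texts=["A man is eating food.", "A man is eating something."],
                 label=label2int["entailment"]),
    InputExample(texts=["Ένας άντρας τρώει φαγητό.", "Ο άντρας κοιμάται."],
                 label=label2int["contradiction"]),
]
dev_samples = train_samples  # placeholder; use a held-out dev split in practice

# Multilingual encoder with a 3-class classification head on top.
model = CrossEncoder("xlm-roberta-base", num_labels=len(label2int))

train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=16)
evaluator = CESoftmaxAccuracyEvaluator.from_input_examples(dev_samples, name="AllNLI-dev")

model.fit(
    train_dataloader=train_dataloader,
    evaluator=evaluator,
    epochs=1,
    evaluation_steps=10000,
    warmup_steps=1000,
    output_path="output/xlmr-nli-en-el",  # hypothetical output path
)
```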

lighteternal commented 3 years ago

Thank you @nreimers! As a final question, do you think it's better to train a model (e.g. xlm-roberta) from scratch on a mix of English-Greek data, or to finetune an existing (if any) multilingual model using only the Greek (i.e., translated) subset with the above script?

lighteternal commented 3 years ago

@nreimers Following your advice, I translated the AllNLI dataset to Greek and used the training_nli cross-encoder script to train on the bilingual English-Greek AllNLI dataset. However, the reported accuracy is pretty low, fluctuating around 50-60% after a few iterations:


| epoch | steps | Accuracy |
| --- | --- | --- |
| 0 | 10000 | 0.544991706009953 |
| 0 | 20000 | 0.550950618859257 |
| 0 | 30000 | 0.556998851601378 |
| 0 | 40000 | 0.550376419548297 |
| 0 | 50000 | 0.558989409212709 |
| 0 | 60000 | 0.560533367359959 |
| 0 | 70000 | 0.56220492535409 |
| 0 | 80000 | 0.575768789077453 |
| 0 | 90000 | 0.572259793288248 |
| 0 | 100000 | 0.573561311726426 |
| 0 | 110000 | 0.58319510016588 |
| 0 | 120000 | 0.582136021436774 |
| 0 | 130000 | 0.573625111649866 |
| 0 | 140000 | 0.572910552507337 |
| 0 | 150000 | 0.57881842541789 |
| 0 | 160000 | 0.581383182340181 |
| 0 | 170000 | 0.581855301773638 |
| 0 | 180000 | 0.582837820594615 |
| 0 | 190000 | 0.59323720811535 |
| 0 | 200000 | 0.587482455021054 |
| 0 | 210000 | 0.585058057930331 |
| 0 | 220000 | 0.589396452724257 |
| 0 | 230000 | 0.598507081791502 |
| 0 | -1 | 0.598609161669006 |

The training parameters are as follows:

train_batch_size = 10  # increasing leads to CUDA memory error
num_epochs = 1

model = CrossEncoder('xlm-roberta-base', num_labels=len(label2int))

Shouldn't the accuracy be in the range of 80-85% at least? Needless to say, adding more epochs doesn't help.

nreimers commented 3 years ago

Yes, that is quite low. Check your translation, training and evaluation setup.

lighteternal commented 3 years ago

Well, sometimes the simplest advice is the most useful. I had messed up the stacking of the two dataframes (the Greek and English ones), producing random pairs. Everything now works as intended (accuracy ~84% after the first epoch). I will upload the model (and dataset) to HuggingFace after finetuning.
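
(For anyone hitting the same issue: the fix is simply row-wise concatenation, so that each premise stays aligned with its own hypothesis and label. A minimal pandas sketch with hypothetical column names and toy rows:)

```python
import pandas as pd

# Toy rows with hypothetical column names; the real AllNLI files differ.
df_en = pd.DataFrame({
    "sentence1": ["A man is eating food."],
    "sentence2": ["A man is eating something."],
    "label": ["entailment"],
})
df_el = pd.DataFrame({
    "sentence1": ["Ένας άντρας τρώει φαγητό."],
    "sentence2": ["Ένας άντρας τρώει κάτι."],
    "label": ["entailment"],
})

# Stack row-wise: every premise keeps its own hypothesis and label.
# A column-wise join (axis=1) or shuffling only one frame would
# silently produce random, mislabelled pairs.
df_all = pd.concat([df_en, df_el], ignore_index=True)
print(df_all)
```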

Many thanks for the help @nreimers! :100: