Unbabel / COMET

A Neural Framework for MT Evaluation
https://unbabel.github.io/COMET/html/index.html
Apache License 2.0

Training my own metric #61

Closed NourKhdour closed 2 years ago

NourKhdour commented 2 years ago

I want to train a ranker model for evaluating English-Arabic data. My data contains the columns "src, mt, ref, score", based on https://github.com/Unbabel/COMET/blob/0.1.0/docs/source/training.md

When I use comet-train to train my own evaluation metric with "bert-base-multilingual-cased", I get the following error:

Some weights of the model checkpoint at google/bert_uncased_L-2_H-128_A-2 were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias']

What is your question?

What is the data format for training my own ranker metric? And what are ['pos', 'neg']? What is the scale for the "score" column in the data?

ricardorei commented 2 years ago

Hi @NourKhdour, this is not well documented and the documentation pages are really outdated. Sorry about that!

Basically, the Ranking model requires a different data format: "relative-ranks" data. This means that for each source and reference sentence, you need a "positive sample" and a "negative sample".

The csv should be something like:

| src | ref | pos | neg |
| --- | --- | --- | --- |
| "Olá Mundo!" | "Hello World!" | "Hello World" | "Hi world." |

The positive sample is basically the better translation of the pair. This is typically derived from DA annotations: when, for the same segment, two annotated translations differ by 25 points or more, the higher-scoring one becomes the positive sample and the lower-scoring one the negative.
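For illustration, here is a rough sketch (not COMET code) of how such relative-ranks rows could be built from DA-style annotations with pandas; the 25-point threshold follows the description above, while the column names and the overall approach are my own assumptions:

```python
import pandas as pd

# Hypothetical DA annotations: several scored MT outputs per (src, ref) segment.
da = pd.DataFrame([
    {"src": "Olá Mundo!", "ref": "Hello World!", "mt": "Hello World", "da_score": 92},
    {"src": "Olá Mundo!", "ref": "Hello World!", "mt": "Hi world.", "da_score": 55},
])

rows = []
for (src, ref), group in da.groupby(["src", "ref"]):
    ranked = group.sort_values("da_score", ascending=False)
    best, worst = ranked.iloc[0], ranked.iloc[-1]
    # Only keep pairs whose DA scores differ by at least 25 points:
    # the higher-scoring translation becomes "pos", the lower-scoring one "neg".
    if best["da_score"] - worst["da_score"] >= 25:
        rows.append({"src": src, "ref": ref, "pos": best["mt"], "neg": worst["mt"]})

pd.DataFrame(rows).to_csv("relative_ranks.csv", index=False)
```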

ricardorei commented 2 years ago

You don't need any score column. The model will learn from a contrastive loss.

NourKhdour commented 2 years ago

So the "positive sample" supposed to be better than the "reference" or not?

NourKhdour commented 2 years ago

@ricardorei I have the following data:

MT Raw means the machine translation output; MTPE means the post-edited machine translation output.

| source_text | MT Raw | MTPE | Levenshtein | BLEU | TER | CHRF |
| --- | --- | --- | --- | --- | --- | --- |
| ه. لا تتحمل الوزارة أي أتعاب أو مستحقات مالية أو خلافة لفريق العمل. | e. The Ministry shall not bear any fees, financial dues or succession to the work team | e. The Ministry shall not bear any fees, financial dues, etc. related to the work team. | 13 | 71.22 | 18.75 | 79.75 |

I want to transform it into the "relative-ranks" format as follows (roughly as in the sketch below):

- src = source_text
- ref = MTPE
- pos = MT Raw, when the scores indicate that the difference between MT Raw and MTPE is small
- neg = a translation generated with a worse MT model
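For illustration, roughly the transformation I have in mind (column names follow my table above; the thresholds are arbitrary guesses, and the worse-MT output is assumed to live in a hypothetical worse_mt column):

```python
import pandas as pd

# my_data.csv holds the columns shown above:
# source_text, MT Raw, MTPE, Levenshtein, BLEU, TER, CHRF
df = pd.read_csv("my_data.csv")

# Keep rows where MT Raw is already close to MTPE (low TER, high CHRF),
# so that MT Raw can act as the positive sample; the thresholds are guesses.
close = df[(df["TER"] < 25) & (df["CHRF"] > 70)]

relative_ranks = pd.DataFrame({
    "src": close["source_text"],
    "ref": close["MTPE"],
    "pos": close["MT Raw"],
    # Hypothetical column: output of a deliberately weaker MT system for the same source.
    "neg": close["worse_mt"],
})
relative_ranks.to_csv("relative_ranks.csv", index=False)
```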

Can it work? And what is the best metric we can use to choose 'pos' sentences? I would be happy if you have other suggestions for generating positive and negative samples without requiring effort from translators.

ricardorei commented 2 years ago

To answer your first question:

> So is the "positive sample" supposed to be better than the "reference" or not?

Not really. The positive sample has to be better than the negative sample (closer to the reference meaning) but it does not need to be better than the reference.

In fact, the reference and the positive sample can be permuted. There is a parameter in the TripletMarginLoss to do that. We don't use it inside COMET, though, but it's just to say that the relationship between the positive sample and the reference is symmetric.
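A minimal sketch of that parameter in plain PyTorch (this is torch.nn.TripletMarginLoss itself, not COMET's training code; the random tensors stand in for sentence embeddings):

```python
import torch
import torch.nn as nn

# Anchor = reference embedding, positive = better translation, negative = worse translation.
anchor = torch.randn(4, 768)
positive = torch.randn(4, 768)
negative = torch.randn(4, 768)

# swap=True uses min(d(anchor, negative), d(positive, negative)) as the negative distance,
# i.e. it exploits the fact that the anchor (reference) and the positive sample are interchangeable.
loss_fn = nn.TripletMarginLoss(margin=1.0, swap=True)
loss = loss_fn(anchor, positive, negative)
print(loss.item())
```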

Regarding your second question:

> Can it work?

Basically what you are trying to do is to build contrastive data. It can work... the challenge is to create "hard" negatives for the model to learn something meaningful. There is a lot of literature on contrastive learning that tries to address these challenges...

There are a lot of challenges in what you are trying to do:

1) If you use MTPE as the reference and the raw MT as the positive sample, the model might learn to simply "approximate" those sentences based on lexical overlap.
2) If you generate the negative sample using a worse MT system, how do you guarantee that the translation is actually worse?

Ideally, you want your positive samples to be semantically correct but lexically diverse (in comparison to the reference). At the same time, you want to guarantee that your negative samples are in fact worse translations (which might be challenging).

With that said, I think these are interesting things to investigate... When we first launched COMET I believed that the ranking-based approach was super promising. Then I started working more on the "regression" models because the data is easier to find and build, but I still have hopes for the ranking approach.

NourKhdour commented 2 years ago

@ricardorei I trained my own regression model, and when I run the following command to use my metric:

comet-score -s src.de -t hyp1.en -r ref.en --model PATH/TO/CHECKPOINT

I get the following error:

comet-score: error: argument --model: invalid choice: '/lightning_logs/version_5/checkpoints/epoch=4-step=999.ckpt' (choose from 'emnlp20-comet-rank', 'wmt20-comet-da', 'wmt20-comet-qe-da', 'wmt21-comet-mqm', 'wmt21-cometinho-da', 'wmt21-comet-qe-mqm')

ricardorei commented 2 years ago

Which version are you using?

ricardorei commented 2 years ago

You should be able to run it, but I was checking the code and the error message should actually be more explicit.

You are probably passing an invalid path.

Try lightning_logs/version_5/checkpoints/epoch=4-step=999.ckpt instead of /lightning_logs/version_5/checkpoints/epoch=4-step=999.ckpt (the leading slash turns it into an absolute path starting at the filesystem root).
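If the CLI keeps rejecting the path, another way to narrow this down is to load the checkpoint through the Python API (a rough sketch; it assumes an install that exposes load_from_checkpoint, and the src/mt/ref strings are placeholders):

```python
from comet import load_from_checkpoint

# Load the trained regression model straight from the Lightning checkpoint.
model = load_from_checkpoint("lightning_logs/version_5/checkpoints/epoch=4-step=999.ckpt")

# Score one placeholder triplet on CPU; real usage would read the src/hyp/ref files instead.
data = [{"src": "Olá Mundo!", "mt": "Hi world.", "ref": "Hello World!"}]
output = model.predict(data, batch_size=8, gpus=0)
print(output)
```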

NourKhdour commented 2 years ago

The same thing happens even if I pass the full path!

ricardorei commented 2 years ago

Which version are you using? Is it the master code?

NourKhdour commented 2 years ago

Yes, the last commit of master, "c772b67".

ricardorei commented 2 years ago

Hmm, that seems strange. Can you put a breakpoint at L161 and check the if condition? It's basically failing on os.path.exists(cfg.model), and I just can't figure out why... Locally I can use it by passing the path to a checkpoint.
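As a quick sanity check (plain Python, using the path from your error message), you can also verify what os.path.exists sees from the directory where you run comet-score:

```python
import os

# Same check the CLI performs on the --model argument; run from the directory
# where comet-score is invoked, with the path taken from the error message above.
ckpt = "lightning_logs/version_5/checkpoints/epoch=4-step=999.ckpt"
print(os.path.abspath(ckpt), os.path.exists(ckpt))
```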

NourKhdour commented 2 years ago

Thank you so much :)