Hi @NourKhdour, this is not well documented and the documentation pages are really outdated. Sorry about that!
Basically, the Ranking model requires a different data format. It requires "relative-ranks" data. This means that for each source and reference sentence, you are required to have a "positive sample" and a "negative sample".
The csv should be something like:
src | ref | pos | neg |
---|---|---|---|
"Olá Mundo!" | "Hello World!" | "Hello World" | "Hi world." |
The positive sample is basically a superior translation. This is typically derived from DA annotations when, for the same segment, there is a difference of 25 points between annotations.
You don't need any score column. The model will learn from a contrastive loss.
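If it helps, here is the same example written out with pandas (a minimal sketch; the file name is made up, the column names follow this thread):

```python
import pandas as pd

# Minimal sketch of the "relative-ranks" CSV described above.
# No score column is needed: the model learns from a contrastive loss.
data = pd.DataFrame(
    {
        "src": ["Olá Mundo!"],
        "ref": ["Hello World!"],
        "pos": ["Hello World"],  # the better translation
        "neg": ["Hi world."],    # the worse translation
    }
)
data.to_csv("relative_ranks.csv", index=False)
```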
So the "positive sample" supposed to be better than the "reference" or not?
@ricardorei I have the following data:
MT Raw means the raw machine translation output; MTPE means the machine translation output after post-editing.
source_text | MT Raw | MTPE | Levenshtein | BLEU | TER | CHRF |
---|---|---|---|---|---|---|
ه. لا تتحمل الوزارة أي أتعاب أو مستحقات مالية أو خلافة لفريق العمل. | e. The Ministry shall not bear any fees, financial dues or succession to the work team | e. The Ministry shall not bear any fees, financial dues, etc. related to the work team. | 13 | 71.22 | 18.75 | 79.75 |
and I want to transform it into the "relative-ranks" format as follows (see the sketch below):
- src = source_text
- ref = MTPE
- pos = MT Raw, if the scores indicate that the difference between MT Raw and MTPE is small
- neg = a translation generated by a worse MT model
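Something like this sketch (it assumes a hypothetical extra column "MT Weak" holding output from a weaker MT system, and an arbitrary TER threshold of 20 for "the difference is small"):

```python
import pandas as pd

# Sketch of the transformation above. "annotations.csv" and the "MT Weak"
# column are hypothetical; the TER threshold of 20 is an arbitrary choice.
df = pd.read_csv("annotations.csv")

close_enough = df["TER"] < 20  # MT Raw is close to MTPE, so it can act as "pos"
relative_ranks = pd.DataFrame(
    {
        "src": df.loc[close_enough, "source_text"],
        "ref": df.loc[close_enough, "MTPE"],
        "pos": df.loc[close_enough, "MT Raw"],
        "neg": df.loc[close_enough, "MT Weak"],  # output of a worse MT system
    }
)
relative_ranks.to_csv("relative_ranks.csv", index=False)
```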
Can it work? And what is the best metric to use for choosing the 'pos' sentences? I would be happy if you have other suggestions for generating positive and negative samples without effort from translators.
To answer your first question:
> So is the "positive sample" supposed to be better than the "reference" or not?
Not really. The positive sample has to be better than the negative sample (closer to the reference meaning) but it does not need to be better than the reference.
In fact, the reference and the positive sample can be permuted. There is a parameter in the TripletMarginLoss to do that. We don't use it inside COMET, though, but it's just to say that the relationship between the positive sample and the reference is symmetric.
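For illustration, this is the parameter I mean, in PyTorch's TripletMarginLoss (a minimal sketch with random embeddings, not the code inside COMET):

```python
import torch
import torch.nn as nn

# swap=True lets the anchor (reference) and the positive trade places when
# computing the distance to the negative. COMET does not use this flag.
triplet_loss = nn.TripletMarginLoss(margin=1.0, swap=True)

anchor = torch.randn(8, 768)    # reference embeddings (batch of 8)
positive = torch.randn(8, 768)  # embeddings of the better translation
negative = torch.randn(8, 768)  # embeddings of the worse translation
loss = triplet_loss(anchor, positive, negative)
```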
Regarding your second question:
> Can it work?
Basically what you are trying to do is to build contrastive data. It can work... the challenge is to create "hard" negatives for the model to learn something meaningful. There is a lot of literature on contrastive learning that tries to address these challenges...
There are a lot of challenges in what you are trying to do: 1) if you use MTPE as the reference and the raw MT as the positive, the model might learn to simply "approximate" those sentences based on lexical overlap; 2) if you generate a negative sample using a worse MT system, how do you guarantee that the translation is actually worse?
Ideally, you want your positive samples to be semantically correct but lexically diverse (in comparison to the reference). At the same time, you want to guarantee that your negative samples are in fact worse translations, which might be challenging; see the sketch below.
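As an illustration (not something COMET ships), one way to sanity-check a candidate negative is to compare it against the positive with an automatic metric; the 10-point chrF margin below is an arbitrary assumption:

```python
from sacrebleu.metrics import CHRF

chrf = CHRF()

def is_valid_negative(ref: str, pos: str, neg: str, margin: float = 10.0) -> bool:
    """Accept the negative only if it scores clearly below the positive."""
    pos_score = chrf.sentence_score(pos, [ref]).score
    neg_score = chrf.sentence_score(neg, [ref]).score
    return neg_score < pos_score - margin
```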
With that said, I think these are interesting things to investigate... When we first launched COMET, I believed this ranking-based approach was super promising. Then I started working more on the "regression" models because the data is easier to find and build, but I still have hopes for the ranking approach.
@ricardorei I trained my own regression model, and when I run the following command to use my metric:
```
comet-score -s src.de -t hyp1.en -r ref.en --model PATH/TO/CHECKPOINT
```
I get the following error:
```
comet-score: error: argument --model: invalid choice: '/lightning_logs/version_5/checkpoints/epoch=4-step=999.ckpt' (choose from 'emnlp20-comet-rank', 'wmt20-comet-da', 'wmt20-comet-qe-da', 'wmt21-comet-mqm', 'wmt21-cometinho-da', 'wmt21-comet-qe-mqm')
```
Which version are you using?
You should be able to run it, but I was checking the code and the error message should actually be more explicit.
You are probably passing an invalid path.
Try `lightning_logs/version_5/checkpoints/epoch=4-step=999.ckpt` instead of `/lightning_logs/version_5/checkpoints/epoch=4-step=999.ckpt` (no leading slash).
The same thing happens even if I pass the full path!
Which version are you using? Is it the master code?
Yes, the latest commit on master: "c772b67".
Hmm, that seems strange. Can you put a breakpoint at L161 and check the if condition? It's basically failing on `os.path.exists(cfg.model)`.
I just can't figure out why... Locally I can use it by passing the path to a checkpoint.
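For reference, the logic there is roughly the following (a paraphrased sketch, not the exact repo code; the `available_metrics` list is abbreviated):

```python
import os

# If cfg.model is neither a released metric name nor an existing file path,
# argparse raises the "invalid choice" error shown above.
available_metrics = ["wmt20-comet-da", "wmt21-comet-mqm"]  # abbreviated

def resolve_model(model: str) -> str:
    if model in available_metrics:
        return model  # a released metric: the checkpoint gets downloaded
    if os.path.exists(model):
        return model  # a local checkpoint path also works
    raise ValueError(f"invalid choice: {model!r}")
```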
Thank you so much :)
I want to train a ranker model for the evaluation of English-Arabic data. My data follows the "src, mt, ref, score" format based on https://github.com/Unbabel/COMET/blob/0.1.0/docs/source/training.md
When I use comet-train to train my own evaluation metric using "bert-base-multilingual-cased", I get the following error:
```
Some weights of the model checkpoint at google/bert_uncased_L-2_H-128_A-2 were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias']
```
What is your question?
What is the data format for training my own ranker metric? And what are 'pos' and 'neg'? What is the scale for the "score" column in the data?