The hyper-parameters are these ones, and regarding the data, we used all the DA data here, including the MLQE-PE data.
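For concreteness, concatenating those datasets might look like the minimal sketch below; the file names are placeholders for illustration, not the actual dataset paths.

```python
import pandas as pd

# Hypothetical file names; substitute the actual WMT DA and MLQE-PE exports.
da_files = ["wmt17-da.csv", "wmt18-da.csv", "wmt19-da.csv", "mlqe-pe.csv"]
data = pd.concat([pd.read_csv(f) for f in da_files], ignore_index=True)
```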
To make the model output a score between 0 and 1, you just have to rescale the data after concatenating everything. The rescaling is as follows:

1) Find a reasonable "min value". This is done by finding all annotations with more than one annotator where all annotators agreed that the score was 0. Your "min_value" is then the average z-score of those segments.
2) Find a reasonable "max value". This is done by finding all annotations with more than one annotator where all annotators agreed that the translation is perfect (a score of 100). Your "max_value" is then the average z-score of those segments.
3) Apply a min-max scaler to your data and truncate every score above 1 and below 0.
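Here is a minimal sketch of that procedure in pandas, assuming one row per (segment, annotator) with columns `segment_id`, `raw_score` (the 0-100 DA score), and `z_score`; these column names are assumptions for illustration, not the actual COMET data format.

```python
import pandas as pd

def rescale_da(annotations: pd.DataFrame) -> pd.DataFrame:
    """Min-max rescale averaged z-scores to [0, 1], anchored on segments
    where all annotators agreed on the extremes (0 or 100 raw DA score).

    Assumed columns (hypothetical): segment_id, raw_score, z_score.
    """
    stats = annotations.groupby("segment_id").agg(
        n_annotators=("raw_score", "size"),
        min_raw=("raw_score", "min"),
        max_raw=("raw_score", "max"),
        z_mean=("z_score", "mean"),
    )
    multi = stats[stats["n_annotators"] > 1]

    # 1) "min value": average z-score of segments all annotators scored 0
    min_value = multi.loc[multi["max_raw"] == 0, "z_mean"].mean()

    # 2) "max value": average z-score of segments all annotators scored 100
    max_value = multi.loc[multi["min_raw"] == 100, "z_mean"].mean()

    # 3) min-max scale the per-segment z-scores and truncate to [0, 1]
    stats["score"] = ((stats["z_mean"] - min_value) / (max_value - min_value)).clip(0.0, 1.0)
    return stats
```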
This rescaling won't affect your model's correlations, and the model will output scores in that range. It is the same rescaling used in BLEURT.
Btw, the seed used is 91 if I am not mistaken: `comet-train --cfg YOUR_CONFIGS.yaml --seed_everything 91`
I want to train the Unbabel/wmt22-comet-da model from scratch. It would be very helpful if someone could provide the list of datasets, pre-processing steps, and hyperparameters.