Unbabel / COMET

A Neural Framework for MT Evaluation
https://unbabel.github.io/COMET/html/index.html
Apache License 2.0

Training data and scripts used for wmt22-cometkiwi-da #217

Open rohitk-cognizant opened 5 months ago

rohitk-cognizant commented 5 months ago

Hi Team,

Could you share the training data and training scripts used for wmt22-cometkiwi-da? We would like to use them as a reference for training on our own sample reference data.

ricardorei commented 5 months ago

Hi @rohitk-cognizant,

To train wmt22-cometkiwi-da you just have to run:

comet-train --cfg configs/models/{your_model_config}.yaml

Your config should look something like this:

unified_metric:
  class_path: comet.models.UnifiedMetric
  init_args:
    nr_frozen_epochs: 0.3
    keep_embeddings_frozen: True
    optimizer: AdamW
    encoder_learning_rate: 1.0e-06
    learning_rate: 1.5e-05
    layerwise_decay: 0.95
    encoder_model: XLM-RoBERTa
    pretrained_model: microsoft/infoxlm-large
    sent_layer: mix
    layer_transformation: sparsemax
    word_layer: 24
    loss: mse
    dropout: 0.1
    batch_size: 16
    train_data: 
      - TRAIN_DATA.csv
    validation_data: 
      - VALIDATION_DATA.csv
    hidden_sizes:
      - 3072
      - 1024
    activations: Tanh
    input_segments:
      - mt
      - src
    word_level_training: False

trainer: ../trainer.yaml
early_stopping: ../early_stopping.yaml
model_checkpoint: ../model_checkpoint.yaml

rohitk-cognizant commented 5 months ago

Hi @ricardorei ,

Thanks for the update. Can I use the same training parameters as in the trainer.yaml file on the master branch?

ricardorei commented 5 months ago

Hmm, maybe you should change them a bit. For example, to train on a single GPU (which is usually faster) with 16-bit precision, use this:

  accelerator: gpu
  devices: 1
  # strategy: ddp # Uncomment this line for distributed training
  precision: 16

You might also want to consider reducing accumulate_grad_batches from 8 to 2:

  accumulate_grad_batches: 2
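
To see what that change means in practice, here is a rough sketch of the arithmetic, assuming the usual convention that the number of samples per optimizer step is the per-device batch size times the accumulation steps times the number of devices. The function name `effective_batch_size` is just illustrative, not part of COMET.

```python
def effective_batch_size(batch_size: int, accumulate_grad_batches: int, devices: int = 1) -> int:
    """Samples that contribute to a single optimizer step (illustrative helper)."""
    return batch_size * accumulate_grad_batches * devices


# batch_size: 16 comes from the model config earlier in this thread;
# 8 and 2 are the accumulate_grad_batches values discussed above.
with_accumulation_8 = effective_batch_size(16, 8)
with_accumulation_2 = effective_batch_size(16, 2)

print(with_accumulation_8)  # 128
print(with_accumulation_2)  # 32
```

So on a single GPU the suggested setting updates the weights four times as often, with each update based on a quarter as many samples.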

satya77 commented 1 month ago

What format should the training data be in?
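
Nothing in this thread confirms the data format, so treat the following as a sketch under an assumption: that each CSV has one column per entry in input_segments (here mt and src) plus a numeric score column. The rows are made-up placeholders, and `to_csv` is a hypothetical helper, not a COMET function. Please check the repository documentation before relying on it.

```python
import csv
import io

# input_segments as listed in the config earlier in this thread.
INPUT_SEGMENTS = ["mt", "src"]
# Assumption: one column per input segment, plus a numeric "score" column.
COLUMNS = INPUT_SEGMENTS + ["score"]

# Placeholder rows for illustration only; these are not real annotations.
rows = [
    {"src": "Hallo Welt", "mt": "Hello world", "score": 0.9},
    {"src": "Guten Morgen", "mt": "Good evening", "score": 0.2},
]


def to_csv(records, columns):
    """Serialize records to CSV text with a header row (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = to_csv(rows, COLUMNS)
print(csv_text)
```

Writing `csv_text` to a file would give something that could stand in for TRAIN_DATA.csv or VALIDATION_DATA.csv in the config, if the column assumption holds.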