Unbabel / COMET

A Neural Framework for MT Evaluation
https://unbabel.github.io/COMET/html/index.html
Apache License 2.0

Sparsemax not actually used in COMET-KIWI, XCOMET-XL/XXL #195

Open emjotde opened 5 months ago

emjotde commented 5 months ago

Hi, I have been playing around with re-implementing some of your models in Marian, and while working through the code I noticed that you are not actually using sparsemax for COMET-KIWI and XCOMET-XL/XXL; instead you are falling back to a softmax.

In both cases you forgot to pass the layer_transformation parameter to its base class:

See here for UnifiedMetric https://github.com/Unbabel/COMET/blob/2bcf66604b30dcde98565854d5f36026c19f580a/comet/models/multitask/unified_metric.py#L106

and here for XCOMETMetric https://github.com/Unbabel/COMET/blob/2bcf66604b30dcde98565854d5f36026c19f580a/comet/models/multitask/xcomet_metric.py#L54

In both cases the layer_transformation parameter is missing from the argument list passed to the base class, and the base class defaults to softmax.
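
Schematically, the pattern being described looks like this (an illustrative sketch, not the actual COMET code):

class BaseMetric:
    # stand-in for the shared base class, which defaults to softmax
    def __init__(self, layer_transformation: str = "softmax"):
        self.layer_transformation = layer_transformation

class UnifiedMetricSketch(BaseMetric):
    def __init__(self, layer_transformation: str = "sparsemax"):
        # bug: the argument is accepted but never forwarded, so the base default wins
        super().__init__()
        # fix: super().__init__(layer_transformation=layer_transformation)

print(UnifiedMetricSketch().layer_transformation)  # prints "softmax"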

In my re-implementation I am reproducing your exact numbers for COMET-KIWI with a softmax, not the sparsemax, whereas sparsemax works fine for the reference-based COMET-22.

It's not clear to me if the model was trained with a softmax or sparsemax, but you might either have a train/inference mismatch here or at the very least your models are doing something different than you expected/described.

emjotde commented 5 months ago

Follow-up on that... I am also wondering if you realized that Roberta-XL and Roberta-XXL are pre-norm, while the base model you used for COMET-KIWI is post-norm, yet you treat them the same during training/inference. The huggingface implementation collects the hidden states without normalization for the XL models, with the exception of the very last hidden state, which is normed.

That seems to mean that the hidden states you use for your layer-mixing have wildly different magnitudes across layers -- the first and the last one (the most important one?) have very small norms, while the ones in between are un-normed. I am wondering if that wouldn't give you a really hard time when training the xComet-XXL models and skew the weighting during layer mixing?
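
For reference, one way to check the per-layer magnitudes is to print the mean hidden-state norm per layer (hypothetical snippet; facebook/xlm-roberta-xl is the assumed HF checkpoint behind the XL model):

import torch
from transformers import AutoModel, AutoTokenizer

name = "facebook/xlm-roberta-xl"  # assumed checkpoint; the same check works for any HF encoder
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_hidden_states=True)
batch = tokenizer("a short test sentence", return_tensors="pt")
with torch.no_grad():
    outputs = model(**batch)
# hidden_states[0] is the embedding output, hidden_states[-1] the final (normed) layer
for i, h in enumerate(outputs.hidden_states):
    print(i, h.norm(dim=-1).mean().item())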

ricardorei commented 5 months ago

@emjotde nothing like a re-implementation challenge to find bugs 😄... I just confirmed and you are right. It's defaulting to softmax instead of sparsemax.

>>> from comet import download_model, load_from_checkpoint
>>> model = load_from_checkpoint(download_model("Unbabel/wmt23-cometkiwi-da-xxl"))
>>> model.layerwise_attention.transform_fn
<built-in method softmax of type object at 0x7fda5cbd2460>
>>> model.layerwise_attention.layer_norm
False

Same thing for the XCOMET models.

Regarding Roberta-XL and XXL, I realised the change from post-norm to pre-norm, but I did not realise the impact on the embeddings returned from HF. Actually, HF took a long, long time to integrate Roberta-XL/XXL because of this issue... but I never inspected the magnitudes across layers.

Btw the rationale for using sparsemax instead of softmax was not performance related. Our goal when integrating sparsemax was to study whether all layers are relevant or not. The performance between sparsemax and softmax is usually the same. Yet, for wmt22-comet-da, because of sparsemax, we can clearly observe which layers are relevant:

e.g.:

>>> model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
>>> weights = torch.cat([parameter for parameter in model.layerwise_attention.scalar_parameters])
>>> normed_weights = model.layerwise_attention.transform_fn(weights, dim=0)
>>> normed_weights
tensor([0.0849, 0.0738, 0.0504, 0.0463, 0.0166, 0.0125, 0.0103, 0.0027, 0.0000,
        0.0000, 0.0007, 0.0088, 0.0151, 0.0463, 0.0591, 0.0466, 0.0516, 0.0552,
        0.0581, 0.0621, 0.0666, 0.0609, 0.0621, 0.0645, 0.0448],
       grad_fn=<SparsemaxFunctionBackward>)

Here we can see that some layers are set to 0 and thus ignored. This provides some level of interpretability... Ideally, the model would ignore the top layers and we could prune them after training (unfortunately this usually does not happen).
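
For reference, a toy comparison of the two transforms (assuming the sparsemax here comes from the entmax package, which matches the SparsemaxFunctionBackward in the output above):

import torch
from entmax import sparsemax  # pip install entmax

logits = torch.tensor([2.0, 1.5, 0.2, -1.0, -3.0])
print(torch.softmax(logits, dim=0))  # every layer keeps some probability mass
print(sparsemax(logits, dim=0))      # low-scoring layers get exact zeros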

With XCOMET, the learned weights are all very similar... but, like you said, probably because of the different norms?

>>> model = load_from_checkpoint(download_model("Unbabel/XCOMET-XL"))
>>> weights = torch.cat([parameter for parameter in model.layerwise_attention.scalar_parameters])
>>> normed_weights = model.layerwise_attention.transform_fn(weights, dim=0)
>>> normed_weights
tensor([0.0285, 0.0267, 0.0267, 0.0267, 0.0267, 0.0267, 0.0267, 0.0267, 0.0267,
        0.0267, 0.0267, 0.0267, 0.0267, 0.0267, 0.0267, 0.0267, 0.0268, 0.0268,
        0.0268, 0.0268, 0.0268, 0.0269, 0.0270, 0.0271, 0.0271, 0.0272, 0.0273,
        0.0273, 0.0273, 0.0273, 0.0273, 0.0273, 0.0273, 0.0273, 0.0273, 0.0272,
        0.0287], grad_fn=<SoftmaxBackward0>)

Also, not sure if you noticed, but we only use the layerwise attention for creating the sentence embeddings that are used for regression. The embeddings used for classifying the individual tokens as error spans are those from the word_layer (model.hparams.word_layer). We have not played a lot with this hyper-parameter, but our goal was to make an individual layer more specialised on that task (usually a top layer, because it's closer to the MLM objective), while for regression we would like to pool information from all layers.
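
For context, the layer mixing discussed here can be sketched roughly as an ELMo-style scalar mix (a minimal illustration, not the actual LayerwiseAttention module; the word-level path is only hinted at in a comment):

import torch
import torch.nn as nn

class ScalarMixSketch(nn.Module):
    # Learned scalar weight per hidden state; the pooled result feeds the regression head.
    def __init__(self, num_layers: int):
        super().__init__()
        self.scalar_parameters = nn.Parameter(torch.zeros(num_layers))
        self.gamma = nn.Parameter(torch.ones(1))
    def forward(self, hidden_states):  # list of [batch, seq, dim] tensors
        weights = torch.softmax(self.scalar_parameters, dim=0)  # or sparsemax
        return self.gamma * sum(w * h for w, h in zip(weights, hidden_states))

# word-level error spans would instead read a single layer:
# token_states = hidden_states[word_layer]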

I am wondering if that wouldn't give you a really hard time when training the xComet-XXL models and skew the weighting during layer mixing?

It did not... I was actually surprised, but training was very stable from the get-go. I had some issues with distributed training and pytorch-lightning and ended up implementing something without Lightning, but after that was done, training was smooth.

emjotde commented 5 months ago

Yeah, I am currently not looking at the word-level predictions yet; I stopped at the regressor implementation.

Regarding the weights above, the fact that they are near-uniform after softmax despite the norms over the hidden states being so different is what made me wonder whether proper learning happens or rather some form of saturation (always hard to tell with those neural models).

I would have expected the model to strongly push down the weights for the layers with high norms. On the other hand, if this becomes basically an unweighted arithmetic average, then the two very small vectors pull everything down by a lot, considering that averages reward outliers. Who knows...
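
A toy illustration of how such a magnitude mismatch carries through a near-uniform mix (hypothetical norms, not measured from the real model):

import torch

# two "normed" layers (scale ~0.05) and six un-normed layers (scale ~5.0)
hidden = [torch.randn(1024) * s for s in (0.05, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 0.05)]
weights = torch.softmax(torch.zeros(len(hidden)), dim=0)  # near-uniform, as observed above
mixed = sum(w * h for w, h in zip(weights, hidden))
print([round(h.norm().item(), 2) for h in hidden])  # per-layer norms differ by orders of magnitude
print(round(mixed.norm().item(), 2))                # the mix's scale is set by the un-normed layers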

ricardorei commented 5 months ago

It's the black magic art of NNs 🙂