Unbabel / COMET

A Neural Framework for MT Evaluation
https://unbabel.github.io/COMET/html/index.html
Apache License 2.0

Avoid downloading XLM-R checkpoint from huggingface #116

Closed · K024 closed this issue 1 year ago

K024 commented 1 year ago

🚀 Feature

Allow users to download only the COMET checkpoint, without also pulling the XLM-R checkpoint from the Hugging Face Hub.

Motivation

Currently, all encoders in the COMET repo are loaded with XXModel.from_pretrained, which also downloads the model weights from the Hugging Face Hub. For end users who do not need to train a custom model, these weights are unnecessary: the COMET checkpoint already contains the fine-tuned encoder weights, which overwrite the freshly downloaded ones. Code reference:

https://github.com/Unbabel/COMET/blob/9a84de1e7efc9966822ad786f86c3b5514cf824d/comet/encoders/xlmr.py#L35-L41
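For context, the constructor at the linked lines looks roughly like this (paraphrased; the permalink above remains authoritative):

    def __init__(self, pretrained_model: str) -> None:
        super(Encoder, self).__init__()
        self.tokenizer = XLMRobertaTokenizer.from_pretrained(pretrained_model)
        # from_pretrained always fetches the full weight file from the Hub,
        # even though the COMET checkpoint will overwrite these weights.
        self.model = XLMRobertaModel.from_pretrained(
            pretrained_model, add_pooling_layer=False
        )
        self.model.encoder.output_hidden_states = True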

The code can be refactored with XXConfig.from_pretrained along these lines:

    # Requires: from comet.encoders.base import Encoder
    # and: from transformers import XLMRobertaConfig, XLMRobertaModel, XLMRobertaTokenizer
    def __init__(self, pretrained_model: str, load_pretrained_weights: bool = False) -> None:
        super(Encoder, self).__init__()
        self.tokenizer = XLMRobertaTokenizer.from_pretrained(pretrained_model)
        if load_pretrained_weights:
            # Training a new COMET model still needs the pretrained weights.
            self.model = XLMRobertaModel.from_pretrained(
                pretrained_model, add_pooling_layer=False
            )
        else:
            # Build the architecture from the (small) config file only; the
            # randomly initialized weights are later overwritten by the COMET
            # checkpoint, so no weight download is needed.
            self.model = XLMRobertaModel(
                XLMRobertaConfig.from_pretrained(pretrained_model),
                add_pooling_layer=False,
            )
        self.model.encoder.output_hidden_states = True

This avoids downloading the unneeded weights; only the small config file (and tokenizer files) are fetched.
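A minimal usage sketch, assuming the refactored signature above (the encoder class name and the COMET model name here are illustrative):

    # Only the config and tokenizer files are fetched from the Hub,
    # not the weight file (~2 GB for xlm-roberta-large).
    encoder = XLMREncoder("xlm-roberta-large", load_pretrained_weights=False)

    # The fine-tuned encoder weights then come from the COMET checkpoint,
    # e.g. via the library's usual entry points:
    #   from comet import download_model, load_from_checkpoint
    #   model = load_from_checkpoint(download_model("wmt20-comet-da"))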

Alternatives

A more user-friendly alternative would be to export the whole model to ONNX format, but that would require much more effort.
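For illustration, a bare-bones sketch of exporting just the encoder (hypothetical; COMET does not ship ONNX export, and the full model's pooling and estimator layers would need additional work):

    import torch

    # Dummy inputs used to trace the model; shapes are arbitrary because
    # dynamic_axes marks the batch and sequence dimensions as variable.
    input_ids = torch.ones(1, 8, dtype=torch.long)
    attention_mask = torch.ones(1, 8, dtype=torch.long)
    torch.onnx.export(
        encoder.model,  # the XLMRobertaModel from the snippet above
        (input_ids, attention_mask),
        "xlmr-encoder.onnx",
        input_names=["input_ids", "attention_mask"],
        output_names=["last_hidden_state"],
        dynamic_axes={
            "input_ids": {0: "batch", 1: "sequence"},
            "attention_mask": {0: "batch", 1: "sequence"},
        },
    )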

Additional context

None.

ricardorei commented 1 year ago

@K024 I refactored the code so it no longer downloads the XLM-R weights. Thanks for the issue.