[QUESTION] OOM when loading XCOMET-XXL on an A100 with 40 GB memory for prediction #213

Nie-Yingying commented 2 months ago

I can predict scores successfully on CPU only. But when I load the model onto the GPU, I get an OOM error.


from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/XCOMET-XXL")
model_path = "./XCOMET-XXL/checkpoints/model.ckpt"
model = load_from_checkpoint(model_path, reload_hparams=True)
data = [
    {
        "src": "Boris Johnson teeters on edge of favour with Tory MPs",
        "mt": "Boris Johnson ist bei Tory-Abgeordneten völlig in der Gunst",
        "ref": "Boris Johnsons Beliebtheit bei Tory-MPs steht auf der Kippe"
    }
]
model_output = model.predict(data, batch_size=1, gpus=1)

# Segment-level scores
print(model_output.scores)

# System-level score
print(model_output.system_score)

# Score explanation (error spans)
print(model_output.metadata.error_spans)

[image: hparams.yaml]

ricardorei commented 2 months ago

Hi @Nie-Yingying!

I have a suggestion for running XCOMET-XXL on a 40GB GPU, but it's not integrated yet. In the file comet/encoders/xlmr_xl.py:

Replace the model init so that the model loads in 16 bits:

def __init__(
    self, pretrained_model: str, load_pretrained_weights: bool = True
) -> None:
    super(Encoder, self).__init__()
    self.tokenizer = XLMRobertaTokenizerFast.from_pretrained(pretrained_model)
    if load_pretrained_weights:
        self.model = XLMRobertaXLModel.from_pretrained(
            pretrained_model, add_pooling_layer=False
        )
    else:
        print("Loading model in f16")
        # Requires `import torch` at the top of the file.
        self.model = XLMRobertaXLModel(
            XLMRobertaXLConfig.from_pretrained(
                pretrained_model, torch_dtype=torch.float16, device_map="auto"
            ),
            add_pooling_layer=False,
        )
    self.model.encoder.output_hidden_states = True
ricardorei commented 2 months ago

This will load the model using half the memory and should solve your problem. I'll integrate this soon.
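To see why 16-bit weights matter on a 40GB card, here is a rough back-of-the-envelope sketch. The parameter count of 10.7 billion is an assumption taken from memory of the model card, not from this thread, and the estimate covers weights only (no activations, CUDA context, or other buffers).

```python
# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weights_gib(num_params: float, dtype: str) -> float:
    """Approximate size of the model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


# Assumption: XCOMET-XXL has about 10.7 billion parameters.
NUM_PARAMS = 10.7e9

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weights_gib(NUM_PARAMS, dtype):.1f} GiB")
```

Under that assumption the float32 weights alone come to roughly 40 GiB, which already fills the card, while float16 needs about half of that and leaves room for activations.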

vince62s commented 2 months ago

@ricardorei I did something very similar for the XL model. I converted it to fp16 and then changed just one line in feedforward.py. After that I wanted to go even further and use bitsandbytes via HF's load_in_8bit / load_in_4bit = True, but the integration between Lightning and HF is a mess. Finally, FYI, I adapted your code to the existing HF XLM-RoBERTa-XL code as a WIP: https://huggingface.co/vince62s/wmt23-cometkiwi-da-roberta-xl We are trying to implement it in CTranslate2 for much faster inference.
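For reference, a minimal sketch of what 8-bit or 4-bit loading looks like on the plain Hugging Face side, outside of COMET and Lightning. The helper names `quant_flags` and `load_quantized_encoder` are hypothetical, and the checkpoint name in the usage comment is assumed to be the base encoder rather than the COMET checkpoint. This does not solve the Lightning integration problem described above. It only shows the HF call itself, which needs a CUDA GPU plus the transformers, accelerate and bitsandbytes packages.

```python
def quant_flags(bits: int) -> dict:
    """Map a bit width to the matching BitsAndBytesConfig flag."""
    if bits == 8:
        return {"load_in_8bit": True}
    if bits == 4:
        return {"load_in_4bit": True}
    raise ValueError(f"unsupported bit width: {bits}")


def load_quantized_encoder(name: str, bits: int = 8):
    """Load a Hugging Face encoder with bitsandbytes quantization."""
    # Imported lazily so quant_flags stays usable without these packages.
    from transformers import AutoModel, BitsAndBytesConfig

    config = BitsAndBytesConfig(**quant_flags(bits))
    return AutoModel.from_pretrained(
        name, quantization_config=config, device_map="auto"
    )


# Hypothetical usage (assumed base encoder, not the COMET checkpoint):
# encoder = load_quantized_encoder("facebook/xlm-roberta-xxl", bits=8)
```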

Nie-Yingying commented 2 months ago

Sorry to say, it's still OOM. [image]