Luodian / Otter

🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.
https://otter-ntu.github.io/
MIT License

Quantized model to run on a single 3090 #148

Closed MichaelDoron closed 1 year ago

MichaelDoron commented 1 year ago

Great work!

Are there plans to quantize the model, to allow it to run on a single 3090 GPU?

Thanks!

Luodian commented 1 year ago

Yes, sure. We already have an fp16 model that costs about 16GB of memory, though performance is slightly affected.

@cliangyu
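For a rough sense of where the 16GB figure comes from: the weights dominate memory, and halving the bytes per parameter halves the footprint. A minimal back-of-the-envelope sketch, assuming a ~9B parameter count (the OpenFlamingo-9B scale Otter builds on; the exact count is an assumption here), counting weights only (activations and KV cache excluded):

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}

def weight_memory_gib(n_params: float, dtype: str) -> float:
    # Memory for the weights alone, in GiB.
    return n_params * BYTES_PER_PARAM[dtype] / 2**30

n = 9e9  # assumed parameter count for illustration
print(f"fp32: {weight_memory_gib(n, 'fp32'):.1f} GiB")  # ≈ 33.5 GiB
print(f"fp16: {weight_memory_gib(n, 'fp16'):.1f} GiB")  # ≈ 16.8 GiB
```

At fp16 the weights alone land just under 17 GiB, consistent with the 16GB figure above and within a 3090's 24GB, leaving headroom for activations.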

MichaelDoron commented 1 year ago

That's great, I didn't realize the fp16 model existed! Could you point me to its link? I couldn't find it. I understand the performance will be slightly worse, but I'd like to try it locally if possible.

cliangyu commented 1 year ago

We have not officially released an fp16 model, but you can create one in a minute with the code below.

from otter import OtterForConditionalGeneration
import torch

load_bit = "fp16"

# map the requested precision to a torch dtype
if load_bit == "fp16":
    precision = {"torch_dtype": torch.float16}
elif load_bit == "bf16":
    precision = {"torch_dtype": torch.bfloat16}
else:
    raise ValueError(f"unsupported load_bit: {load_bit}")

checkpoint_path = "your_ckpt_path"  # local path or Hugging Face model id
model = OtterForConditionalGeneration.from_pretrained(checkpoint_path, device_map="auto", **precision)

# save the converted model, with the precision appended to the directory name
checkpoint_path = f"{checkpoint_path}_{load_bit}"
model.save_pretrained(checkpoint_path)
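On the "performance slightly affected" point: fp16 keeps only 10 mantissa bits, so every stored weight is rounded to roughly three significant decimal digits, and values above 2048 lose integer resolution entirely. A quick stdlib-only illustration of the round-trip error (no torch needed; `struct`'s `'e'` format is IEEE 754 half precision):

```python
import struct

def to_fp16(x: float) -> float:
    # Round-trip a Python float (fp64) through IEEE 754 half precision.
    return struct.unpack('e', struct.pack('e', x))[0]

print(to_fp16(0.1))     # 0.0999755859375 — nearest representable fp16 value
print(to_fp16(2049.0))  # 2048.0 — above 2048, fp16 can only step in twos
```

Each weight individually shifts only slightly, which is why the quality drop in practice is small rather than catastrophic.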