bitsandbytes-foundation / bitsandbytes

Accessible large language models via k-bit quantization for PyTorch.
https://huggingface.co/docs/bitsandbytes/main/en/index
MIT License

Training NMT models? #99

Closed. ymoslem closed this issue 1 year ago.

ymoslem commented 2 years ago

Hello! Thanks, Tim! I tried bitsandbytes for language models like BLOOM, and it works well.

I have a question about NMT models like NLLB, M2M, mBART, or OPUS. I tried inference for NLLB, and apparently it is not supported. Are any of these models supported for inference, and especially for fine-tuning?

Many thanks!

younesbelkada commented 2 years ago

Hi @ymoslem, NLLB should be supported. Could you make sure you are using the latest version of transformers (pip install --upgrade transformers)? Could you also share the code snippet you are using to load NLLB in 8-bit? Thanks!
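For a quick check of which versions are actually installed in the environment, something like this works (standard library only):

from importlib.metadata import version

# print the installed version of each relevant package
for pkg in ("transformers", "bitsandbytes", "accelerate"):
    print(pkg, version(pkg))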

ymoslem commented 2 years ago

Thanks, @younesbelkada! It now works with transformers==4.24.0, bitsandbytes==0.35.4, and accelerate==0.14.0 after running:

!pip3 install --upgrade transformers bitsandbytes accelerate

However, there are a few observations:

1. It is not faster than float16. When loading nllb-200-distilled-600M with load_in_8bit=True, it takes 17.1 seconds, while with torch_dtype=torch.float16 it takes 17.9 seconds.
2. When adding int8_threshold=2.0, I got an "unexpected keyword argument" error. It seems that AutoModelForSeq2SeqLM does not support it.
3. GPU consumption seems similar in both cases: 4871MB with 8-bit and 4151MB with float16.

The aforementioned results are on an NVIDIA RTX A4000 GPU. I also tried on Google Colab with a Tesla T4 and reached similar conclusions, and I also tried the facebook/nllb-200-3.3B model.

8bit

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

src_lang = "eng_Latn"
tgt_lang = "spa_Latn"

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", src_lang=src_lang)
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M",
                                              low_cpu_mem_usage=True,
                                              device_map="auto",
                                              load_in_8bit=True)

source_text = 'Chinese clinical trials in Wuhan and Shenzhen claimed to show that favipiravir was "clearly effective".'
inputs = tokenizer(source_text, return_tensors="pt").to("cuda")

translated_tokens = model.generate(
        **inputs, forced_bos_token_id=tokenizer.lang_code_to_id[tgt_lang], max_length=30
)
tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]

float16

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

src_lang = "eng_Latn"
tgt_lang = "spa_Latn"

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", src_lang=src_lang)
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M",
                                              torch_dtype=torch.float16,
                                              low_cpu_mem_usage=True)
model = model.half()  # redundant with torch_dtype=torch.float16, but harmless
model.to("cuda")

source_text = 'Chinese clinical trials in Wuhan and Shenzhen claimed to show that favipiravir was "clearly effective".'
inputs = tokenizer(source_text, return_tensors="pt").to("cuda")

translated_tokens = model.generate(
        **inputs, forced_bos_token_id=tokenizer.lang_code_to_id[tgt_lang], max_length=30
)
tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
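For timing the two variants end to end, a small sketch like the following can be wrapped around the generate() call (an illustration only, not the exact harness used for the numbers reported above):

import time
import torch

torch.cuda.synchronize()  # make sure pending GPU work is finished before timing
start = time.perf_counter()

translated_tokens = model.generate(
    **inputs, forced_bos_token_id=tokenizer.lang_code_to_id[tgt_lang], max_length=30
)

torch.cuda.synchronize()  # wait until generation has actually finished on the GPU
print(f"generation took {time.perf_counter() - start:.2f} s")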
younesbelkada commented 2 years ago

Hi @ymoslem, thanks a lot for your message!

1. It is not faster than float16. When loading nllb-200-distilled-600M with load_in_8bit=True, it takes 17.1 seconds, while with torch_dtype=torch.float16 it takes 17.9 seconds.

Yes, this is expected: the 8-bit model is currently slower than the fp16 model because the 8-bit quantization is done in two stages (a mixed int8/fp16 decomposition). You can read more about this in the 8-bit integration blog post.
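For intuition, here is a rough pure-PyTorch sketch of that two-stage idea. It only illustrates the int8/fp16 decomposition; the function name, the absmax scaling, and the threshold value are simplifications and not the actual bitsandbytes kernels:

import torch

def int8_fp16_matmul_sketch(x, w, threshold=6.0):
    # Columns of x with any activation above `threshold` are treated as
    # outliers and kept in fp16; everything else goes through int8.
    outliers = (x.abs() > threshold).any(dim=0)

    # stage 1: fp16 matmul for the outlier features
    out_fp16 = x[:, outliers] @ w[outliers, :]

    # stage 2: int8 matmul for the remaining features (simple absmax scaling
    # here; the real kernels use finer-grained, vector-wise scaling)
    x_sub, w_sub = x[:, ~outliers], w[~outliers, :]
    sx = x_sub.abs().max().clamp(min=1e-8) / 127
    sw = w_sub.abs().max().clamp(min=1e-8) / 127
    x_i8 = (x_sub / sx).round().clamp(-127, 127).to(torch.int8)
    w_i8 = (w_sub / sw).round().clamp(-127, 127).to(torch.int8)
    out_int8 = (x_i8.float() @ w_i8.float()) * (sx * sw)

    # two matmuls instead of one, which is where the extra latency comes from
    return out_fp16 + out_int8.to(x.dtype)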

2. When adding int8_threshold=2.0, I got an "unexpected keyword argument" error. It seems that AutoModelForSeq2SeqLM does not support it.

Yes, please use load_in_8bit_threshold instead. Could you point me to the place where you read that you should use int8_threshold? Maybe the documentation has not been updated.
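For reference, with the transformers version used above the threshold is passed straight to from_pretrained; the 6.0 value below is only an example (it is the usual default for the outlier threshold):

from transformers import AutoModelForSeq2SeqLM

model_8bit = AutoModelForSeq2SeqLM.from_pretrained(
    "facebook/nllb-200-distilled-600M",
    device_map="auto",
    load_in_8bit=True,
    load_in_8bit_threshold=6.0,  # the argument formerly referred to as int8_threshold
)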

3. GPU consumption seems similar in both cases: 4871MB with 8-bit and 4151MB with float16.

Could you share how you measure that? Note that the memory saving between the fp16 and int8 models really depends on the model size: for nllb-600M you get a memory footprint saving factor of about 1.18x, for the 3.3B model about 1.41x, and the saving grows with the size of the model. You can check that with this snippet:

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

src_lang = "eng_Latn"
tgt_lang = "spa_Latn"

model_id = "facebook/nllb-200-3.3B"

model = AutoModelForSeq2SeqLM.from_pretrained(model_id, device_map="auto", load_in_8bit=False, torch_dtype=torch.float16)
model_8bit = AutoModelForSeq2SeqLM.from_pretrained(model_id, device_map="auto", load_in_8bit=True)
print(model.get_memory_footprint() / model_8bit.get_memory_footprint())

ymoslem commented 2 years ago

Many thanks, @younesbelkada for the detailed explanation!

Note that the memory saving between the fp16 and int8 models really depends on the model size: for nllb-600M you get a memory footprint saving factor of about 1.18x, for the 3.3B model about 1.41x, and the saving grows with the size of the model.

I can confirm this result on Google Colab when loading the model only.

What I was trying earlier (and reported in the previous reply) was running everything up to and including the translation, and then checking nvidia-smi.
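If it is useful, the peak memory PyTorch actually allocates during translation (weights plus activations and cache, which is closer to what nvidia-smi shows than get_memory_footprint()) can be read directly; the lines below are a small sketch around the generation call from earlier:

import torch

torch.cuda.reset_peak_memory_stats()

translated_tokens = model.generate(
    **inputs, forced_bos_token_id=tokenizer.lang_code_to_id[tgt_lang], max_length=30
)

# peak memory allocated by PyTorch on the current device, in MB
print(torch.cuda.max_memory_allocated() / 1024**2)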

Yes, please use load_in_8bit_threshold instead. Could you point me to the place where you read that you should use int8_threshold? Maybe the documentation has not been updated.

I think the documentation is updated, but maybe there is an old notebook that appears in search.


On a related note, is there any value in using bitsandbytes for fine-tuning?

I am using this code, but it gives the following error:

from transformers import TrainingArguments, Trainer, logging
from torch.utils.checkpoint import checkpoint

training_args = TrainingArguments(
    output_dir="run",
    num_train_epochs=40,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,
    eval_accumulation_steps=4,
    gradient_checkpointing=True,
    evaluation_strategy="epoch"
)

trainer = Trainer(
    model=model,  # the NLLB model loaded above in 8-bit with load_in_8bit=True
    args=training_args,
    train_dataset=tokenized_finetune,
    eval_dataset=tokenized_validate,
)
trainer.train()

Exception: State must contain either CBt or CB matrix for backward

Thanks again!

younesbelkada commented 2 years ago

Hi @ymoslem, thanks a lot for your message! Indeed, it is not possible for now to train any 8-bit model using transformers. We are currently looking into whether we can apply LoRA (Low-Rank Adapters) on 8-bit models using transformers, but it is still under discussion. We'll keep you posted!
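To make the idea concrete, here is a rough sketch of what LoRA adapters on top of a frozen 8-bit model could look like. The peft package and the function and module names below are assumptions used for illustration, not something that was available at the time of this exchange:

from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

# load the base model in 8-bit as before
model = AutoModelForSeq2SeqLM.from_pretrained(
    "facebook/nllb-200-distilled-600M",
    device_map="auto",
    load_in_8bit=True,
)

# freeze the int8 weights and prepare the model for training
model = prepare_model_for_int8_training(model)

# small trainable low-rank adapters on the attention projections
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="SEQ_2_SEQ_LM",
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# the wrapped model can then be passed to the Trainer setup shown earlier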