artidoro / qlora

QLoRA: Efficient Finetuning of Quantized LLMs
https://arxiv.org/abs/2305.14314
MIT License

Saving/Loading qlora adapters #278

Open chrisi2045 opened 10 months ago

chrisi2045 commented 10 months ago

Hello, my situation is as follows:

I implemented a QLoRA adapter to use with LLMs (currently bloom-560m). It works fine so far: after fine-tuning I get over 90% accuracy on my task. However, after saving and loading the adapter, the accuracy drops to only slightly above chance.

I assume it is connected to the following behaviour, but I cannot tell why it happens. Right after creating the adapter for fine-tuning, model.print_trainable_parameters() prints: trainable params: 788,480 || all params: 560,005,120 || trainable%: 0.1407987126974839. After saving and loading the adapter, the same call prints: trainable params: 2,048 || all params: 560,005,120 || trainable%: 0.00036571094207138677. If I save the loaded adapter again (without further fine-tuning) and load it once more, it stays at this level.
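
For reference, a minimal sketch of how the adapter weights could be compared before and after the round trip (assuming `trained_model` is the PEFT model right after fine-tuning and `reloaded_model` is the model after saving and loading; both names are placeholders):

import torch
from peft import get_peft_model_state_dict

# Extract only the adapter weights (LoRA matrices plus modules_to_save) from each model.
before = get_peft_model_state_dict(trained_model)
after = get_peft_model_state_dict(reloaded_model)

# Keys present in one state dict but not the other point at layers that were
# not saved or not re-attached after loading.
print("missing after reload:", sorted(set(before) - set(after)))
print("new after reload:", sorted(set(after) - set(before)))

# For keys present in both, check whether the tensors actually match.
for name in set(before) & set(after):
    if not torch.allclose(before[name].float().cpu(), after[name].float().cpu()):
        print("mismatch:", name)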

Many thanks in advance, Christian

Here is my code for creating the adapter...

import torch
from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# n_labels and device are defined earlier in the script.
model_name_or_path = 'bigscience/bloom-560m'

# 4-bit NF4 quantization with double quantization and bfloat16 compute.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Get model configuration.
model_config = AutoConfig.from_pretrained(pretrained_model_name_or_path=model_name_or_path, num_labels=n_labels)

# Get model's tokenizer.
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_name_or_path)
# Default to left padding.
tokenizer.padding_side = "left"
# Use the EOS token as the PAD token.
tokenizer.pad_token = tokenizer.eos_token

# Get the actual model.
model = AutoModelForSequenceClassification.from_pretrained(
    pretrained_model_name_or_path=model_name_or_path,
    config=model_config,
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
    quantization_config=bnb_config  # load_in_4bit is already set in bnb_config
)

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

#PEFT
print("loading to LoRA")
peft_config = LoraConfig(task_type="SEQ_CLS", inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

# resize model embedding to match new tokenizer and fix model padding token id
model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = model.config.eos_token_id
model.to(device)

...and saving it.

global model
model.save_pretrained("adapters/lora_adapte_v2")
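
For completeness, a small sketch of how the contents of the saved adapter directory could be inspected (the weight file name depends on the installed peft version, so both the .bin and .safetensors cases are handled; the path is the one used above):

import os
import torch

adapter_dir = "adapters/lora_adapte_v2"
print(os.listdir(adapter_dir))  # expect adapter_config.json plus the adapter weight file

# Load the saved adapter weights and list the tensor names and shapes.
sft_path = os.path.join(adapter_dir, "adapter_model.safetensors")
bin_path = os.path.join(adapter_dir, "adapter_model.bin")
if os.path.exists(sft_path):
    from safetensors.torch import load_file
    state = load_file(sft_path)
else:
    state = torch.load(bin_path, map_location="cpu")

print(len(state), "tensors saved")
for name in sorted(state):
    print(name, tuple(state[name].shape))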

And here is the code for loading it again:

import torch
from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel, prepare_model_for_kbit_training

model_name_or_path = 'bigscience/bloom-560m'
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

model_config = AutoConfig.from_pretrained(pretrained_model_name_or_path=model_name_or_path, num_labels=n_labels)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_name_or_path)
# Default to left padding.
tokenizer.padding_side = "left"
# Use the EOS token as the PAD token.
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForSequenceClassification.from_pretrained(
    pretrained_model_name_or_path=model_name_or_path,
    config=model_config,
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
    quantization_config=bnb_config  # load_in_4bit is already set in bnb_config
)
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

# load_lora_name holds the path to the saved adapter directory.
adapter = load_lora_name
model = PeftModel.from_pretrained(model, adapter)

model.print_trainable_parameters()

model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = model.config.eos_token_id

model.to(device)
print('Model loaded to `%s`'%device)
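
In case it is relevant, a minimal sketch of a variant load step (assuming the installed peft version supports the is_trainable argument of PeftModel.from_pretrained, which by default loads the adapter in inference mode with its parameters frozen; base_model stands for the quantized base model prepared as above):

from peft import PeftModel

# Load the adapter with its parameters left trainable instead of in inference mode.
model = PeftModel.from_pretrained(
    base_model,       # quantized base model after prepare_model_for_kbit_training
    load_lora_name,   # path to the saved adapter directory
    is_trainable=True,
)
model.print_trainable_parameters()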