bitsandbytes-foundation / bitsandbytes

Accessible large language models via k-bit quantization for PyTorch.
https://huggingface.co/docs/bitsandbytes/main/en/index
MIT License

Error out of memory at line 380 in file /mmfs1/gscratch/zlab/timdettmers/git/bitsandbytes/csrc/pythonInterface.c #959

Open johnDonor opened 8 months ago

johnDonor commented 8 months ago

System Info

Reproduction

  1. I accessed my own remote local machine via AnyDesk.
  2. Some warning messages appear, such as “You’re using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the call method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.” and “The input hidden states seems to be silently casted in float32, this might be related to the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in torch.float16.”
  3. At first I used a Windows environment, not Linux. In the beginning training was successful with no crash, and the model saved correctly after training. Then I added target modules to the LoRA config; training still went well, but it crashed right after training finished. I thought that was caused by the separately compiled Windows build of bitsandbytes, so I set up a WSL2 Linux environment and ran the code there, but this time training doesn't even start because of the following error: "Error out of memory at line 380 in file /mmfs1/gscratch/zlab/timdettmers/git/bitsandbytes/csrc/pythonInterface.c"
  4. I didn't add any custom code of my own; I only used Hugging Face library code for training.
  5. I wonder why OOM happens even though more than 100 GB of shared GPU memory is available. Training only uses 3 GB/128 GB of shared GPU memory while using 23.5 GB/24 GB of dedicated GPU memory. Also, earlier in the project, why did training go well while the saving step didn't work properly? Is more memory required when saving the model? (See the memory-check sketch right after this list.)
  6. Given that the whole process worked at the beginning, I think my hardware can support this training. Am I wrong?
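
For reference, here is a minimal sketch of how the actual free dedicated VRAM could be checked right before training. This snippet is not part of my training script; it assumes a single CUDA device with index 0 and only uses standard PyTorch memory queries.

import torch

# Sketch: check free dedicated VRAM on CUDA device 0 (assumption: single-GPU setup).
# torch.cuda.mem_get_info returns (free_bytes, total_bytes) for the device itself;
# the Windows "shared GPU memory" pool (system RAM) is not part of these numbers.
free_bytes, total_bytes = torch.cuda.mem_get_info(0)
print(f"dedicated VRAM: {free_bytes / 2**30:.1f} GiB free of {total_bytes / 2**30:.1f} GiB")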

Here is the code.

import transformers
from transformers import (BitsAndBytesConfig, AutoModelForCausalLM, AutoTokenizer, GenerationConfig, TrainingArguments, logging)
import torch
import os
from datasets import load_dataset, concatenate_datasets
import json
from peft import LoraConfig
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

def main():
    base_model = "mistralai/Mistral-7B-Instruct-v0.2"
    new_model = "Mistral-7B-Instruct-v0.2_newmodel"

    compute_dtype = getattr(torch, "float16")

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=False
    )

    tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code = True, padding_side = "right")
    model = AutoModelForCausalLM.from_pretrained(base_model, quantization_config = quant_config, attn_implementation = "flash_attention_2", device_map = {"": 0})
    model.config.use_cache = False
    model.config.pretraining_tp = 1

    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    train_dataset = load_dataset('json', data_files = './dataset/mixed_train.json', split = 'train')
    eval_dataset = load_dataset('json', data_files = './dataset/mixed_val.json', split = 'train')
    print(f"train dataset size: {len(train_dataset)}, eval dataset size: {len(eval_dataset)}")

    training_params = TrainingArguments(
        output_dir="./FT_newmodel",
        num_train_epochs=1,
        per_device_train_batch_size=2,
        per_device_eval_batch_size=1,
        evaluation_strategy='steps',
        eval_steps=25,
        gradient_accumulation_steps=4,
        optim="paged_adamw_32bit",
        logging_steps=25,
        learning_rate=2e-5,
        weight_decay=0.001,
        fp16=False,
        bf16=False,
        max_grad_norm=0.3,
        max_steps=-1,
        warmup_ratio=0.03,
        group_by_length=True,
        lr_scheduler_type="constant",
        report_to="tensorboard"
    )

    peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.1,
        r=64,
        bias="none",
        target_modules=[
            "q_proj",
            "k_proj",
            "v_proj",
            "o_proj",
            "gate_proj",
            "up_proj",
            "down_proj",
            "lm_head",
        ],
        task_type="CAUSAL_LM"
    )

    trainer = SFTTrainer(
        model = model,
        tokenizer = tokenizer,
        train_dataset = train_dataset,
        eval_dataset = eval_dataset,
        dataset_text_field= "text",
        args = training_params, 
        peft_config = peft_config,
        max_seq_length = 512,
        packing = False,
        neftune_noise_alpha = 5
    )

    trainer.train()
    trainer.model.save_pretrained(new_model)
    trainer.tokenizer.save_pretrained(new_model)

if __name__ == "__main__":
    print("Training starts")
    main()
    print("Training ended")

Expected behavior

I just want answers to the questions in the Reproduction section above and a way to solve this problem.

yifan1130 commented 7 months ago

How did you solve it?