bitsandbytes-foundation / bitsandbytes

Accessible large language models via k-bit quantization for PyTorch.
https://huggingface.co/docs/bitsandbytes/main/en/index
MIT License

Error out of memory at line 380 in file /mmfs1/gscratch/zlab/timdettmers/git/bitsandbytes/csrc/pythonInterface.c #959

Open johnDonor opened 8 months ago

johnDonor commented 8 months ago

System Info

Reproduction

  1. I accessed my own remote local machine via AnyDesk.
  2. Some warning messages appear, such as “You’re using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the call method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.” and “The input hidden states seems to be silently casted in float32, this might be related to the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in torch.float16.”
  3. At first I used a Windows environment, not Linux. In the beginning training was successful with no crash, and the model saved correctly after training. Then I added target modules to the LoRA config; training still went well, but it crashed right after training finished. I thought that was caused by the separately compiled Windows build of bitsandbytes, so I set up a WSL2 Linux environment and ran the code there, but this time training doesn't even start because of the following error: "Error out of memory at line 380 in file /mmfs1/gscratch/zlab/timdettmers/git/bitsandbytes/csrc/pythonInterface.c"
  4. I didn't add any custom code of my own; I only used Hugging Face library code for training.
  5. I wonder why OOM happens even though more than 100 GB of shared GPU memory is available. Training only uses 3 GB/128 GB of shared GPU memory while using 23.5 GB/24 GB of dedicated GPU memory. Also, earlier in the project, why did training go well while the saving step didn't work properly? Is more memory required when saving the model? (See the memory-check sketch right after this list.)
  6. Given that the whole process worked at the beginning, I think my hardware can support this training. Am I wrong?
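
For reference, here is a minimal sketch of how the actual free dedicated VRAM could be checked right before training. This snippet is not part of my training script; it assumes a single CUDA device with index 0 and only uses standard PyTorch memory queries.

import torch

# Sketch: check free dedicated VRAM on CUDA device 0 (assumption: single-GPU setup).
# torch.cuda.mem_get_info returns (free_bytes, total_bytes) for the device itself;
# the Windows "shared GPU memory" pool (system RAM) is not part of these numbers.
free_bytes, total_bytes = torch.cuda.mem_get_info(0)
print(f"dedicated VRAM: {free_bytes / 2**30:.1f} GiB free of {total_bytes / 2**30:.1f} GiB")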

Here is the code.

import transformers
from transformers import (BitsAndBytesConfig, AutoModelForCausalLM, AutoTokenizer, GenerationConfig, TrainingArguments, logging)
import torch
import os
from datasets import load_dataset, concatenate_datasets
import json
from peft import LoraConfig
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

def main():
    base_model = "mistralai/Mistral-7B-Instruct-v0.2"
    new_model = "Mistral-7B-Instruct-v0.2_newmodel"

    compute_dtype = getattr(torch, "float16")

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=False
    )

    tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code = True, padding_side = "right")
    model = AutoModelForCausalLM.from_pretrained(base_model, quantization_config = quant_config, attn_implementation = "flash_attention_2", device_map = {"": 0})
    model.config.use_cache = False
    model.config.pretraining_tp = 1

    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    train_dataset = load_dataset('json', data_files = './dataset/mixed_train.json', split = 'train')
    eval_dataset = load_dataset('json', data_files = './dataset/mixed_val.json', split = 'train')
    print(f"train dataset size: {len(train_dataset)}, eval dataset size: {len(eval_dataset)}")

    training_params = TrainingArguments(
        output_dir="./FT_newmodel",
        num_train_epochs=1,
        per_device_train_batch_size=2,
        per_device_eval_batch_size=1,
        evaluation_strategy='steps',
        eval_steps=25,
        gradient_accumulation_steps=4,
        optim="paged_adamw_32bit",
        logging_steps=25,
        learning_rate=2e-5,
        weight_decay=0.001,
        fp16=False,
        bf16=False,
        max_grad_norm=0.3,
        max_steps=-1,
        warmup_ratio=0.03,
        group_by_length=True,
        lr_scheduler_type="constant",
        report_to="tensorboard"
    )

    peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.1,
        r=64,
        bias="none",
        target_modules=[
            "q_proj",
            "k_proj",
            "v_proj",
            "o_proj",
            "gate_proj",
            "up_proj",
            "down_proj",
            "lm_head",
        ],
        task_type="CAUSAL_LM"
    )

    trainer = SFTTrainer(
        model = model,
        tokenizer = tokenizer,
        train_dataset = train_dataset,
        eval_dataset = eval_dataset,
        dataset_text_field= "text",
        args = training_params, 
        peft_config = peft_config,
        max_seq_length = 512,
        packing = False,
        neftune_noise_alpha = 5
    )

    trainer.train()
    trainer.model.save_pretrained(new_model)
    trainer.tokenizer.save_pretrained(new_model)

if __name__ == "__main__":
    print("Training starts")
    main()
    print("Training ended")

Expected behavior

I just want answers to the questions in the Reproduction section above and a way to solve this problem.

yifan1130 commented 7 months ago

How did you solve it?