Code not running on GPU

Lazy3valuation commented 6 months ago

It seems like the code is forced to run on CPU (sending my computer out of ram). If I output torch gpu available is says true, and it's using GPU, but the model still loads on CPU ram. Looking into the code, it seems that any additional config (as "device_map={"GPU": 0}" inside the GemmaConfig.from_pretrained) is ignored and not used by the modeling_gemma.py... Any advice?

Beomi commented 6 months ago

Just updated test code so that you can run test code on your cuda gpu! :)

Lazy3valuation commented 6 months ago

Hi! I pulled and.. it's still not working :-( With a fresh clone and install (using gemma-2b downloaded in local) the code correctly says "Torch Version: 2.2.2+cu121 CUDA: True" before printing config... But when the "model = GemmaForCausalLM(config)" is executed, the computer freezes or the program gets killed. Looking in the terminal with "top" command, the RAM quickly goes up (it takes about 60% ram before crash, out of 16GB ram), same for the CPU. Looking with nvidia watch command, the GPU ram is stable at 490MB used, not going up even a bit. I also noticed that the transformer tag written in the readme (b109257f4f) does not exist and the latest tag is "09f9f56" for the 4.39.3 transformer version... And you need root permission to install that with pip (that should not happen). Any idea why? If I run the model with AutoModelForCausalLM.from_pretrained, it correctly runs on GPU, but it does not use your file. Adding the params torch_dtype=torch.float16, low_cpu_mem_usage=True, device_map={"": 0}, in the GemmaConfig (even with BitsAndBytes quantization) still doesn't change anything.

Edit: as a GPU I have a 3060 12GB ram. With 8bit quantization I should be able to load the model twice (with infiniAttention setting and the clean model, before transfering weights), but even without quantization, I should at least be able to load a full model. But this does not happen.

Lazy3valuation commented 6 months ago

If needed, here's the output before crash:

Torch Version: 2.2.2+cu121
CUDA: True
GemmaConfig {
  "architectures": [
    "GemmaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 2,
  "eos_token_id": 1,
  "head_dim": 256,
  "hidden_act": "gelu",
  "hidden_activation": null,
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 16384,
  "max_position_embeddings": 8192,
  "memory_size": 2048,
  "model_type": "gemma",
  "num_attention_heads": 8,
  "num_hidden_layers": 18,
  "num_key_value_heads": 1,
  "pad_token_id": 0,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "segment_size": 16,
  "torch_dtype": "float16",
  "transformers_version": "4.39.3",
  "use_cache": false,
  "vocab_size": 256000
}

The outcome is the same, both with unmodified test_basic and with the quantization & device set try. Here's the modified one:

import os
from transformers import GemmaConfig, GemmaForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch.nn.functional as F

import torch

os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # TODO: set the GPU device

print("Torch Version:", torch.__version__)
print("CUDA:", torch.cuda.is_available())

if torch.cuda.is_available():
    device = "cuda:0"  # set GPU device using CUDA_VISIBLE_DEVICES
else:
    device = "cpu"

model_name = "gemma-2b"

bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    llm_int8_enable_fp32_cpu_offload=True
)

config = GemmaConfig.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map={"": 0},
)
config.memory_size = 2048
config.use_cache = False
config.segment_size = 16

print(config)

# Create the Gemma model with Infini-attention
model = GemmaForCausalLM(config) #The program crashes here

Lazy3valuation commented 6 months ago

Update: after a fresh install with a new virtual env, I installed the right transformer version without root permissions ("transformers_version": "4.40.0.dev0", copied from config's print). This still doesn't fix the RAM problem. I noticed that it happens even when using the huggingface id of gemma-2b without the local model, even before downloading.. so it might be something that happens before model download.

Beomi commented 6 months ago

https://huggingface.co/docs/accelerate/usage_guides/big_modeling

I think you'd better check this out :)

As you can see the model init is processed at system ram so it requires >10GB of free mem

Lazy3valuation commented 6 months ago

https://huggingface.co/docs/accelerate/usage_guides/big_modeling

I think you'd better check this out :)

As you can see the model init is processed at system ram so it requires >10GB of free mem

That's interesting, I wonder why it doesn't get loaded on normal ram when using AutoModelForCausalLM. Adding 4gb of virtual memory, I managed to load both models. But now GPU ram is the problem... No matter if the model is quantized. Here's what happens: The initial model (with 8 bit precision - seems like this setting is ignored) with eager attention is loaded into the RAM. It uses about 11 GB ram. The second model (with 8 bit precision) is loaded into GPU ram. It takes about 3.6 GB ram. The code loads the second model's weight into the first one. No ram change. I clean the second model with:

del pretrained_model
gc.collect()
torch.cuda.empty_cache()

Now GPU ram is back to 460 MB used RAM out of 12GB. The code gets to "model.to(device)". Now the CPU RAM is again 4GB occupied, while the GPU ram is 10.5GB occupied: the model has been successfully moved to GPU. This is very limited; almost any operation which increases GPU ram usage will make it go out of memory. I usually avoid this using quantization.

As expected, the code runs until it executes this line: outputs.loss.backward() Then... the GPU runs out of memory, allocating 2GB more and failing to allocare another 2GB. "torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.95 GiB. GPU 0 has a total capacity of 11.75 GiB of which 1.71 GiB is free."

Is there a solution? Does this code support quantization?

Beomi / InfiniTransformer

Code not running on GPU #6