obriensystems opened this issue 4 months ago
CUDA 12.5 on an A6000 - updating transformers:
pip install git+https://github.com/huggingface/transformers.git
Successfully built transformers
Installing collected packages: huggingface-hub, transformers
Attempting uninstall: huggingface-hub
Found existing installation: huggingface-hub 0.22.2
Uninstalling huggingface-hub-0.22.2:
Successfully uninstalled huggingface-hub-0.22.2
Attempting uninstall: transformers
Found existing installation: transformers 4.40.0
Uninstalling transformers-4.40.0:
Successfully uninstalled transformers-4.40.0
Successfully installed huggingface-hub-0.23.4 transformers-4.43.0.dev0
[notice] A new release of pip is available: 24.0 -> 24.1.1
[notice] To update, run: python.exe -m pip install --upgrade pip
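A quick sanity check of the new install (a minimal sketch; assumes a CUDA-enabled PyTorch is already present):

# Confirm the source build of transformers and that CUDA is visible to PyTorch.
import torch
import transformers

print(transformers.__version__)                      # expect 4.43.0.dev0 after the install above
print(torch.__version__, torch.version.cuda)         # CUDA 12.x build
print(torch.cuda.is_available(), torch.cuda.get_device_name(0))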
working
michael@14900c MINGW64 /c/wse_github/obrienlabsdev/machine-learning/environments/windows/src/google-gemma (main)
$ python gemma-gpu.py
Downloading shards: 0%| | 0/24 [00:00<?, ?it/s]
Hugging Face download started at 14:15 on a 1 Gbps link - initial ETA 7h.
24 shards x ~5 GB = ~120 GB. Likely will only run in CPU RAM - min 192 GB.
Watch for the flash attention warning.
GPU + CPU vs CPU only.
Loading checkpoint shards: 100%|##########| 24/24 [00:14<00:00, 1.65it/s]
WARNING:root:Some parameters are on the meta device device because they were offloaded to the cpu.
C:\opt\Python312\Lib\site-packages\transformers\models\gemma2\modeling_gemma2.py:577: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:263.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
Fix: https://huggingface.co/google/gemma-2-27b#other-optimizations
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
+   attn_implementation="flash_attention_2"
).to(0)
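Applied to the loader in gemma-gpu.py, that change would look roughly like the sketch below (assumes the flash-attn package is installed; model_id and access_token stand in for the values used elsewhere in this issue):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-27b"   # assumption: the model from this run
access_token = "hf_..."           # redacted, as elsewhere in this issue

tokenizer = AutoTokenizer.from_pretrained(model_id, token=access_token)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # requires `pip install flash-attn`
    token=access_token,
).to(0)  # as in the model-card snippet; 27B fp16 weights (~54 GB) may still need CPU offload on a 48 GB card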
84 GB RAM + 47 GB VRAM = 131 GB (vs. ~120 GB actual on CPU only)
192 GB RAM - 13900K with dual RTX 4090s
$ python gemma-gpu.py
model.safetensors.index.json: 100%|█████████████████████████████████████████████████████████████████████| 42.8k/42.8k [00:00<00:00, 6.00MB/s]
C:\Users\michael\AppData\Roaming\Python\Python311\site-packages\huggingface_hub\file_download.py:157: UserWarning: `huggingface_hub` cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in C:\Users\michael\.cache\huggingface\hub\models--google--gemma-2-27b. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the `HF_HUB_DISABLE_SYMLINKS_WARNING` environment variable. For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
warnings.warn(message)
model-00001-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.95G/4.95G [00:44<00:00, 111MB/s]
model-00002-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.53G/4.53G [00:41<00:00, 109MB/s]
model-00003-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.53G/4.53G [00:41<00:00, 110MB/s]
model-00004-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.53G/4.53G [00:40<00:00, 111MB/s]
model-00005-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.53G/4.53G [00:40<00:00, 111MB/s]
model-00006-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.53G/4.53G [00:40<00:00, 111MB/s]
model-00007-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.53G/4.53G [00:41<00:00, 108MB/s]
model-00008-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.53G/4.53G [00:41<00:00, 108MB/s]
model-00009-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.53G/4.53G [00:41<00:00, 108MB/s]
model-00010-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.53G/4.53G [00:41<00:00, 108MB/s]
model-00011-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.53G/4.53G [00:41<00:00, 109MB/s]
model-00012-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.53G/4.53G [00:41<00:00, 109MB/s]
model-00013-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.53G/4.53G [00:42<00:00, 107MB/s]
model-00014-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.53G/4.53G [00:41<00:00, 110MB/s]
model-00015-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.53G/4.53G [00:41<00:00, 109MB/s]
model-00016-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.53G/4.53G [00:41<00:00, 110MB/s]
model-00017-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.53G/4.53G [00:42<00:00, 108MB/s]
model-00018-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.53G/4.53G [00:41<00:00, 109MB/s]
model-00019-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.53G/4.53G [00:41<00:00, 108MB/s]
model-00020-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.53G/4.53G [00:41<00:00, 109MB/s]
model-00021-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.53G/4.53G [00:41<00:00, 109MB/s]
model-00022-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.53G/4.53G [00:41<00:00, 109MB/s]
model-00023-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.53G/4.53G [00:43<00:00, 103MB/s]
model-00024-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.30G/4.30G [00:39<00:00, 108MB/s]
Downloading shards: 100%|████████████████████████████████████████████████████████████████████████████████████| 24/24 [16:44<00:00, 41.84s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████| 24/24 [00:26<00:00, 1.10s/it]
generation_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████| 173/173 [00:00<?, ?B/s]
genarate start: 15:56:38
<bos>how is gold made in collapsing neutron stars - specifically what is the ratio created during the beta and r process.
Answer:
Step 1/2
1. Gold is not created in collapsing neutron stars. Gold is created in the process of supernovae.
Step 2/2
2. The process of gold formation is called the r-process, and it occurs during the r-process.
2.1. The r-process is a process that occurs during the r-process.<eos>
end 16:00:30
Gemma 2 9B - 42 GB
Lenovo P17 - Xeon W-10855M 2.8 GHz, 128 GB RAM @ 2933 MHz, RTX 5000 16 GB (TU104) - CPU only - Gemma 2 27B
micha@LAPTOP-M4VQDR8K MINGW64 /c/wse_github/obrienlabsdev/machine-learning/environments/windows/src/google-gemma (main)
$ python gemma-gpu.py
Loading checkpoint shards: 100%|##########| 24/24 [01:43<00:00, 4.33s/it]
genarate start: 09:37:54
<bos>how is gold made in collapsing neutron stars - specifically what is the ratio created during the beta and r process.
Answer:
Step 1/2
1. Gold is not created in collapsing neutron stars. Gold is created in the process of supernovae.
Step 2/2
2. The process of gold formation is called the r-process, and it occurs during the r-process.
2.1. The r-process is a process that occurs during the r-process.<eos>
end 09:45:57
(base)
Gemma 2 9B - CPU
Running gemma-7b across 2 x 24 GB GPUs works fine using https://obrienlabs.medium.com/running-the-larger-google-gemma-7b-35gb-llm-for-7x-inference-performance-gain-8b63019523bb
#os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
#os.environ["CUDA_VISIBLE_DEVICES"] = "1"
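One general PyTorch note on those env vars (not specific to this run): CUDA_VISIBLE_DEVICES only takes effect if it is set before CUDA is initialised, i.e. before the first torch.cuda call:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # must run before anything initialises CUDA

import torch
print(torch.cuda.device_count())  # should report 2 when both 4090s are exposed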
michael@13900b MINGW64 ~
$ nvidia-smi
Sat Jun 29 09:48:12 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.99 Driver Version: 555.99 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Driver-Model | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4090 WDDM | 00000000:01:00.0 Off | Off |
| 0% 44C P2 84W / 480W | 18302MiB / 24564MiB | 53% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 4090 WDDM | 00000000:02:00.0 On | Off |
| 30% 44C P2 108W / 480W | 15959MiB / 24564MiB | 44% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
For gemma-2-9b I am having issues only with multi-GPU setups: an OOM error, because PyTorch fills only one of the GPUs. If I set CUDA_VISIBLE_DEVICES to 1, I run with 1 GPU as expected and the rest of the model runs on CPU, with the expected PCIe slowdown.
1 GPU + CPU
2 GPU - only 1 recognized
import os
#os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
#os.environ["CUDA_VISIBLE_DEVICES"] = "1"
from transformers import AutoTokenizer, AutoModelForCausalLM
access_token='hf_cfTP..qH'
model = "google/gemma-2-9b" # not working for 2 GPUs
#model = "google/gemma-7b" # working for 2 GPUs
tokenizer = AutoTokenizer.from_pretrained(model, token=access_token)
# gpu
model = AutoModelForCausalLM.from_pretrained(model, device_map="auto", token=access_token)
# cpu
#model = AutoModelForCausalLM.from_pretrained(model,token=access_token)
input_text = "how is gold made in collapsing neutron stars - specifically what is the ratio created during the beta and r process."
# gpu
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
# cpu
#input_ids = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**input_ids, max_new_tokens=10000)
print(tokenizer.decode(outputs[0]))
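A quick way to see where accelerate actually placed the layers (a diagnostic sketch; assumes `model` was loaded above with device_map="auto"):

import torch  # the script above does not import torch directly

# transformers records the per-module placement chosen by accelerate
print(model.hf_device_map)  # e.g. {'model.embed_tokens': 0, ..., 'lm_head': 'cpu'}
for i in range(torch.cuda.device_count()):
    print(f"cuda:{i}", round(torch.cuda.memory_allocated(i) / 2**30, 1), "GiB allocated")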
We get RAM allocated on both GPUs, but generation eventually fails:
michael@13900b MINGW64 ~
$ nvidia-smi
Sat Jun 29 09:51:22 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.99 Driver Version: 555.99 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Driver-Model | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4090 WDDM | 00000000:01:00.0 Off | Off |
| 0% 40C P8 27W / 480W | 19760MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 4090 WDDM | 00000000:02:00.0 On | Off |
| 0% 44C P2 87W / 480W | 14915MiB / 24564MiB | 10% Default |
| | | N/A |
michael@13900b MINGW64 /c/wse_github/obrienlabsdev/machine-learning/environments/windows/src/google-gemma (main)
$ python gemma-gpu.py
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:11<00:00, 1.44s/it]
genarate start: 09:40:57
Traceback (most recent call last):
File "C:\wse_github\obrienlabsdev\machine-learning\environments\windows\src\google-gemma\gemma-gpu.py", line 25, in <module>
outputs = model.generate(**input_ids, max_new_tokens=10000)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\michael\AppData\Roaming\Python\Python311\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\michael\AppData\Roaming\Python\Python311\site-packages\transformers\generation\utils.py", line 1744, in generate
model_kwargs["past_key_values"] = self._get_cache(
^^^^^^^^^^^^^^^^
File "C:\Users\michael\AppData\Roaming\Python\Python311\site-packages\transformers\generation\utils.py", line 1435, in _get_cache
self._cache = cache_cls(
^^^^^^^^^^
File "C:\Users\michael\AppData\Roaming\Python\Python311\site-packages\transformers\cache_utils.py", line 1011, in __init__
new_layer_value_cache = torch.zeros(cache_shape, dtype=self.dtype, device=device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 0 has a total capacity of 23.99 GiB of which 0 bytes is free. Of the allocated memory 22.99 GiB is allocated by PyTorch, and 65.26 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
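One mitigation worth noting at this point (a sketch only, not verified in this run) is to cap per-GPU memory with max_memory so accelerate leaves headroom for the generate-time KV cache instead of filling GPU 0:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b",
    device_map="auto",
    # cap each 4090 below its 24 GiB so the KV cache has room; the exact
    # numbers here are guesses, not tested values
    max_memory={0: "20GiB", 1: "20GiB", "cpu": "96GiB"},
    token=access_token,  # as defined in the script above
)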
Fixed by changing the torch_dtype:
model = AutoModelForCausalLM.from_pretrained(
    model,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    token=access_token)
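The arithmetic behind the fix (approximate, assuming ~9B parameters): the default float32 load needs about 9 x 10^9 x 4 B ≈ 36 GB of weights, which spans both 24 GB cards with almost no headroom for the generate-time KV cache, while bfloat16 halves the weights to roughly 18 GB, consistent with the ~23 GB total across the two cards in the nvidia-smi output below.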
michael@13900b MINGW64 ~
$ nvidia-smi
Sat Jun 29 10:07:22 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.99 Driver Version: 555.99 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Driver-Model | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4090 WDDM | 00000000:01:00.0 Off | Off |
| 0% 41C P3 68W / 480W | 12686MiB / 24564MiB | 50% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 4090 WDDM | 00000000:02:00.0 On | Off |
| 0% 43C P2 71W / 480W | 10287MiB / 24564MiB | 26% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
import os
import torch # https://huggingface.co/google/gemma-2-9b-it
#os.environ["CUDA_VISIBLE_DEVICES"] = "1,0"
#os.environ["CUDA_VISIBLE_DEVICES"] = "1"
#os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
#os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
from transformers import AutoTokenizer, AutoModelForCausalLM
from datetime import datetime
access_token='hf_cfTKXXCQqH'
model = "google/gemma-2-9b-it"
#model = "google/gemma-7b"
tokenizer = AutoTokenizer.from_pretrained(model, token=access_token)
# gpu
model = AutoModelForCausalLM.from_pretrained(
model,
device_map="auto",
torch_dtype=torch.bfloat16,
token=access_token)
# cpu
#model = AutoModelForCausalLM.from_pretrained(model,token=access_token)
input_text = "how is gold made in collapsing neutron stars - specifically what is the ratio created during the beta and r process."
time_start = datetime.now().strftime("%H:%M:%S")
print("genarate start: ", datetime.now().strftime("%H:%M:%S"))
# gpu
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
# cpu
#input_ids = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**input_ids, max_new_tokens=10000)
print(tokenizer.decode(outputs[0]))
michael@13900b MINGW64 /c/wse_github/obrienlabsdev/machine-learning/environments/windows/src/google-gemma (main)
$ python gemma-gpu.py
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00, 1.47s/it]
genarate start: 10:07:21
C:\Users\michael\AppData\Roaming\Python\Python311\site-packages\transformers\models\gemma2\modeling_gemma2.py:577: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:263.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
<bos>how is gold made in collapsing neutron stars - specifically what is the ratio created during the beta and r process.<end_of_turn>
<eos>
end 10:07:22
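The near-empty completion above is likely because gemma-2-9b-it expects its chat template rather than raw text; a hypothetical variant (not run here):

# Wrap the prompt in the instruction-tuned chat format before generating.
messages = [{"role": "user", "content": input_text}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")
outputs = model.generate(input_ids, max_new_tokens=1024)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))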
gemma-2-9b is working as well, but slower:
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4090 WDDM | 00000000:01:00.0 Off | Off |
| 30% 45C P2 162W / 480W | 12686MiB / 24564MiB | 45% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 4090 WDDM | 00000000:02:00.0 On | Off |
| 0% 52C P2 150W / 480W | 10302MiB / 24564MiB | 44% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
see https://github.com/ObrienlabsDev/machine-learning/issues/28
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
Testing gemma-2-9b using float16 - 23 GB instead of 32 GB:
torch_dtype=torch.float16,  # was bfloat16
Fix for gemma-2-9b: run with bfloat16.
https://huggingface.co/google/gemma-2-27b/tree/main
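A quick check that can help when float16 and bfloat16 behave differently (a minimal sketch; the RTX 4090s here do support bf16):

import torch

# Ada-generation cards (RTX 4090) support bfloat16 natively; older GPUs may not,
# which changes both memory behaviour and numerical stability.
print(torch.cuda.is_bf16_supported())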
Times
code change