obriensystems opened this issue 4 months ago
CUDA 12.5 on an A6000 - updating transformers:
pip install git+https://github.com/huggingface/transformers.git
Successfully built transformers
Installing collected packages: huggingface-hub, transformers
Attempting uninstall: huggingface-hub
Found existing installation: huggingface-hub 0.22.2
Uninstalling huggingface-hub-0.22.2:
Successfully uninstalled huggingface-hub-0.22.2
Attempting uninstall: transformers
Found existing installation: transformers 4.40.0
Uninstalling transformers-4.40.0:
Successfully uninstalled transformers-4.40.0
Successfully installed huggingface-hub-0.23.4 transformers-4.43.0.dev0
[notice] A new release of pip is available: 24.0 -> 24.1.1
[notice] To update, run: python.exe -m pip install --upgrade pip
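A quick sanity check of the new install (a minimal sketch; assumes a CUDA-enabled PyTorch is already present):

# Confirm the source build of transformers and that CUDA is visible to PyTorch.
import torch
import transformers

print(transformers.__version__)                      # expect 4.43.0.dev0 after the install above
print(torch.__version__, torch.version.cuda)         # CUDA 12.x build
print(torch.cuda.is_available(), torch.cuda.get_device_name(0))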
working
michael@14900c MINGW64 /c/wse_github/obrienlabsdev/machine-learning/environments/windows/src/google-gemma (main)
$ python gemma-gpu.py
Downloading shards: 0%| | 0/24 [00:00<?, ?it/s]
Hugging Face download started at 14:15 on a 1 Gbps link - initial ETA 7h.
24 shards x ~5 GB = ~120 GB. Likely will only run in CPU RAM - min 192 GB.
Watch for the flash attention warning.
GPU + CPU vs CPU only.
Loading checkpoint shards: 100%|##########| 24/24 [00:14<00:00, 1.65it/s]
WARNING:root:Some parameters are on the meta device device because they were offloaded to the cpu.
C:\opt\Python312\Lib\site-packages\transformers\models\gemma2\modeling_gemma2.py:577: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:263.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
Fix: https://huggingface.co/google/gemma-2-27b#other-optimizations
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
+   attn_implementation="flash_attention_2"
).to(0)
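Applied to the loader in gemma-gpu.py, that change would look roughly like the sketch below (assumes the flash-attn package is installed; model_id and access_token stand in for the values used elsewhere in this issue):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-27b"   # assumption: the model from this run
access_token = "hf_..."           # redacted, as elsewhere in this issue

tokenizer = AutoTokenizer.from_pretrained(model_id, token=access_token)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # requires `pip install flash-attn`
    token=access_token,
).to(0)  # as in the model-card snippet; 27B fp16 weights (~54 GB) may still need CPU offload on a 48 GB card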
84 GB RAM + 47 GB VRAM = 131 GB (vs. ~120 GB actual on CPU only)
192 GB RAM - 13900K with dual RTX 4090s
$ python gemma-gpu.py
model.safetensors.index.json: 100%|█████████████████████████████████████████████████████████████████████| 42.8k/42.8k [00:00<00:00, 6.00MB/s]
C:\Users\michael\AppData\Roaming\Python\Python311\site-packages\huggingface_hub\file_download.py:157: UserWarning: `huggingface_hub` cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in C:\Users\michael\.cache\huggingface\hub\models--google--gemma-2-27b. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the `HF_HUB_DISABLE_SYMLINKS_WARNING` environment variable. For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
warnings.warn(message)
model-00001-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.95G/4.95G [00:44<00:00, 111MB/s]
model-00002-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.53G/4.53G [00:41<00:00, 109MB/s]
model-00003-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.53G/4.53G [00:41<00:00, 110MB/s]
model-00004-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.53G/4.53G [00:40<00:00, 111MB/s]
model-00005-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.53G/4.53G [00:40<00:00, 111MB/s]
model-00006-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.53G/4.53G [00:40<00:00, 111MB/s]
model-00007-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.53G/4.53G [00:41<00:00, 108MB/s]
model-00008-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.53G/4.53G [00:41<00:00, 108MB/s]
model-00009-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.53G/4.53G [00:41<00:00, 108MB/s]
model-00010-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.53G/4.53G [00:41<00:00, 108MB/s]
model-00011-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.53G/4.53G [00:41<00:00, 109MB/s]
model-00012-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.53G/4.53G [00:41<00:00, 109MB/s]
model-00013-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.53G/4.53G [00:42<00:00, 107MB/s]
model-00014-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.53G/4.53G [00:41<00:00, 110MB/s]
model-00015-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.53G/4.53G [00:41<00:00, 109MB/s]
model-00016-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.53G/4.53G [00:41<00:00, 110MB/s]
model-00017-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.53G/4.53G [00:42<00:00, 108MB/s]
model-00018-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.53G/4.53G [00:41<00:00, 109MB/s]
model-00019-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.53G/4.53G [00:41<00:00, 108MB/s]
model-00020-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.53G/4.53G [00:41<00:00, 109MB/s]
model-00021-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.53G/4.53G [00:41<00:00, 109MB/s]
model-00022-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.53G/4.53G [00:41<00:00, 109MB/s]
model-00023-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.53G/4.53G [00:43<00:00, 103MB/s]
model-00024-of-00024.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.30G/4.30G [00:39<00:00, 108MB/s]
Downloading shards: 100%|████████████████████████████████████████████████████████████████████████████████████| 24/24 [16:44<00:00, 41.84s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████| 24/24 [00:26<00:00, 1.10s/it]
generation_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████| 173/173 [00:00<?, ?B/s]
genarate start: 15:56:38
<bos>how is gold made in collapsing neutron stars - specifically what is the ratio created during the beta and r process.
Answer:
Step 1/2
1. Gold is not created in collapsing neutron stars. Gold is created in the process of supernovae.
Step 2/2
2. The process of gold formation is called the r-process, and it occurs during the r-process.
2.1. The r-process is a process that occurs during the r-process.<eos>
end 16:00:30
Gemma 2 9B - 42 GB
Lenovo P17 - Xeon W-10855M 2.8 GHz, 128 GB RAM @ 2933 MHz, RTX 5000 16 GB (TU104) - CPU only - Gemma 2 27B
micha@LAPTOP-M4VQDR8K MINGW64 /c/wse_github/obrienlabsdev/machine-learning/environments/windows/src/google-gemma (main)
$ python gemma-gpu.py
Loading checkpoint shards: 100%|##########| 24/24 [01:43<00:00, 4.33s/it]
genarate start: 09:37:54
<bos>how is gold made in collapsing neutron stars - specifically what is the ratio created during the beta and r process.
Answer:
Step 1/2
1. Gold is not created in collapsing neutron stars. Gold is created in the process of supernovae.
Step 2/2
2. The process of gold formation is called the r-process, and it occurs during the r-process.
2.1. The r-process is a process that occurs during the r-process.<eos>
end 09:45:57
(base)
Gemma 2 9B - CPU
Running gemma-7b across 2 x 24 GB GPUs works fine using https://obrienlabs.medium.com/running-the-larger-google-gemma-7b-35gb-llm-for-7x-inference-performance-gain-8b63019523bb
#os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
#os.environ["CUDA_VISIBLE_DEVICES"] = "1"
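One general PyTorch note on those env vars (not specific to this run): CUDA_VISIBLE_DEVICES only takes effect if it is set before CUDA is initialised, i.e. before the first torch.cuda call:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # must run before anything initialises CUDA

import torch
print(torch.cuda.device_count())  # should report 2 when both 4090s are exposed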
michael@13900b MINGW64 ~
$ nvidia-smi
Sat Jun 29 09:48:12 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.99 Driver Version: 555.99 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Driver-Model | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4090 WDDM | 00000000:01:00.0 Off | Off |
| 0% 44C P2 84W / 480W | 18302MiB / 24564MiB | 53% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 4090 WDDM | 00000000:02:00.0 On | Off |
| 30% 44C P2 108W / 480W | 15959MiB / 24564MiB | 44% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
For gemma-2-9b I am having issues only with multi-GPU setups: an OOM error, because PyTorch fills only one of the GPUs. If I set CUDA_VISIBLE_DEVICES to 1, I run with 1 GPU as expected and the rest of the model runs on CPU, with the expected PCIe slowdown.
1 GPU + CPU
2 GPU - only 1 recognized
import os
#os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
#os.environ["CUDA_VISIBLE_DEVICES"] = "1"
from transformers import AutoTokenizer, AutoModelForCausalLM
access_token='hf_cfTP..qH'
model = "google/gemma-2-9b" # not working for 2 GPUs
#model = "google/gemma-7b" # working for 2 GPUs
tokenizer = AutoTokenizer.from_pretrained(model, token=access_token)
# gpu
model = AutoModelForCausalLM.from_pretrained(model, device_map="auto", token=access_token)
# cpu
#model = AutoModelForCausalLM.from_pretrained(model,token=access_token)
input_text = "how is gold made in collapsing neutron stars - specifically what is the ratio created during the beta and r process."
# gpu
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
# cpu
#input_ids = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**input_ids, max_new_tokens=10000)
print(tokenizer.decode(outputs[0]))
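A quick way to see where accelerate actually placed the layers (a diagnostic sketch; assumes `model` was loaded above with device_map="auto"):

import torch  # the script above does not import torch directly

# transformers records the per-module placement chosen by accelerate
print(model.hf_device_map)  # e.g. {'model.embed_tokens': 0, ..., 'lm_head': 'cpu'}
for i in range(torch.cuda.device_count()):
    print(f"cuda:{i}", round(torch.cuda.memory_allocated(i) / 2**30, 1), "GiB allocated")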
We get RAM allocated on both GPUs, but generation eventually fails:
michael@13900b MINGW64 ~
$ nvidia-smi
Sat Jun 29 09:51:22 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.99 Driver Version: 555.99 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Driver-Model | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4090 WDDM | 00000000:01:00.0 Off | Off |
| 0% 40C P8 27W / 480W | 19760MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 4090 WDDM | 00000000:02:00.0 On | Off |
| 0% 44C P2 87W / 480W | 14915MiB / 24564MiB | 10% Default |
| | | N/A |
michael@13900b MINGW64 /c/wse_github/obrienlabsdev/machine-learning/environments/windows/src/google-gemma (main)
$ python gemma-gpu.py
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:11<00:00, 1.44s/it]
genarate start: 09:40:57
Traceback (most recent call last):
File "C:\wse_github\obrienlabsdev\machine-learning\environments\windows\src\google-gemma\gemma-gpu.py", line 25, in <module>
outputs = model.generate(**input_ids, max_new_tokens=10000)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\michael\AppData\Roaming\Python\Python311\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\michael\AppData\Roaming\Python\Python311\site-packages\transformers\generation\utils.py", line 1744, in generate
model_kwargs["past_key_values"] = self._get_cache(
^^^^^^^^^^^^^^^^
File "C:\Users\michael\AppData\Roaming\Python\Python311\site-packages\transformers\generation\utils.py", line 1435, in _get_cache
self._cache = cache_cls(
^^^^^^^^^^
File "C:\Users\michael\AppData\Roaming\Python\Python311\site-packages\transformers\cache_utils.py", line 1011, in __init__
new_layer_value_cache = torch.zeros(cache_shape, dtype=self.dtype, device=device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 0 has a total capacity of 23.99 GiB of which 0 bytes is free. Of the allocated memory 22.99 GiB is allocated by PyTorch, and 65.26 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
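One mitigation worth noting at this point (a sketch only, not verified in this run) is to cap per-GPU memory with max_memory so accelerate leaves headroom for the generate-time KV cache instead of filling GPU 0:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b",
    device_map="auto",
    # cap each 4090 below its 24 GiB so the KV cache has room; the exact
    # numbers here are guesses, not tested values
    max_memory={0: "20GiB", 1: "20GiB", "cpu": "96GiB"},
    token=access_token,  # as defined in the script above
)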
Fixed by changing the torch_dtype:
model = AutoModelForCausalLM.from_pretrained(
    model,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    token=access_token)
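The arithmetic behind the fix (approximate, assuming ~9B parameters): the default float32 load needs about 9 x 10^9 x 4 B ≈ 36 GB of weights, which spans both 24 GB cards with almost no headroom for the generate-time KV cache, while bfloat16 halves the weights to roughly 18 GB, consistent with the ~23 GB total across the two cards in the nvidia-smi output below.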
michael@13900b MINGW64 ~
$ nvidia-smi
Sat Jun 29 10:07:22 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.99 Driver Version: 555.99 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Driver-Model | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4090 WDDM | 00000000:01:00.0 Off | Off |
| 0% 41C P3 68W / 480W | 12686MiB / 24564MiB | 50% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 4090 WDDM | 00000000:02:00.0 On | Off |
| 0% 43C P2 71W / 480W | 10287MiB / 24564MiB | 26% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
import os
import torch # https://huggingface.co/google/gemma-2-9b-it
#os.environ["CUDA_VISIBLE_DEVICES"] = "1,0"
#os.environ["CUDA_VISIBLE_DEVICES"] = "1"
#os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
#os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
from transformers import AutoTokenizer, AutoModelForCausalLM
from datetime import datetime
access_token='hf_cfTKXXCQqH'
model = "google/gemma-2-9b-it"
#model = "google/gemma-7b"
tokenizer = AutoTokenizer.from_pretrained(model, token=access_token)
# gpu
model = AutoModelForCausalLM.from_pretrained(
model,
device_map="auto",
torch_dtype=torch.bfloat16,
token=access_token)
# cpu
#model = AutoModelForCausalLM.from_pretrained(model,token=access_token)
input_text = "how is gold made in collapsing neutron stars - specifically what is the ratio created during the beta and r process."
time_start = datetime.now().strftime("%H:%M:%S")
print("genarate start: ", datetime.now().strftime("%H:%M:%S"))
# gpu
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
# cpu
#input_ids = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**input_ids, max_new_tokens=10000)
print(tokenizer.decode(outputs[0]))
michael@13900b MINGW64 /c/wse_github/obrienlabsdev/machine-learning/environments/windows/src/google-gemma (main)
$ python gemma-gpu.py
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00, 1.47s/it]
genarate start: 10:07:21
C:\Users\michael\AppData\Roaming\Python\Python311\site-packages\transformers\models\gemma2\modeling_gemma2.py:577: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:263.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
<bos>how is gold made in collapsing neutron stars - specifically what is the ratio created during the beta and r process.<end_of_turn>
<eos>
end 10:07:22
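The near-empty completion above is likely because gemma-2-9b-it expects its chat template rather than raw text; a hypothetical variant (not run here):

# Wrap the prompt in the instruction-tuned chat format before generating.
messages = [{"role": "user", "content": input_text}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")
outputs = model.generate(input_ids, max_new_tokens=1024)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))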
gemma-2-9b is working as well, but slower:
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4090 WDDM | 00000000:01:00.0 Off | Off |
| 30% 45C P2 162W / 480W | 12686MiB / 24564MiB | 45% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 4090 WDDM | 00000000:02:00.0 On | Off |
| 0% 52C P2 150W / 480W | 10302MiB / 24564MiB | 44% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
see https://github.com/ObrienlabsDev/machine-learning/issues/28
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
Testing gemma-2-9b using float16 - 23 GB instead of 32 GB:
torch_dtype=torch.float16,  # was bfloat16
Fix for gemma-2-9b: run with bfloat16.
https://huggingface.co/google/gemma-2-27b/tree/main
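A quick check that can help when float16 and bfloat16 behave differently (a minimal sketch; the RTX 4090s here do support bf16):

import torch

# Ada-generation cards (RTX 4090) support bfloat16 natively; older GPUs may not,
# which changes both memory behaviour and numerical stability.
print(torch.cuda.is_bf16_supported())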
Times
code change