LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

ROCm v1.43 crashes with dual GPUs #436

Open ccbadd opened 1 year ago

ccbadd commented 1 year ago

I downloaded the current version 1.43 for ROCm and tried loading a model that fits in the VRAM of a single GPU. I have two W6800 32G cards. That works fine, so I decided to use the same model in its Q8 version, which needs 34G of VRAM. It loads just like the smaller one, but drops back to a command prompt when loading finishes and the GPUs' memory is released.

This is running Windows 11 22H2 with 96G of RAM and two Radeon Pro W6800s. I do have ROCm 5.5 (the latest available for Windows) installed, but the docs seem to say it is not needed if you are using the precompiled binary.

Logs don't show anything.

LostRuins commented 1 year ago

Sorry for the confusion - the 1.43 version on my repo does not support ROCm out of the box on Windows for the precompiled binaries. Instead, check out the binaries at YellowRoseCx's fork: https://github.com/YellowRoseCx/koboldcpp-rocm/releases/tag/Windows-v1.43-ROCm

To see the error before the window closes, try running the exe from a terminal instead of double-clicking it - that will show you any errors.
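
For example, something like this from a PowerShell window (binary name, model path, and flags below mirror the ones used elsewhere in this thread; adjust them for your setup):

```
# Launch from a terminal so any crash message stays on screen after the program exits.
cd D:\koboldcpp
.\koboldcpp_rocm.exe --model D:\models\wizardcoder-python-34b-v1.0.Q8_0.gguf --usecublas normal mmq --gpulayers 99 --contextsize 2048
```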

LostRuins commented 1 year ago

@yellowrosecx I think maybe to avoid confusing people, I should label the dropdown in my version as cublas instead of cublas/hipblas, and then add a note to direct rocm users to your fork to obtain rocm binaries?

ccbadd commented 1 year ago

The one I'm testing is the YellowRoseCx version. I'm using the same koboldcpp_rocm.exe binary from the link you posted above. I'm seeing the same thing happen with llama.cpp directly as well, so I do believe the problem is in that code.

LostRuins commented 1 year ago

Okay, then try to capture the crash message by running it using the terminal.

YellowRoseCx commented 1 year ago

@yellowrosecx I think maybe to avoid confusing people, I should label the dropdown in my version as cublas instead of cublas/hipblas, and then add a note to direct rocm users to your fork to obtain rocm binaries?

You want to remove hipBLAS from your branch?

ccbadd commented 1 year ago

Here is the terminal output with just one card:

```
***
Welcome to KoboldCpp - Version 1.43.YR1
For command line arguments, please refer to --help
***
Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required.
Initializing dynamic library: koboldcpp_cublas.dll
==========
Namespace(bantokens=None, blasbatchsize=512, blasthreads=12, config=None, contextsize=2048, debugmode=False, forceversion=0, gpulayers=99, highpriority=False, hordeconfig=None, host='', launch=True, lora=None, model=None, model_param='D:/models/wizardcoder-python-34b-v1.0.Q8_0.gguf', noavx2=False, noblas=False, nommap=False, port=5001, port_param=5001, psutil_set_threads=False, ropeconfig=[0.0, 10000.0], skiplauncher=False, smartcontext=False, stream=False, tensor_split=None, threads=12, unbantokens=False, useclblast=None, usecublas=['normal', 'mmq'], usemirostat=None, usemlock=False)
==========
Loading model: D:\models\wizardcoder-python-34b-v1.0.Q8_0.gguf
[Threads: 12, BlasThreads: 12, SmartContext: False]

---
Identified as LLAMA model: (ver 6)
Attempting to Load...
---
Using automatic RoPE scaling (scale:1.000, base:10000.0)
System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
ggml_init_cublas: found 1 ROCm devices:
  Device 0: AMD Radeon PRO W6800, compute capability 10.3
llama_model_loader: loaded meta data with 17 key-value pairs and 435 tensors from D:\models\wizardcoder-python-34b-v1.0.Q8_0.gguf
llm_load_print_meta: format         = GGUF V1 (support until nov 2023)
llm_load_print_meta: arch           = llama
llm_load_print_meta: vocab type     = SPM
llm_load_print_meta: n_vocab        = 32001
llm_load_print_meta: n_merges       = 0
llm_load_print_meta: n_ctx_train    = 16384
llm_load_print_meta: n_ctx          = 2048
llm_load_print_meta: n_embd         = 8192
llm_load_print_meta: n_head         = 64
llm_load_print_meta: n_head_kv      = 8
llm_load_print_meta: n_layer        = 48
llm_load_print_meta: n_rot          = 128
llm_load_print_meta: n_gqa          = 8
llm_load_print_meta: f_norm_eps     = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff           = 22016
llm_load_print_meta: freq_base      = 1000000.0
llm_load_print_meta: freq_scale     = 1
llm_load_print_meta: model type     = 34B
llm_load_print_meta: model ftype    = unknown, may not work
llm_load_print_meta: model size     = 33.74 B
llm_load_print_meta: general.name   = LLaMA
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.14 MB
llm_load_tensors: using ROCm for GPU acceleration
llm_load_tensors: mem required  =  265.77 MB (+  384.00 MB per state)
llm_load_tensors: offloading 48 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloading v cache to GPU
llm_load_tensors: offloading k cache to GPU
llm_load_tensors: offloaded 51/51 layers to GPU
llm_load_tensors: VRAM used: 34313 MB
....................................................................................................
llama_new_context_with_model: kv self size  =  384.00 MB
llama_new_context_with_model: compute buffer total size =  305.47 MB
llama_new_context_with_model: VRAM scratch buffer: 304.00 MB
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold HTTP Server on port 5001
Please connect to custom endpoint at http://localhost:5001
```

And here is the one that crashes:


```
***
Welcome to KoboldCpp - Version 1.43.YR1
For command line arguments, please refer to --help
***
Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required.
Initializing dynamic library: koboldcpp_cublas.dll
==========
Namespace(bantokens=None, blasbatchsize=512, blasthreads=12, config=None, contextsize=2048, debugmode=False, forceversion=0, gpulayers=99, highpriority=False, hordeconfig=None, host='', launch=True, lora=None, model=None, model_param='D:/models/wizardcoder-python-34b-v1.0.Q8_0.gguf', noavx2=False, noblas=False, nommap=False, port=5001, port_param=5001, psutil_set_threads=False, ropeconfig=[0.0, 10000.0], skiplauncher=False, smartcontext=False, stream=False, tensor_split=None, threads=12, unbantokens=False, useclblast=None, usecublas=['normal', 'mmq'], usemirostat=None, usemlock=False)
==========
Loading model: D:\models\wizardcoder-python-34b-v1.0.Q8_0.gguf
[Threads: 12, BlasThreads: 12, SmartContext: False]

---
Identified as LLAMA model: (ver 6)
Attempting to Load...
---
Using automatic RoPE scaling (scale:1.000, base:10000.0)
System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
ggml_init_cublas: found 2 ROCm devices:
  Device 0: AMD Radeon PRO W6800, compute capability 10.3
  Device 1: AMD Radeon PRO W6800, compute capability 10.3
llama_model_loader: loaded meta data with 17 key-value pairs and 435 tensors from D:\models\wizardcoder-python-34b-v1.0.Q8_0.gguf
llm_load_print_meta: format         = GGUF V1 (support until nov 2023)
llm_load_print_meta: arch           = llama
llm_load_print_meta: vocab type     = SPM
llm_load_print_meta: n_vocab        = 32001
llm_load_print_meta: n_merges       = 0
llm_load_print_meta: n_ctx_train    = 16384
llm_load_print_meta: n_ctx          = 2048
llm_load_print_meta: n_embd         = 8192
llm_load_print_meta: n_head         = 64
llm_load_print_meta: n_head_kv      = 8
llm_load_print_meta: n_layer        = 48
llm_load_print_meta: n_rot          = 128
llm_load_print_meta: n_gqa          = 8
llm_load_print_meta: f_norm_eps     = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff           = 22016
llm_load_print_meta: freq_base      = 1000000.0
llm_load_print_meta: freq_scale     = 1
llm_load_print_meta: model type     = 34B
llm_load_print_meta: model ftype    = all F32
llm_load_print_meta: model size     = 33.74 B
llm_load_print_meta: general.name   = LLaMA
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.14 MB
llm_load_tensors: using ROCm for GPU acceleration
ggml_cuda_set_main_device: using device 0 (AMD Radeon PRO W6800) as main device
llm_load_tensors: mem required  =  265.77 MB (+  384.00 MB per state)
llm_load_tensors: offloading 48 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloading v cache to GPU
llm_load_tensors: offloading k cache to GPU
llm_load_tensors: offloaded 51/51 layers to GPU
llm_load_tensors: VRAM used: 34313 MB
....................................................................................................
llama_new_context_with_model: kv self size  =  384.00 MB
llama_new_context_with_model: compute buffer total size =  305.47 MB
llama_new_context_with_model: VRAM scratch buffer: 304.00 MB
(tgrocm) PS D:\koboldcpp>
```
YellowRoseCx commented 1 year ago

With multi GPU on AMD you have to use --lowvram for some reason
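
For reference, a sketch of the same kind of launch with low-VRAM mode enabled (it shows up as usecublas=['lowvram', 'mmq'] in koboldcpp's argument dump; the model path is an example):

```
# Same launch as above, but with 'lowvram' in place of 'normal' for --usecublas.
.\koboldcpp_rocm.exe --model D:\models\wizardcoder-python-34b-v1.0.Q8_0.gguf --usecublas lowvram mmq --gpulayers 99 --contextsize 2048
```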

ccbadd commented 1 year ago

That made it work, YellowRoseCx! Thanks.

ccbadd commented 1 year ago

With multi GPU on AMD you have to use --lowvram for some reason

This makes inference very slow. Is this due to a bug in the Windows version of ROCm? I don't have this problem under Ubuntu.

YellowRoseCx commented 1 year ago

With multi GPU on AMD you have to use --lowvram for some reason

This makes inference very slow. Is this due to a bug in the Windows version of ROCm? I don't have this problem under Ubuntu.

Idk, I always use lowvram even with 1 GPU; it only slows down my processing speed by a few seconds, and it makes that up in generation speed since I can add more layers. Windows doesn't support Vega 64 for ROCm, which is my 2nd GPU. The only difference between the Windows and Linux koboldcpp ROCm builds is the AMD ROCm drivers.

Are you sure it's not running out of VRAM and that's what is slowing it down? Windows natively uses more VRAM than Linux.

But on Ubuntu I've not really noticed any difference in speed between lowvram on or off.
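
To rule out the VRAM question above, a quick check on the Linux side (rocm-smi ships with ROCm on Linux; as far as I know it isn't part of the Windows driver stack, where Task Manager's GPU page is the closest equivalent):

```
# Report per-GPU VRAM usage while the model is loaded
rocm-smi --showmeminfo vram
# Or refresh the readout every 2 seconds
watch -n 2 rocm-smi --showmeminfo vram
```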

ccbadd commented 1 year ago

YellowRose, I have a pair of W6800s, each with 32 GB of VRAM. I don't come close to running out of VRAM with 70B Q5 models or smaller, but the low-VRAM option cuts the speed by more than half. With the current release on Windows it is crashing after loading even with the low-VRAM option set. I think it is ignoring the option, as this is still logged to the screen:

llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU

I end up using 22GB of VRAM on card 0 and 24GB on card 1.

Here is the entire output:


```
***
Welcome to KoboldCpp - Version 1.45.2.yr0-ROCm
For command line arguments, please refer to --help
***
Attempting to use hipBLAS library for faster prompt ingestion. A compatible AMD GPU will be required.
Initializing dynamic library: koboldcpp_hipblas.dll
==========
Namespace(model=None, model_param='D:/models/synthia-70b-v1.2.Q5_K_S.gguf', port=5001, port_param=5001, host='', launch=True, lora=None, config=None, threads=15, blasthreads=15, highpriority=False, contextsize=2048, blasbatchsize=512, ropeconfig=[0.0, 10000.0], smartcontext=False, bantokens=None, forceversion=0, nommap=False, usemlock=False, noavx2=False, debugmode=False, skiplauncher=False, hordeconfig=None, noblas=False, useclblast=None, usecublas=['lowvram', 'mmq'], gpulayers=99, tensor_split=None, onready='', multiuser=False, foreground=False, psutil_set_threads=False, stream=False, unbantokens=False, usemirostat=None)
==========
Loading model: D:\models\synthia-70b-v1.2.Q5_K_S.gguf
[Threads: 15, BlasThreads: 15, SmartContext: False]

---
Identified as LLAMA model: (ver 6)
Attempting to Load...
---
Using automatic RoPE scaling (scale:1.000, base:10000.0)
System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
ggml_init_cublas: found 2 ROCm devices:
  Device 0: AMD Radeon PRO W6800, compute capability 10.3
  Device 1: AMD Radeon PRO W6800, compute capability 10.3
llama_model_loader: loaded meta data with 19 key-value pairs and 723 tensors from D:\models\synthia-70b-v1.2.Q5_K_S.gguf (version GGUF V2 (latest))
llm_load_print_meta: format           = GGUF V2 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 80
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = mostly Q4_1, some F16
llm_load_print_meta: model params     = 68.98 B
llm_load_print_meta: model size       = 44.20 GiB (5.50 BPW)
llm_load_print_meta: general.name   = migtissera_synthia-70b-v1.2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.23 MB
llm_load_tensors: using ROCm for GPU acceleration
ggml_cuda_set_main_device: using device 0 (AMD Radeon PRO W6800) as main device
llm_load_tensors: mem required  =  172.11 MB
llm_load_tensors: offloading 80 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 83/83 layers to GPU
llm_load_tensors: VRAM used: 45090.11 MB
....................................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 640.00 MB
llama_new_context_with_model: kv self size  =  640.00 MB
llama_new_context_with_model: compute buffer total size = 313.88 MB
llama_new_context_with_model: VRAM scratch buffer: 308.00 MB
llama_new_context_with_model: total VRAM used: 46038.11 MB (model: 45090.11 MB, context: 948.00 MB)
(base) PS D:\koboldcpp>
```
YellowRoseCx commented 1 year ago

I wasn't able to use 2 GPUs on Windows either, but mine were 2 different cards (6800 XT and Vega 64), and I don't know why. I'm somewhat interested in what your normal use case is with 2 workstation GPUs running Windows, tbh; have you tried Linux? On Linux, you could run GPTQ models with that much VRAM using PyTorch. Hopefully Windows ROCm keeps getting better at supporting AI features. About the lowvram option: llama.cpp upstream removed it because it wasn't working correctly, so that's probably why you're not seeing it make a difference.
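
As a quick sanity check for the PyTorch route (assuming a ROCm build of PyTorch, which exposes AMD GPUs through the torch.cuda API):

```
# Should print 2 and both W6800s if the ROCm PyTorch build sees them
python -c "import torch; print(torch.cuda.device_count()); print([torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())])"
```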

ccbadd commented 1 year ago

Thanks for the reply. I have the two workstation GPUs in my Windows 11 machine just for experimenting with code generation. I also have an Ubuntu server with 2 Instinct MI100s so I can get some additional features for my home automation. Everything works fine on the Linux machine with both llama.cpp and koboldcpp, but that is a newer version of ROCm. I might move the two W6800s over to the Linux server, since it can handle two more cards. I really want to know whether the current software can handle two different GPU models (MI100s and W6800s), and then see how I could get everything I want from that setup.

FYI, llama.cpp works fine under Windows with 2 AMD GPUs and no low-VRAM option.
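
For comparison, a hypothetical llama.cpp invocation of the kind referred to above (the binary name varies by build; -m, -ngl, -c and --tensor-split are standard llama.cpp options, with the 1,1 split dividing the layers evenly across the two cards):

```
# Example llama.cpp run splitting the model across both GPUs, no low-VRAM flag
.\main.exe -m D:\models\synthia-70b-v1.2.Q5_K_S.gguf -ngl 99 -c 2048 --tensor-split 1,1 -p "Hello"
```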