ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Working with pcie x1 gen1 #5402

Closed · userbox020 closed this issue 5 months ago

userbox020 commented 5 months ago

Hello

I have been testing llama.cpp with Ubuntu 22.04 and ROCm 5.6. It took me about 3 months to set up multi-GPU: one RX 6900, two RX 6800 and one RX 6700, all running together on PCIe x1 gen1.

[image attachment: the multi-GPU rig]

llama.cpp seems to be the only LLM loader that works with this setup, but I have noticed that when the model is above 30 GB it gets stuck loading. Sometimes it takes between 1 and 2 hours to load, but once loaded it does inference really fast. Other times it just stays stuck; the longest I have waited is 24 hours and it was still stuck, the dots didn't move.

It's weird because it only happens with models above 30 GB; all other models load fast and do inference fast.

What could be causing this? Any idea how I can debug it to find out what's going on?

Any idea, suggestion or help is very welcome, thanks.

slaren commented 5 months ago

Have you tried with --no-mmap?
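For reference, a minimal sketch of that with the stock llama.cpp CLI; the binary name, model path and layer count below are placeholders, not the exact command used in this setup:

./main -m models/codellama-70b-python.Q4_K_M.gguf -ngl 81 --no-mmap

--no-mmap makes llama.cpp read the whole file into an ordinary buffer up front instead of memory-mapping it, which avoids demand-paging the weights from disk during the GPU upload and can make load times much more predictable on some systems.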

userbox020 commented 5 months ago

Have you tried with --no-mmap?

@slaren beautiful bro, now it's taking about 5 minutes to load CodeLlama 70B

20:08:25-911759 INFO     Loading codellama-70b-python.Q4_K_M.gguf                                      
20:08:26-000032 INFO     llama.cpp weights detected: models/codellama-70b-python.Q4_K_M.gguf 
llm_load_tensors: ggml ctx size =    1.38 MiB
llm_load_tensors: offloading 80 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 81/81 layers to GPU
llm_load_tensors:      ROCm0 buffer size = 10944.75 MiB
llm_load_tensors:      ROCm1 buffer size =  7655.06 MiB
llm_load_tensors:      ROCm2 buffer size = 10533.06 MiB
llm_load_tensors:      ROCm3 buffer size = 10229.84 MiB
llm_load_tensors:  ROCm_Host buffer size =   140.70 MiB
....................................................................................................
20:13:22-266706 INFO     LOADER: llama.cpp                                                             
20:13:22-267434 INFO     TRUNCATION LENGTH: 4096                                                       
20:13:22-268065 INFO     INSTRUCTION TEMPLATE: Alpaca                                                  
20:13:22-268644 INFO     Loaded the model in 296.36 seconds.     

Downloading a bigger quant right now to test it out. By the way, I'm noticing the --numa option; does it help with performance too?
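A hedged note on --numa, since it comes up here: the flag enables optimizations aimed at NUMA systems (typically multi-socket machines), so on a single-socket rig it is unlikely to change much, and depending on the llama.cpp build it is either a bare flag or takes a mode argument. A sketch combining it with --no-mmap, with the same placeholder paths as above:

./main -m models/codellama-70b-python.Q4_K_M.gguf -ngl 81 --no-mmap --numa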

userbox020 commented 5 months ago

Well, going to close the issue, but I would like to keep chatting with you guys. Do you have a Discord? I'm doing some tests and trying to enable Vulkan and Kompute for AMD. @slaren