evshiron / rocm_lab

DEPRECATED!
https://are-we-gfx1100-yet.github.io

Error running ghcr.io/evshiron/rocm_lab:rocm5.5-text-gen-webui 7dea7110f293 #13

Closed briansp2020 closed 1 year ago

briansp2020 commented 1 year ago

starlette.websockets.WebSocketDisconnect: 1001
INFO:Loading TheBloke_Llama-2-13B-chat-GGML...
INFO:llama.cpp weights detected: models/TheBloke_Llama-2-13B-chat-GGML/llama-2-13b-chat.ggmlv3.q6_K.bin
INFO:Cache capacity is 0 bytes
llama.cpp: loading model from models/TheBloke_Llama-2-13B-chat-GGML/llama-2-13b-chat.ggmlv3.q6_K.bin
error loading model: unrecognized tensor type 14
llama_init_from_file: failed to load model
Exception ignored in: <function LlamaCppModel.__del__ at 0x7fdeac07f910>
Traceback (most recent call last):
  File "/root/text-generation-webui/modules/llamacpp_model.py", line 23, in __del__
    self.model.__del__()
AttributeError: 'LlamaCppModel' object has no attribute 'model'

briansp2020 commented 1 year ago

Oops, I posted before adding the details. I'm trying to run the text generation stuff using the Docker container and am getting the error shown in the first message. My hardware is a 7900 XTX and a Ryzen 3950X, with the ROCm 5.6.1 kernel driver. I think this is purely a software setup issue though.

briansp2020 commented 1 year ago

I built a new Docker container using dockerfile/rocm5.5-text-gen-webui and, after a minor modification, it produced a container that works. The modification I had to make was to add an 'apt update' command before activating the venv and running build_text-gen-webui.sh.
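Roughly, the order of operations after the fix looks like this (a sketch of the sequence only, not the exact Dockerfile contents):

apt update                    # added: refresh package lists so later installs succeed
source venv/bin/activate      # then activate the venv as before
./build_text-gen-webui.sh     # and run the build script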

briansp2020 commented 1 year ago

Compared to the CUDA version, the ROCm version does not show how many layers are offloaded to the GPU (i.e. n-gpu-layers does not seem to do anything). Does this version always use the GPU?
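For reference, the setting I mean corresponds to the webui's --n-gpu-layers option, e.g. a launch along these lines (hypothetical invocation, flags may differ from my actual setup):

python server.py --model llama-2-13b-chat.ggmlv3.q5_K_M.bin --n-gpu-layers 40   # hypothetical: asks llama.cpp to offload 40 layers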

ROCm output

llama.cpp: loading model from models/llama-2-13b-chat.ggmlv3.q5_K_M.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_head_kv = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 17 (mostly Q5_K - Medium)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.11 MB
llama_model_load_internal: mem required = 8801.74 MB (+ 1600.00 MB per state)
llama_new_context_with_model: kv self size = 1600.00 MB
llama_new_context_with_model: compute buffer total size = 191.35 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
2023-09-02 13:29:39 INFO:Loaded the model in 0.28 seconds.

CUDA output

ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6
2023-09-02 16:45:57 INFO:llama.cpp weights detected: models/llama-2-13b-chat.ggmlv3.q5_K_M.bin
2023-09-02 16:45:57 INFO:Cache capacity is 0 bytes
llama.cpp: loading model from models/llama-2-13b-chat.ggmlv3.q5_K_M.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_head_kv = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 17 (mostly Q5_K - Medium)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.11 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 9295.74 MB (+ 1600.00 MB per state)
llama_model_load_internal: offloading 0 repeating layers to GPU
llama_model_load_internal: offloaded 0/43 layers to GPU
llama_model_load_internal: total VRAM used: 480 MB

I was so happy to see the ROCm version running so much faster, but it turned out I was not offloading anything to the GPU in my NVidia setup. :(

I know these aren't rocm_lab-specific questions, so if anyone can give me a pointer to a forum where I can ask general text-generation-webui questions, I'd appreciate it. Thank you!

evshiron commented 1 year ago

Greetings. rocm_lab:rocm5.5-text-gen-webui was built for PoC only.

Here is a tutorial which was updated recently:

It seems that ROCm support for llama.cpp was merged last week, but text-generation-webui uses its Python binding, which I haven't tested yet.

The tutorial covers HuggingFace and GPTQ usage. GPTQ is a decent quantization solution for LLMs, and you can obtain many quantized models from:

If you definitely want to use GGML at the moment, you might have to do it yourself, as I am too busy to update the tutorial right now.
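If you do try it, the rough idea is to rebuild the llama.cpp Python binding against hipBLAS before launching the webui. A sketch (untested here; assumes llama-cpp-python's CMAKE_ARGS/FORCE_CMAKE build switches apply to the version the webui pins):

CMAKE_ARGS="-DLLAMA_HIPBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --no-cache-dir   # rebuild the binding with hipBLAS enabled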

briansp2020 commented 1 year ago

I'm new to generative AI stuff and my ultimate goal is to find out how AMD compares to NVidia in all things AI. So, I'm willing to try anything. Thanks for the pointer. I'll go through your tutorial. Hopefully, I can get up to speed fast and send you a pull request when I update your stuff with ROCm 5.6.1!

evshiron commented 1 year ago

@briansp2020

Thank you for your kindness. As far as I know, ROCm 5.6.1 is a minor release with some bug fixes, and the articles written for ROCm 5.6 should work too. As a result, those articles haven't been updated for ROCm 5.6.1.

For the best LLM performance on AMD GPUs, here is an unconventional solution:

It achieves about 80% of the performance of an RTX 4090 on an RX 7900 XTX. The potential of Navi 3x is there, but most HIP code ported from CUDA cannot fully unleash the performance of AMD GPUs.

grigio commented 1 year ago

@evshiron what performance do you get in tokens/s with a 70B q4 model?

evshiron commented 1 year ago

@grigio

A 70B q4 model should be around 35GB in size. I haven't tried it as I only have one RX 7900 XTX.

But if you're referring to offloading to GPU using llama.cpp, I am also interested in it. I might try it out later.

UPDATE:

Build:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make LLAMA_HIPBLAS=1

Command:

./main -t 8 -m llama-2-70b-chat.q4_K_M.gguf --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "[INST] <<SYS>>\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\n<</SYS>>\nWrite a story about llamas[/INST]"
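The "with -ngl" numbers below use the same invocation with only the offload flag added (sketch; the long prompt is elided here):

./main -t 8 -ngl 47 -m llama-2-70b-chat.q4_K_M.gguf --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "[INST] ... [/INST]"   # -ngl 47 offloads 47 layers to the GPU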

Hardware: AMD 7800X3D + AMD RX 7900 XTX

Model: https://huggingface.co/TheBloke/Llama-2-70B-chat-GGUF/blob/main/llama-2-70b-chat.Q5_K_M.gguf

Without -ngl (1.41 tokens/s):

Once upon a time, in the rolling hills of South America, there lived a group of magnificent creatures known as llamas. These animals were unlike any others, with their soft, woolly coats and graceful necks. They roamed the grassy plains, munching on tender shoots and enjoying the warm sunshine.

One llama in particular, named Luna, was known for her curious nature and adventurous spirit. She loved to explore the surrounding hills, discovering hidden streams and secret meadows. Her best friend, a loyal llama named Sam, always accompanied her on these exciting journeys.

One day, Luna and Sam decided to venture further than they ever had before. They climbed to the top of a tall hill, where they found a breathtaking view of the surrounding landscape. In the distance, they spotted a shimmering lake, its surface reflecting the bright blue sky.

Without hesitation, Luna and Sam began their descent down the hill, their hooves carefully navigating the rocky terrain. As they approached the lake, they noticed a group of animals gathered at its edge. There were birds with vibrant feathers, frolicking in the shallow water, and a family of deer, drinking from the lake's clear waters.

Luna and Sam joined the group, and soon found themselves surrounded by new friends. They spent the day playing and swimming, enjoying the simple pleasures of life. As the sun began to set, they bid farewell to their new companions and began their journey back home.

From that day on, Luna and Sam became known as the most adventurous llamas in the land. They continued to explore the surrounding hills and valleys, always discovering new wonders and making new friends along the way. And they lived happily ever after, with the beautiful memories of their exciting adventures forever etched in their hearts.

The end. [end of text]

llama_print_timings:        load time =  1348.01 ms
llama_print_timings:      sample time =   167.85 ms /   430 runs   (    0.39 ms per token,  2561.73 tokens per second)
llama_print_timings: prompt eval time =  8633.85 ms /   143 tokens (   60.38 ms per token,    16.56 tokens per second)
llama_print_timings:        eval time = 303371.18 ms /   429 runs   (  707.16 ms per token,     1.41 tokens per second)
llama_print_timings:       total time = 312377.39 ms
Log end

With -ngl 47 (2.90 tokens/s):

Once upon a time, in the Andes mountains of South America, there lived a group of elegant and graceful creatures known as llamas. These animals were known for their soft, woolly coats and their ability to carry heavy loads across the rugged terrain.

One llama in particular, named Luna, was very curious and adventurous. She loved to explore the mountains and valleys, discovering new sights and sounds along the way. One day, while wandering through a dense forest, Luna stumbled upon a hidden clearing filled with wildflowers. She had never seen such a beautiful sight before, and she couldn't resist taking a closer look.

As she wandered through the clearing, Luna noticed that the flowers were all different colors and shapes. Some were bright red, while others were soft pink or delicate purple. Some had petals that were shaped like bells, while others had petals that looked like tiny stars. Luna was fascinated by the diversity of the flowers and spent hours admiring their beauty.

As the sun began to set, Luna realized that she needed to return to her herd. She said goodbye to the wildflowers and promised to visit them again soon. From that day on, Luna made it a point to visit the hidden clearing whenever she could, always discovering new and exciting things to see and learn.

Luna's love for exploration and discovery inspired the other llamas in her herd to do the same. Together, they explored the mountains and valleys, discovering new sights and sounds that they had never experienced before. They learned about the different types of plants and animals that lived in their environment, and they developed a deep appreciation for the beauty and diversity of the natural world.

Years went by, and Luna grew old, but she never lost her love for adventure and discovery. She passed on her curiosity and enthusiasm to her children and grandchildren, who continued to explore and learn about the world around them. And every time they visited the hidden clearing of wildflowers, they remembered the stories that Luna had shared with them, and they felt grateful for the lessons she had taught them about the importance of exploration and appreciation for the natural world.

The story of Luna and her love for exploration was passed down through generations of llamas, inspiring them to always seek out new knowledge and experiences. And so, the legacy of Luna's curiosity and adventurous spirit lived on, reminding all llamas to never stop exploring and learning about the beautiful world around them. [end of text]

llama_print_timings:        load time =  2933.83 ms
llama_print_timings:      sample time =   226.26 ms /   569 runs   (    0.40 ms per token,  2514.77 tokens per second)
llama_print_timings: prompt eval time =  7362.43 ms /   143 tokens (   51.49 ms per token,    19.42 tokens per second)
llama_print_timings:        eval time = 195579.07 ms /   568 runs   (  344.33 ms per token,     2.90 tokens per second)
llama_print_timings:       total time = 203455.70 ms
Log end

For comparison, with a 7B q4 model, the performance with and without -ngl is 87.60 tokens/s and 11.86 tokens/s respectively.

briansp2020 commented 1 year ago

@evshiron Could you help me get this working on my setup? I thought things were working, but realized that inference was actually happening on my CPU. I have not been able to get PyTorch running properly on my GPU at all.

So, I'm now starting over again. My setup is an Ubuntu 22.04 VM running on a Proxmox server with the 7900XTX in PCIe passthrough mode. Since I was able to run TensorFlow properly, I'm pretty sure the hardware and host OS are set up properly.

I did all my testing using various Docker containers, and TF seems to run correctly. PyTorch, however, is giving me a lot of trouble.

So, to start from scratch, I ran your Docker image using the following commands

alias drun='docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $(pwd):/pwd'
drun --name torch55 ghcr.io/evshiron/rocm_lab:rocm5.5-ub22.04-torch2.0.1

Inside the docker, I did

source venv/bin/activate
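A quick way to sanity-check that PyTorch actually sees the GPU, before running anything heavier, is something like this (standard PyTorch attributes; the expected output is my assumption):

python -c "import torch; print(torch.cuda.is_available(), torch.version.hip)"   # expect True and a HIP version string on a working ROCm install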

Then, I ran mnist.py from https://gist.github.com/AlkindiX/9c54d1155ba72415f3b585e26c9df6b3 and got this result https://gist.github.com/briansp2020/bbde07808cc360992721ccc16692047a

Train Epoch: 14 [58240/60000 (97%)] Loss: 0.001000
Train Epoch: 14 [58880/60000 (98%)] Loss: 0.455449
Train Epoch: 14 [59520/60000 (99%)] Loss: 0.957592

Test set: Average loss: 0.0005, Accuracy: 10/10000 (0%)

Not sure what is wrong.

evshiron commented 1 year ago

@briansp2020

Please don't use the Docker images in this repo. The PyTorch they include is outdated and missing functionality.

Set up a new venv, install PyTorch using the command at the very beginning of the repo's README, then try again.

You may just follow the instructions in the Gist you linked.
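The setup is roughly as follows (a sketch only; the wheel index URL here is an assumption on my part, so prefer the exact command from the README):

python -m venv venv && source venv/bin/activate
pip install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/rocm5.6   # assumed ROCm nightly index; the README command is authoritative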

briansp2020 commented 1 year ago

I moved my 7900XTX to my main machine and PyTorch now works. I don't know whether the problem was running it in a VM under Proxmox or the different motherboard. I'm now running on bare-metal Ubuntu 22.04. It's weird, since TensorFlow seemed fine in that setup.

Anyway, thanks for your help. I'll try the text gen stuff and report soon.

evshiron commented 1 year ago

As far as I know, ROCm requires PCIe Atomics, so I suspect Proxmox's passthrough doesn't support that or requires additional configuration. To be honest, I haven't tried GPU virtualization, nor have I attempted to run ROCm in a virtualized environment.
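If you want to verify, lspci can report the AtomicOps capability of the passed-through device from inside the VM (a sketch; the exact field names depend on the lspci version):

sudo lspci -vvv -d 1002: | grep -i atomic   # look for AtomicOpsCap / AtomicOpsCtl on the GPU's PCIe capabilities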

briansp2020 commented 1 year ago

I'm closing this since I'm no longer working with this Docker image and have used your guide to run models.