abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

The model freezes and does not use my GPU #285

Closed LeLaboDuGame closed 1 year ago

LeLaboDuGame commented 1 year ago

Expected Behavior

Hello! I'm a bit young, so I don't speak English very well. I discovered LLaMA not long ago, and I was immediately interested in llama-cpp-python because it's a simple way for me to integrate LLaMA into my projects. So basically I have this code:

from llama_cpp import Llama

class IA:
    def __init__(self, model_path):
        self.llm = Llama(model_path=model_path, n_gpu_layers=128, n_ctx=4048, use_mlock=True)
        self.msgs = [{"role": "system", "content": "A dialogue between User and Assistant"}]
        print("model load !")

    # Helper (not relevant to the issue): loads the main prompt from a file.
    def get_main_prompt(self, path="./MainPrompt.nai"):
        with open(path, "r") as f:
            return f.read()

    def prompt(self, msg):
        print("Prompting...")
        self.msgs.append({"role": "user", "content": msg})
        prompt = self.llm.create_chat_completion(
            messages=self.msgs,
            stop=["User:", "Assistant:"], max_tokens=100)["choices"][0]["message"]
        self.msgs.append(prompt)
        print("Prompting finished !")
        self.llm.save_state()
        return prompt

ia = IA("D:\\ia\\ia\\ggml-model-q4_0_13b.bin")
for i in range(10):
    print(ia.prompt(input(">>> ")))

It is supposed to ask me for my message and display the model's response.

The output:

C:\Users\ad\PycharmProjects\NastorProject\venv\Scripts\python.exe C:\Users\ad\PycharmProjects\NastorProject\test.py 
llama.cpp: loading model from D:\ia\ia\ggml-model-q4_0_13b.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 4048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 6983.70 MB
ggml_opencl: selecting platform: 'AMD Accelerated Parallel Processing'
ggml_opencl: selecting device: 'gfx1032'
llama_model_load_internal: mem required  = 9031.70 MB (+ 1608.00 MB per state)
....................................................................................................
llama_init_from_file: kv self size  = 3162.50 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
model load !
>>> Hey how are you ?
Prompting...

llama_print_timings:        load time =  5512.88 ms
llama_print_timings:      sample time =     6.30 ms /    10 runs   (    0.63 ms per token)
llama_print_timings: prompt eval time =  5512.28 ms /    28 tokens (  196.87 ms per token)
llama_print_timings:        eval time =  2574.71 ms /     9 runs   (  286.08 ms per token)
llama_print_timings:       total time =  8479.96 ms
Prompting finished !
Llama.save_state: saving 548519980 bytes of llama state
{'role': 'assistant', 'content': ' The weather is good today.'}
>>>

The response is not great, but you get the idea.

But unfortunately I have a lot of lag! First of all, my RAM saturates quickly and the machine freezes, which is frustrating (I had to restart my PC, it's that buggy).

I use CLBlast for my AMD graphics card.

The problem is that I am using CLBlast, but my GPU stays at 0-3% usage... So I tried increasing the n_gpu_layers parameter, but it still doesn't work (I don't really know what that parameter does).
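For reference, a minimal sketch of how the offload parameter is normally passed, assuming a build of llama-cpp-python that was compiled with CLBlast and is recent enough to honor it (the ggml_opencl lines in the log above show CLBlast is compiled in); the path and numbers are placeholders, not a confirmed fix:

from llama_cpp import Llama

llm = Llama(
    model_path="D:\\ia\\ia\\ggml-model-q4_0_13b.bin",
    n_gpu_layers=32,   # layers to offload; only honored by GPU-enabled builds
    n_ctx=2048,        # a smaller context also shrinks the KV cache
    verbose=True,      # keep the load log so any offload messages are visible
)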

Current Behavior

AMD 6600 XT, Intel i5-10400F, 16 GB RAM @ 3200 MHz, Windows 11


Thanks in advance! PS: I'm using a 13B Q4_0 LLaMA model.

gjmulder commented 1 year ago

There are reports from upstream llama.cpp that CLBlast is not well supported.

You need 16 GB of RAM for a 13B model. Try just running the 7B model with OpenBLAS and see how that goes.
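For context, the load log above reports about 9 GB of memory required plus a roughly 3.1 GB KV cache at n_ctx=4048, which already crowds a 16 GB machine before the OS and other programs are counted. A rough sketch of a lighter-weight load, assuming a hypothetical 7B Q4_0 file:

from llama_cpp import Llama

llm = Llama(
    model_path="D:\\ia\\ia\\ggml-model-q4_0_7b.bin",  # hypothetical 7B model path
    n_ctx=2048,        # smaller context -> smaller KV cache
    use_mlock=False,   # don't pin the whole model into RAM
)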

TheTerrasque commented 1 year ago

There are reports from upstream llama.cpp that CLBlast is not well supported.

It should be well supported; a pretty good PR was merged around 6 days ago. Do you have any reference to this from after that PR?

On a side note, I'm having similar problems getting CLBlast GPU acceleration working with llama-cpp-python. CLBlast builds from the llama.cpp releases work as expected.

gjmulder commented 1 year ago

It should be well supported; a pretty good PR was merged around 6 days ago. Do you have any reference to this from after that PR?

You're right. As of over a week ago there were issues, but things seem a lot more stable since that PR.

LeLaboDuGame commented 1 year ago

Hey! I'm going to look into that.

LeLaboDuGame commented 1 year ago

So should I try OpenBLAS? And can OpenBLAS work with my AMD GPU?

gjmulder commented 1 year ago

OpenCL support for AMD GPUs seems to have been added to llama.cpp. The latest llama-cpp-python looks to include the version of llama.cpp that adds OpenCL support.

OpenBLAS is CPU only, sorry.
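If it helps, one quick way to check which llama-cpp-python release is actually installed (a minimal sketch using only the standard library):

from importlib.metadata import version

# Newer releases bundle the llama.cpp revision that added OpenCL/CLBlast layer offload.
print(version("llama-cpp-python"))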

LeLaboDuGame commented 1 year ago

OK! So I retried with CLBlast and it works! But I have some other problems that aren't related to the main topic...

I think I should open a new topic, but these are just some simple questions.

1. I see that the model stores the old conversation, because when I completely restart the program it gives me back old tokens. I want to reset the model and I don't know how to do it.
2. I see that the model doesn't recognize old tokens and has to re-evaluate everything between two chat_completion calls, so generation takes a long time and I want to reduce that. I saw that in koboldcpp the eval is done only once and afterwards the model remembers what was said.
3. How do I generate token by token and get them in a loop, to maybe print them progressively?

Thanks in advance !
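For what it's worth, here is a rough sketch touching all three questions above, assuming the llama-cpp-python API of that time (LlamaCache, Llama.set_cache, and stream=True on create_chat_completion); the model path is a placeholder and this is only an illustration, not a confirmed answer:

from llama_cpp import Llama, LlamaCache

llm = Llama(model_path="D:\\ia\\ia\\ggml-model-q4_0_13b.bin", n_ctx=2048)

# 2) Reuse evaluated tokens between calls via the prompt cache,
#    so the shared prefix is not re-evaluated on every completion.
llm.set_cache(LlamaCache())

# 1) A fresh conversation is just a fresh message list passed to the model.
messages = [{"role": "system", "content": "A dialogue between User and Assistant"}]

# 3) Token-by-token generation: stream=True yields chunks as they are produced.
messages.append({"role": "user", "content": "Hey, how are you?"})
for chunk in llm.create_chat_completion(messages=messages, stream=True, max_tokens=100):
    delta = chunk["choices"][0]["delta"]
    if "content" in delta:
        print(delta["content"], end="", flush=True)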

gjmulder commented 1 year ago

Yes, please open a new ticket describing what you expected to happen and what actually happened.

Copy and paste the text output and use GitHub markdown to make your examples easy to read.