cmp-nct / ggllm.cpp

Falcon LLM ggml framework with CPU and GPU support

Does the prompt cache work? I got an alignment assert when I turned it on. #42

Open johnburkey opened 1 year ago

johnburkey commented 1 year ago


cmp-nct commented 1 year ago

It's untested, I'll see to it. Most likely not a big thing.

johnburkey commented 1 year ago

Thanks, I got around the alignment assert, but the restored context looked off - happily returning an unrelated answer - guessing you know how to debug that 1000x faster than me.

For the alignment I just added aligned allocs, and I noticed the size function was returning the size of an element instead of the size of the buffer, so I just set it to the allocated size.
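
Roughly, the kind of change I made looks like the sketch below (names here are made up for illustration; the actual ggml/ggllm buffer code is different):

    // Hypothetical sketch of the alignment fix described above:
    // allocate the state buffer with an explicit alignment and report the
    // full buffer size rather than the size of a single element.
    #include <stdlib.h>

    #define STATE_ALIGN 32   /* assumed alignment demanded by the assert */

    typedef struct {
        void   *data;
        size_t  size;        /* size of the whole buffer, not of one element */
    } state_buf;

    static int state_buf_init(state_buf *buf, size_t n_elems, size_t elem_size) {
        size_t total = n_elems * elem_size;
        /* aligned_alloc (C11) requires the size to be a multiple of the alignment */
        total = ((total + STATE_ALIGN - 1) / STATE_ALIGN) * STATE_ALIGN;
        buf->data = aligned_alloc(STATE_ALIGN, total);
        if (buf->data == NULL) return -1;
        buf->size = total;   /* previously: the element size, which broke the size check */
        return 0;
    }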

Let me know if you need anything from me.

johnburkey commented 1 year ago

And I have some statistics buddies if you are still diving into k-quants - who knows, they may have cool ideas.

cmp-nct commented 1 year ago

Feel free to push a PR with your fix :)

Regarding work on k-quants: those are always very time-intensive as they need to work with the CUDA implementations (optimally also with Metal and OpenCL), need to be integrated into a ton of code, and have to be maintained. So it's an area where any change can cause a chain of changes. Involvement is always welcome; there is a lot that can be improved and optimized.

johnburkey commented 1 year ago

I'll see what I can do about the alignment fix, but hiding behind that was that the context didn't seem to be restored correctly (it replied to a different question than was asked). I'll check it again and report back.

johnburkey commented 1 year ago

Yeah, there seem to be enough bugs in there that I don't think I should check in. It's crashing with a buffer overrun in ggml_compute_forward_dup_f32 during context saving for this model: falcon7b-instruct.ggmlv3.q4_0.bin (from https://huggingface.co/TheBloke/falcon-7b-instruct-GGML).

johnburkey commented 1 year ago

If you get that context saving working I'll beat on it a bit - I think it's interesting stuff.

cmp-nct commented 1 year ago

> Yeah, there seem to be enough bugs in there that I don't think I should check in. It's crashing with a buffer overrun in ggml_compute_forward_dup_f32 during context saving for this model: falcon7b-instruct.ggmlv3.q4_0.bin (from https://huggingface.co/TheBloke/falcon-7b-instruct-GGML).

I've never seen that happening (in thousands of tests). The v3 version is quite outdated but should work. Can you provide the full command line and the output log of it happening? Also ensure you have the latest version; without tokenizer.json that v3 should not start.

johnburkey commented 1 year ago

OK, pulled fresh, same issues - I'm thinking maybe it's because I'm building on the Mac without -DLLAMA_CUBLAS, because the Mac has its own BLAS (Accelerate). It builds fine, but I'm guessing it's executing a few of the ifdefs flipped different ways? Is there any way for you to run that way (without cuBLAS) and see if you get the crash? Or would you like me to continue investigating in some way?

I ran it with these args --prompt-cache promptCache -t 8 -ngl 100 -b 1 -m ..../falcon7b-instruct.ggmlv3.q4_0.bin -p "What is a falcon?\n### Response:"

It crashes immediately, typically in that same routine (ggml_compute_forward_dup).

johnburkey commented 1 year ago

I am getting good performance on M1 Max, by the way - I just want to use the prompt cache :-)

cmp-nct commented 1 year ago

I am currently focused on another area I need to finish before I can look at sessions and the prompt cache. If it only crashes with the prompt cache then it's that feature bugging out.

I could not test it on Mac, but I have an idea what might play into it. The "dup" is used internally in most ggml functions, so without a debug backtrace we don't know where it crashed (gdb --args ..., then r, then bt).

We use a modified form of multiplication broadcasting, and this might play into it if the Metal kernel is used for the multiplication (no broadcasting support). Normally it should assert on a wrong tensor shape, though. As a very quick test you can go to libfalcon.cpp, look for "use_broadcasting = true;" and change that to "false", which changes the tensors to avoid that feature. The real solution is to update the Mac kernels or find a way to make them push that job to the normal CPU.
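
To illustrate what broadcasting means here, a simplified CPU-side sketch (not the actual ggml kernel) of a multiplication where one operand is a single row that gets reused for every row of the other operand:

    // Simplified illustration of broadcast multiplication: b has shape [ne0, 1]
    // and is applied to every one of the ne1 rows of the [ne0, ne1] matrix a
    // without being physically repeated. A kernel without broadcasting support
    // expects both operands to have identical shapes instead.
    #include <stdint.h>

    static void mul_broadcast_rows(float *dst, const float *a, const float *b,
                                   int64_t ne0, int64_t ne1) {
        for (int64_t i1 = 0; i1 < ne1; i1++) {
            for (int64_t i0 = 0; i0 < ne0; i0++) {
                dst[i1*ne0 + i0] = a[i1*ne0 + i0] * b[i0]; // b reused for every row
            }
        }
    }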

johnburkey commented 1 year ago

from ggml_compute_forward_dup_f32()

    if (src0->type == dst->type &&
        ne00 == ne0 &&
        nb00 == GGML_TYPE_SIZE[src0->type] &&
        nb0  == GGML_TYPE_SIZE[dst->type]) {
        // copy by rows
        const size_t rs = ne00*nb00;
        for (int64_t i03 = 0; i03 < ne03; i03++) {
            for (int64_t i02 = 0; i02 < ne02; i02++) {
                for (int64_t i01 = ir0; i01 < ir1; i01++) {
                    // it's this memcpy - it's the src ptr that's bad
                    // (or in any case not big enough for the copy)
                    memcpy(
                        ((char *)  dst->data + i01*nb1  + i02*nb2  + i03*nb3),
                        ((char *) src0->data + i01*nb01 + i02*nb02 + i03*nb03),
                        rs);
                }
            }
        }
        return;
    }
johnburkey commented 1 year ago

use_broadcasting = false doesn't change the behavior. And it's crashing on save.

cmp-nct commented 1 year ago

I'll take a deeper look at it. From first glance: it cannot work at the moment, it's not made for multi-query attention. The function converts the kv cache to 3D and stores it that way, which I don't understand given that the kv cache is internally a 1D tensor.
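
Very roughly, the kind of reshaping being described is something like the sketch below (only to illustrate the 1D-vs-3D mismatch; the exact shape, strides and names in the real copy-state code differ, and all parameters here are placeholders):

    // Rough illustration only - not the real llama_copy_state_data code.
    // The kv cache lives as one flat 1D tensor; the save path reinterprets it
    // as a 3D view whose shape bakes in per-head layout assumptions, which is
    // what goes wrong for multi-query attention.
    #include "ggml.h"

    static struct ggml_tensor * kv_as_3d_view(
            struct ggml_context * ctx,
            struct ggml_tensor  * kv_1d,     // flat kv cache tensor
            int64_t head_dim, int64_t n_head_kv, int64_t n_tokens) {
        const size_t es = ggml_element_size(kv_1d);
        return ggml_view_3d(ctx, kv_1d,
                            head_dim, n_head_kv, n_tokens,
                            head_dim * es,               // nb1
                            head_dim * n_head_kv * es,   // nb2
                            0);                          // offset
    }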

Just for my understanding, as I've not used the sessions before: what is the main purpose for you? Are you processing a long input and wish to work with it using multiple "post-prompts"?

johnburkey commented 1 year ago

You know, the fact that the parameters for the loop come from the source implies that the src tensor might just be misconfigured, and that it's not a type mismatch but just the tensor misreporting during the llama_save_session_file call of llama_copy_state_data (i.e. the overrun is the source tensor claiming a different dim than it actually has, so the loop overruns its pull of data to pump to the dest).
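
If it helps, one way to probe that hypothesis (a debugging sketch, meant to be dropped into the loop from ggml_compute_forward_dup_f32 quoted above, where rs, i01, i02, i03 and the nb* strides already exist) would be to check the source extent against the tensor's reported size before each memcpy:

    // Debugging sketch: flag any row copy that would read past the number of
    // bytes the source tensor claims to own. ggml_nbytes() is the existing
    // ggml helper returning a tensor's reported byte size; if the tensor is
    // misreporting its dims, comparing this output against what ASan or a
    // guard page reports for the real allocation should expose the mismatch.
    const size_t src_off = (size_t)(i01*nb01 + i02*nb02 + i03*nb03);
    if (src_off + rs > ggml_nbytes(src0)) {
        fprintf(stderr, "dup_f32 overrun: off=%zu rs=%zu nbytes=%zu\n",
                src_off, rs, ggml_nbytes(src0));
    }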

cmp-nct commented 1 year ago

No, such problems always look like that. It's a heap buffer overflow which destroys the tensor content, so the assertion breaks given the tensor looks stupid now.

I've just pushed a bugfix which appears to work. I tested it with a 1 kB file and it loaded and stored correctly. The session functionality is not fully supported yet with the new features, so using "-enc" or the system prompt will break it.

johnburkey commented 1 year ago

Sweet, will test.

cmp-nct commented 1 year ago

-enc and -sys should also work with it by now

johnburkey commented 1 year ago

Works so far. Testing enc/sys - I don't know what they do, so will read about that first :-)