johnburkey opened 1 year ago
It's untested; I'll see to it. Most likely not a big thing.
Thanks. I got around the alignment assert, but the restored context looked off, happily returning an unrelated answer. I'm guessing you know how to debug that 1000x faster than me.
For the alignment, I just added aligned allocs. I also noticed the size function was returning the size of an element instead of the size of the buffer, so I just set it to the allocated size.
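For illustration, the shape of the fix described above might look like the following. This is only a hedged sketch, not the actual patch: `aligned_buf`, `BUF_ALIGN`, and the accessor names are all hypothetical, and ggml's real buffer handling differs.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

// Hypothetical sketch of the fix described above: allocate the buffer with
// explicit alignment, and have the size accessor report the full allocated
// buffer size rather than the size of a single element.
#define BUF_ALIGN 32  // assumed alignment requirement, not ggml's actual constant

typedef struct {
    void  *data;
    size_t size;  // full allocated size in bytes
} aligned_buf;

static int aligned_buf_init(aligned_buf *buf, size_t n_elems, size_t elem_size) {
    // aligned_alloc (C11) requires the size to be a multiple of the alignment
    size_t size = n_elems * elem_size;
    size = (size + BUF_ALIGN - 1) / BUF_ALIGN * BUF_ALIGN;
    buf->data = aligned_alloc(BUF_ALIGN, size);
    if (!buf->data) return -1;
    buf->size = size;  // the reported bug was storing elem_size here instead
    return 0;
}

static size_t aligned_buf_size(const aligned_buf *buf) {
    return buf->size;  // size of the whole buffer, not sizeof(element)
}
```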
Let me know if you need anything from me.
And I have some statistics buddies if you are still diving into the k clusters; who knows, they may have cool ideas.
Feel free to push a PR with your fix :)
Regarding work on k-quants: those are always very time-intensive, as they need to work with the CUDA implementations (optimally also with Metal and OpenCL), and need to be integrated into a ton of code and maintained. So it's an area where any change can cause a chain of changes. Involvement is always welcome; there is a lot that can be improved and optimized.
I'll see what I can do about the alignment fix, but hiding behind that, the context didn't seem to be restored correctly (it replied to a different question than asked). I'll check it again and report back.
Yeah, there seem to be enough bugs in there that I don't think I should check in. It's crashing with a buffer overrun in ggml_compute_forward_dup_f32 during context saving for this model: falcon7b-instruct.ggmlv3.q4_0.bin (from https://huggingface.co/TheBloke/falcon-7b-instruct-GGML).
If you get that context saving working, I'll beat on it a bit; I think it's interesting stuff.
I've never seen that happen (in thousands of tests). The v3 version is quite outdated but should work. Can you provide the full command line and output log of it happening? Also make sure you have the latest version; without tokenizer.json, that v3 should not start.
OK, pulled fresh, same issues. I'm thinking maybe it's because I'm building on the Mac without DLLAMA_CUBLAS, since the Mac has its own BLAS (Accelerate). It builds fine, but I'm guessing it's executing a few of the ifdefs flipped different ways? Is there any way for you to run that way (without CUBLAS) and see if you get the crash? Or would you like me to continue investigating in some way?
I ran it with these args: `--prompt-cache promptCache -t 8 -ngl 100 -b 1 -m ..../falcon7b-instruct.ggmlv3.q4_0.bin -p "What is a falcon?\n### Response:"`
It crashes immediately, typically in that same routine (ggml_compute_forward_dup).
I am getting good performance on the M1 Max, by the way; I just want to use the prompt cache :-)
I am currently focused on another area I need to finish before I can look at sessions and the prompt cache. If it only crashes with the prompt cache, then it's that feature bugging out.
I could not test it on Mac, but I have an idea what might play into it. The "dup" is used internally by most ggml functions, so without a debug backtrace we don't know where it crashed (gdb --args ..., then r, then bt).
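Expanding that parenthetical gdb hint into a concrete session, reusing the command line from the earlier comment. The binary name here is a placeholder; substitute your actual build output, and note the model path is kept truncated exactly as posted above:

```shell
# Build with debug symbols first (exact make/cmake flags depend on your setup),
# then run the crashing command under gdb. "./falcon_main" is a placeholder.
gdb --args ./falcon_main \
    --prompt-cache promptCache -t 8 -ngl 100 -b 1 \
    -m ..../falcon7b-instruct.ggmlv3.q4_0.bin \
    -p "What is a falcon?\n### Response:"
# Inside gdb:
#   (gdb) r    # run until the crash
#   (gdb) bt   # print the backtrace at the crash site
```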
We use a modified form of multiplication broadcasting, and this might play into it if the Metal kernel is used for the multiplication (no broadcasting support). Normally it should assert on a wrong tensor shape, though. As a very quick test, you can go to libfalcon.cpp and look for `use_broadcasting = true;`. Change that to `false`, which would change the tensors to avoid that feature. The real solution is to update the Mac kernels or find a way to make them push that job to the normal CPU.
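For intuition, the shape assert mentioned above boils down to a compatibility check between the two operands' dimensions under repeat-style broadcasting. This is only a toy sketch of that idea under that assumption, not ggml's actual implementation; `can_broadcast` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

// Toy sketch (not ggml's real code): with repeat-style broadcasting, each
// dimension of src1 must divide the corresponding dimension of src0 evenly,
// so src1 can be repeated across src0 without a partial tile.
static int can_broadcast(const int64_t ne0[4], const int64_t ne1[4]) {
    for (int i = 0; i < 4; i++) {
        if (ne1[i] <= 0) return 0;           // degenerate shape
        if (ne0[i] % ne1[i] != 0) return 0;  // cannot repeat evenly
    }
    return 1;
}
```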
From ggml_compute_forward_dup_f32():

```c
if (src0->type == dst->type &&
    ne00 == ne0 &&
    nb00 == GGML_TYPE_SIZE[src0->type] && nb0 == GGML_TYPE_SIZE[dst->type]) {
    // copy by rows
    const size_t rs = ne00*nb00;
    for (int64_t i03 = 0; i03 < ne03; i03++) {
        for (int64_t i02 = 0; i02 < ne02; i02++) {
            for (int64_t i01 = ir0; i01 < ir1; i01++) {
                // it's this memcpy; it's the src ptr that's bad
                // (or in any case not big enough for the copy)
                memcpy(
                    ((char *) dst->data + i01*nb1 + i02*nb2 + i03*nb3),
                    ((char *) src0->data + i01*nb01 + i02*nb02 + i03*nb03),
                    rs);
            }
        }
    }
    return;
}
```
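To make the failure mode concrete: the loop bounds above come entirely from the source tensor's metadata, so if the recorded extents overstate the real allocation, the last memcpy reads past the end of src0->data. The following is a minimal toy model of that arithmetic (`toy_tensor` and the helper names are hypothetical, not ggml structs), covering the full-row case where all rows are copied:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Toy model of the tensor metadata the copy-by-rows loop relies on.
typedef struct {
    int64_t ne[4];      // number of elements per dimension
    size_t  nb[4];      // stride in bytes per dimension
    size_t  alloc_size; // bytes actually allocated for the data buffer
} toy_tensor;

// Highest byte the loop will read from the source: the offset of the last
// row plus one row size (rs = ne[0] * nb[0], as in the snippet above).
static size_t rows_copy_extent(const toy_tensor *t) {
    size_t rs = (size_t)t->ne[0] * t->nb[0];
    size_t max_off = (size_t)(t->ne[1] - 1) * t->nb[1]
                   + (size_t)(t->ne[2] - 1) * t->nb[2]
                   + (size_t)(t->ne[3] - 1) * t->nb[3];
    return max_off + rs;
}

// Returns 1 if misreported metadata would make the loop read out of bounds.
static int would_overrun(const toy_tensor *t) {
    return rows_copy_extent(t) > t->alloc_size;
}
```

With consistent metadata the extent equals the allocation and nothing overruns; bump one dimension count without growing the buffer and `would_overrun` trips, which is the kind of mismatch being debugged here.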
`use_broadcasting = false` doesn't change the behavior. And it's crashing on save.
I'll take a deeper look at it. From first glance: it cannot work at the moment; it's not made for multiquery attention. The function converts the kv cache to 3D and stores it that way, which I don't understand given that the kv cache is internally a 1D tensor.
Just for my understanding, as I've not used the sessions before: what is the main purpose for you? Are you processing a long input and wishing to work with it across multiple "post-prompts"?
You know, the fact that the parameters for the loop come from the source implies that the src tensor might just be misconfigured, and that it's not a type mismatch but the tensor misreporting during the llama_save_session_file call of llama_copy_state_data. That is, the source tensor says it has a different dimension than it actually does, so the loop overruns its pull of data to pump to the dest.
No, such problems always look like that. It's a heap buffer overflow which destroys the tensor content, so the assertion breaks given the tensor looks stupid now.
I've just pushed a bugfix which appears to work. I tested it with a 1 kB file and it loaded and stored correctly. The session functionality is not fully supported yet with the new features, so using "-enc" or the system prompt will break it.
Sweet, will test.
-enc and -sys should also work with it by now
Works so far. Testing -enc/-sys: I don't know what they do, so I'll read about that first :-)