**Open** · balisujohn opened this issue 4 months ago
The max number of blocks in the y or z dimensions is 65535 (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities). In llama.cpp this dimension typically represents the batch size, so it is always much smaller than that. It would be good to handle this case properly, either by changing the kernel or by launching multiple kernels.
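For reference, one common pattern for the "change the kernel" option is a grid-stride loop over the oversized dimension, so the launch never needs more than 65535 blocks in y. The sketch below is illustrative only, not ggml's actual kernel: the name `get_rows_sketch`, the flat row/column layout, and the launch wrapper are all assumptions.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Illustrative sketch, not ggml's real kernel. Each block handles one row per
// loop iteration; stepping by gridDim.y lets a launch capped at 65535 blocks
// in y still cover an arbitrary number of rows.
__global__ void get_rows_sketch(const float * src, const int32_t * rows,
                                float * dst, int64_t nrows, int64_t ncols) {
    for (int64_t row = blockIdx.y; row < nrows; row += gridDim.y) {
        const int64_t col = blockIdx.x * (int64_t) blockDim.x + threadIdx.x;
        if (col < ncols) {
            dst[row * ncols + col] = src[(int64_t) rows[row] * ncols + col];
        }
    }
}

// Host side: clamp the y dimension of the grid to the hardware limit.
static void launch_get_rows(const float * src, const int32_t * rows,
                            float * dst, int64_t nrows, int64_t ncols) {
    const int block = 256;
    dim3 grid((unsigned) ((ncols + block - 1) / block),
              (unsigned) (nrows < 65535 ? nrows : 65535));
    get_rows_sketch<<<grid, block>>>(src, rows, dst, nrows, ncols);
}
```

The other option mentioned above, launching the kernel several times over row chunks of at most 65535, avoids touching the kernel itself at the cost of extra launch overhead.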
I found and isolated an error in `ggml_get_rows` with the CUDA backend: for tensors whose first dimension is greater than `65535`, the program fails at runtime. I expect it's related to this: https://stackoverflow.com/questions/12078080/max-number-of-threads-which-can-be-initiated-in-a-single-cuda-kernel
This doesn't seem to happen for similarly large tensors with the CPU backend.
I provided a reference repo, a fresh fork of ggml with the error reproducibly and minimally demonstrated in `simple-backend.cpp`: https://github.com/balisujohn/ggml-get-rows-error

At a minimum, I think there should be an explicit guardrail (an error reporting that the tensor exceeds the allowed dimensions for this operation on this backend), but it would be nice if this operation could be extended to handle tensors that exceed `65535`, because otherwise tortoise.cpp will need to decompose these calls into something with a lot of slicing and concatenating, which will probably be less efficient. I'm not against trying to make this change myself, but I want to hear others' thoughts before spending time on this.
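To make the guardrail idea concrete, here is a hedged sketch of what I mean; the function name, the check site, and the error text are all assumptions, since ggml's actual CUDA launch code differs.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// CUDA caps gridDim.y and gridDim.z at 65535.
static const int64_t MAX_GRID_DIM_YZ = 65535;

// Hypothetical check the CUDA backend could run before launching the
// get_rows kernel, so oversized tensors fail loudly with a clear message
// instead of silently launching an invalid grid.
static void assert_get_rows_launchable(int64_t nrows) {
    if (nrows > MAX_GRID_DIM_YZ) {
        fprintf(stderr,
                "ggml_get_rows (CUDA): %lld rows exceeds the max grid "
                "dimension %lld supported by this backend\n",
                (long long) nrows, (long long) MAX_GRID_DIM_YZ);
        abort();
    }
}
```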