vineel96 opened 5 days ago
I think this is because of the lack of support for BF16 ops on the GPU, as mentioned in https://github.com/ggerganov/llama.cpp/issues/9881#issuecomment-2414516092. Perhaps you can try an F16 model instead.
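For reference, one way to produce an F16 GGUF is llama.cpp's `convert_hf_to_gguf.py` script, assuming a local Hugging Face checkout of the model (all paths below are illustrative):

```bash
# Convert a local HF checkout to an F16 GGUF (paths are illustrative)
python convert_hf_to_gguf.py /path/to/falcon-mamba-7b \
    --outtype f16 --outfile falcon-mamba-7b-F16.gguf
```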
@danbev, even the FP16 Falcon Mamba model hits the same error. LLaMA-3 FP16 and BF16 do not give the above error when using the -ngl option.
This happens because the CUDA backend does not support the norm operation on non-contiguous tensors, and it does not report this correctly in its supports_op function. This should be fixed; however, the CUDA backend also does not support the Mamba-specific operations, so there will be no benefit to offloading Mamba models until those are implemented.
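For readers following along, here is a minimal sketch of the kind of guard being described. The function name `cuda_supports_op_sketch` is hypothetical; `ggml_is_contiguous` and the `GGML_OP_*` values come from ggml's public header, and the real check lives in the CUDA backend's supports_op callback:

```cpp
// Sketch only, not the actual patch: report norm ops as unsupported
// for non-contiguous inputs instead of letting the kernel fail.
#include "ggml.h"

static bool cuda_supports_op_sketch(const struct ggml_tensor * op) {
    switch (op->op) {
        case GGML_OP_NORM:
        case GGML_OP_RMS_NORM:
            // The CUDA norm kernels assume contiguous data, so only
            // claim support when the input tensor is contiguous.
            return ggml_is_contiguous(op->src[0]);
        default:
            return true; // other ops elided in this sketch
    }
}
```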
@slaren, thanks. Can you point me to the files where the CUDA implementations are defined? Is there any plan or PR in progress for CUDA support for Mamba models? Also, does the CPU backend have full kernel support for the Mamba-specific operations (like Mamba's parallel scan algorithm)?
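For context on the scan question, below is a toy sketch (not ggml code) of the selective-scan recurrence that Mamba's scan operation evaluates, with a single channel and heavily simplified shapes and discretization. A sequential loop like this is sufficient for correctness; a parallel (Blelloch-style) scan is an optimization:

```cpp
#include <cmath>
#include <vector>

// Toy single-channel selective scan: state size N, sequence length T.
std::vector<float> ssm_scan(const std::vector<float> & x,   // inputs, length T
                            const std::vector<float> & dt,  // per-step deltas, length T
                            const std::vector<float> & A,   // state decay, size N
                            const std::vector<float> & B,   // input projection, size N
                            const std::vector<float> & C) { // output projection, size N
    const size_t T = x.size(), N = A.size();
    std::vector<float> h(N, 0.0f), y(T, 0.0f);
    for (size_t t = 0; t < T; ++t) {       // sequential over time steps
        for (size_t n = 0; n < N; ++n) {
            // discretized recurrence: h = exp(dt*A)*h + dt*B*x (simplified)
            h[n] = std::exp(dt[t] * A[n]) * h[n] + dt[t] * B[n] * x[t];
            y[t] += C[n] * h[n];           // project state to output
        }
    }
    return y;
}
```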
What happened?
Hello, I encountered an error when running llama.cpp with a Falcon Mamba model whose layers are offloaded to the GPU.
Steps to reproduce:
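(The exact command was not preserved here; a minimal reproduction along these lines, with the model path and -ngl value as assumptions:)

```bash
# Offload layers of a BF16 Falcon Mamba GGUF to the GPU (path and
# layer count are illustrative)
./llama-cli -m falcon-mamba-7b-BF16.gguf -ngl 33 -p "Hello"
```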
Observations:
Is there support for Falcon Mamba on the GPU?
Name and Version
version: 3902 (c81f3bbb) built with cc (GCC) 11.4.1 20231218 (Red Hat 11.4.1-3) for aarch64-redhat-linux
What operating system are you seeing the problem on?
Linux
Relevant log output