cmp-nct opened this issue 1 year ago
I was mainly considering the feedback from some people that there were already too many quantization options after the addition of the k-quants when I decided to make the 64 block size a compile-time option. But I can see that this is not very ergonomic for Falcon users. Let me think about a better solution.
Oh, and concerning fp16, I agree with you that it would be better if we standardized on fp16 for CUDA.
Great to hear :) The amount of changes and features you commit regularly is astonishing.
In hindsight I think my 16-bit modifications to the dequantizers may have overshot the mark; it might have been possible to just create a single wrapper that converts the kernels. What I did to make it 16 bit was surely clumsy compared to the best possible solution: I had not done any CUDA before and I probably should have spent more time planning it out. So we currently have both a 32-bit and a 16-bit variant of every block and row dequantization kernel, for the K-quants as well as the traditional Q types, which is a lot to maintain.
In the long run I'd personally prefer to cut all the 32-bit paths out of the CUDA code and go with half precision, but all the sub-functions currently operate on 32 bit, so it's not a quick change.
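For illustration, here is a minimal sketch of what such a single wrapper could look like. The `dequantize_fp32_t` signature is an assumption, loosely modeled on the existing fp32 `dequantize_block` pattern; the real per-type index math differs, so this is a sketch of the idea, not the actual code:

```cuda
#include <cuda_fp16.h>

// Hypothetical signature for a per-element fp32 dequantizer: given the
// quantized data, a block index and a quant index, it produces two fp32
// values (the pattern the existing fp32 kernels follow).
typedef void (*dequantize_fp32_t)(const void * vx, int ib, int iqs,
                                  float & v0, float & v1);

// One generic kernel wraps any such fp32 dequantizer and stores fp16,
// so the per-type kernels would not need a second, duplicated fp16 copy.
template <int qk, dequantize_fp32_t dequantize>
static __global__ void dequantize_block_f16(const void * vx, __half * y, int k) {
    const int i = 2 * (blockIdx.x * blockDim.x + threadIdx.x);
    if (i >= k) return;

    const int ib  = i / qk;        // which block
    const int iqs = (i % qk) / 2;  // quant index inside the block

    float v0, v1;
    dequantize(vx, ib, iqs, v0, v1); // existing 32-bit math, unchanged

    y[i + 0] = __float2half(v0);     // convert only at the store
    y[i + 1] = __float2half(v1);
}
```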
Would a global block_size 64/256 variable introduce a downgrade in performance? Optimal (from a usability point of view) would be if the quantized data itself contained the information about its super-block size and the dequantizer just adapted based on that.
The quantization type is known. There is no need for a global variable. All that is needed is to make separate types with 64 and 256 block sizes, and then decide which one to use when quantizing. After that everything will just work.
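A rough sketch of that idea follows. The struct layouts are simplified placeholders (the real `block_q4_K` packs per-sub-block scales and mins), and the `_256`/`_64` names and the enum are hypothetical, not actual llama.cpp types:

```cuda
#include <stdint.h>
#include <cuda_fp16.h>

// Two distinct quantization types, one per super-block size.
typedef struct {
    __half  d;            // super-block scale
    __half  dmin;         // super-block min
    uint8_t scales[12];   // packed sub-block scales/mins
    uint8_t qs[256 / 2];  // 4-bit quants, 256 weights per super-block
} block_q4_K_256;

typedef struct {
    __half  d;            // super-block scale
    uint8_t scales[2];    // fewer sub-blocks, fewer packed scales
    uint8_t qs[64 / 2];   // 4-bit quants, 64 weights per super-block
} block_q4_K_64;

// The block size is baked into the type, so the tensor's type enum picks
// the matching dequantizer at runtime: no global, no compile-time switch.
enum my_qtype { MY_TYPE_Q4_K_256, MY_TYPE_Q4_K_64 };

static void dequantize_row(enum my_qtype t, const void * x, float * y, int k) {
    switch (t) {
        case MY_TYPE_Q4_K_256: /* dequantize_row_q4_K_256(x, y, k); */ break;
        case MY_TYPE_Q4_K_64:  /* dequantize_row_q4_K_64(x, y, k);  */ break;
    }
    (void) x; (void) y; (void) k; // dequantizer bodies elided in this sketch
}
```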
A large patch was just integrated into llama.cpp (https://github.com/ggerganov/llama.cpp/pull/2001), another stunning job by @ikawrakow.
In the long run we need it; K-quants are better for 7B and have more flexibility, but two obstacles need to be solved:

1) We need to modify that PR so it's not a compile-time switch anymore; it needs to support both the 256 and the 64 block size, either by splitting and duplicating the types or by using a global variable instead of the define. Otherwise we'd need distinctly compiled binaries for 7B and 40B.

2) These are 32-bit dequantizers, and we use 16 bit for cuBLAS to save 50% VRAM. It's not a huge thing to change, but it doubles the kernels (again) and I'm a bit afraid of maintaining so many of them. Maybe instead of duplicating all kernels from 32 to 16 bit it would be possible to write a wrapper: let the kernels work in 32 bit and convert the result into half precision (see the sketch after this list). Given the parallelization, that wouldn't require much VRAM.
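A minimal sketch of that second point, assuming the existing fp32 dequantizers stay untouched and the conversion runs as a cheap second pass; the kernel name below and the `tmp_f32` scratch buffer are hypothetical:

```cuda
#include <cuda_fp16.h>

// Trivial fp32 -> fp16 conversion pass. The fp32 scratch buffer only has
// to cover the tile currently being dequantized, so the extra VRAM cost
// stays small compared to keeping full fp32 copies around.
static __global__ void convert_f32_to_f16(const float * x, __half * y, int k) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < k) {
        y[i] = __float2half(x[i]);
    }
}

// Host-side usage sketch; "dequantize_block_q4_K" and "tmp_f32" stand in
// for an existing fp32 kernel and a scratch allocation:
//
//   dequantize_block_q4_K<<<grid, block, 0, stream>>>(vx, tmp_f32, k);
//   convert_f32_to_f16<<<(k + 255) / 256, 256, 0, stream>>>(tmp_f32, y_f16, k);
```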
I'm a bit afraid of investing hours integrating such custom variants in case another big push comes from upstream.