Support for 4 bit Quantization

vikigenius commented 1 year ago

Language model progress has been rapid recently and with the llama weights being released, so much progress is being made on the c++ side

https://github.com/ggerganov/llama.cpp

I see that fp16 is on the roadmap soon.

But it might also be a good idea to consider support for 4 bit quantization and related techniques. Is that something that will be considered?

coreylowman commented 1 year ago

Does anyone know how they represent tensors with 4 bit data? Would this be some packed structure where they store 2 i4s with a u8?

struct i4x2(u8)

nkoppel commented 1 year ago

Here is a relevant description of the implementation llama.cpp uses to represent floats in 4 bits. It essentially boils down to storing some number of 4 bit integers along with a f32 scaling factor and an optional f32 offset. From what I have read of source code, it seems possible to do a lot of the math highly efficiently on the cpu using SIMD packed 8/16 bit integers, and without touching floating point at all.

While this representation is excellent for specialized inference libraries, I don't think that it's practical for a generalist library like dfdx, because dfdx must deal with strided indexing and must support cuda. Furthermore, I'm not sure how efficient we could make operations on this representation, considering that simd support in stable rust is very limited.

vikigenius commented 1 year ago

Yep, I was also looking into this, this would be very nice to have, but I am not sure if we can make it even remotely approach the efficiency of what the C++ people are doing, considering how general purpose dfdx is.

I think the SIMD concerns are not that big of a deal, since dfdx heavily relies on nightly for some features already anyway and the SIMD support there is ok from what I have seen so far.

But I take your point on how specialized this is, and I am not sure if it is worth the effort to have this representation as an option for dfdx tensors.

coreylowman / dfdx

Support for 4 bit Quantization #580