fmac2000 opened 1 year ago
Hello,
Thank you for sharing this paper!
At this time I don't plan on integrating INT4, which would require using CUTLASS to define custom kernels. We currently use cuBLAS for matrix multiplication.
Would it be reasonable to implement this as a CPU-only optimization? GGML supports this on CPU, but I'm not sure if that approach makes sense here or not.
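For context, here is a minimal sketch of the block-wise scheme GGML uses for INT4 on CPU (roughly Q4_0-style: one scale per block, two 4-bit values packed per byte). All names and the exact layout here are illustrative assumptions, not CTranslate2 or GGML code:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

constexpr int kBlockSize = 32;

// One quantized block: a float scale plus 32 weights at 4 bits each.
struct BlockQ4 {
  float scale;
  uint8_t packed[kBlockSize / 2];  // two 4-bit values per byte
};

BlockQ4 quantize_block(const float* x) {
  // Derive the scale from the largest-magnitude value in the block.
  float max_abs = 0.f;
  for (int i = 0; i < kBlockSize; ++i)
    max_abs = std::max(max_abs, std::fabs(x[i]));

  BlockQ4 block;
  block.scale = max_abs / 7.f;  // map [-max_abs, max_abs] onto [-7, 7]
  const float inv = block.scale != 0.f ? 1.f / block.scale : 0.f;

  auto q = [&](float v) -> uint8_t {
    // Quantize to signed 4-bit, stored with an offset of 8.
    int qi = static_cast<int>(std::lround(v * inv)) + 8;
    return static_cast<uint8_t>(std::min(std::max(qi, 0), 15));
  };

  for (int i = 0; i < kBlockSize; i += 2)
    block.packed[i / 2] = q(x[i]) | (q(x[i + 1]) << 4);
  return block;
}

void dequantize_block(const BlockQ4& block, float* out) {
  for (int i = 0; i < kBlockSize; i += 2) {
    const uint8_t b = block.packed[i / 2];
    out[i]     = (static_cast<int>(b & 0x0F) - 8) * block.scale;
    out[i + 1] = (static_cast<int>(b >> 4) - 8) * block.scale;
  }
}
```

On CPU the kernel would dequantize (or use integer dot products) block by block inside the GEMM, which is essentially what GGML's quantized matmul routines do.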
Hi,
It would be great to have the possibility to integrate INT4 quantization, given the very interesting results in terms of performance and inference speed!
I see that the last few versions of OpenNMT-py have added support for 4-bit and other quantization methods. https://forum.opennmt.net/t/opennmt-py-v3-3-released-following-3-2-with-plenty-of-new-features/5366
Might any of that be integrated into CTranslate2?
@guillaumekln Yes, 4-bit quantization (on CPU) is a much-needed feature. Are there any plans to take this up?
Or maybe @ebraraktas can go one step further and implement 2-bit and 3-bit quantization by taking cues from https://github.com/intel/neural-speed/pull/178
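To make the "one step further" concrete: the core storage trick for sub-4-bit weights is just denser bit-packing on top of the same per-block scales. A minimal sketch for the 2-bit case (hypothetical helper names, unrelated to the neural-speed or CTranslate2 APIs):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Pack values already quantized to [0, 3] (2 bits each), four per byte.
std::vector<uint8_t> pack_2bit(const std::vector<uint8_t>& q) {
  std::vector<uint8_t> packed((q.size() + 3) / 4, 0);
  for (std::size_t i = 0; i < q.size(); ++i)
    packed[i / 4] |= static_cast<uint8_t>((q[i] & 0x3) << ((i % 4) * 2));
  return packed;
}

// Recover the 2-bit values; a real kernel would fuse this into the GEMM
// rather than materializing the unpacked tensor.
std::vector<uint8_t> unpack_2bit(const std::vector<uint8_t>& packed,
                                 std::size_t n) {
  std::vector<uint8_t> q(n);
  for (std::size_t i = 0; i < n; ++i)
    q[i] = (packed[i / 4] >> ((i % 4) * 2)) & 0x3;
  return q;
}
```

3-bit is messier because values straddle byte boundaries, which is part of why these sub-byte schemes need hand-written packing and unpacking kernels.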
Hello Authors,
I apologise for asking a question unrelated to an issue with the repo; however, would you consider supporting a newer paradigm I came across whilst reading a recent paper?
It looks incredibly promising and rather well written, I must say, especially considering the performance achieved at such a low precision. Is there anyone on the team able to give this a shot?