RomeoV opened 1 year ago
It's widely used in some transformer implementations like llama.cpp, but it's currently low on my priority list to support in this package.
Small update: it seems that CUDA.jl supports Float16 natively, but Float8 and Float4 are not even datatypes in Base Julia. There is an issue in CUDA.jl about supporting Float8, but it doesn't seem to be active.
However, without looking into the exact specifics, it seems that supporting Float16 should be possible by loading the weights from a Float16-quantized model. Grepping for Float32 in this repo suggests one would need to change the load.jl files for each model.
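To make that last point a bit more concrete, here's a rough sketch of the kind of post-loading conversion I have in mind; the model structure below is just a hypothetical stand-in for whatever the load.jl files produce:

```julia
using Functors  # fmap recursively walks nested (Named)Tuples and structs

# Hypothetical stand-in for a model as loaded from a checkpoint in Float32.
model = (W = rand(Float32, 4, 8), b = rand(Float32, 4))

# Convert every Float32 array to Float16 and leave everything else untouched.
to_f16(x) = x isa AbstractArray{Float32} ? Float16.(x) : x
model_f16 = fmap(to_f16, model)

@assert eltype(model_f16.W) == Float16
```

This only changes the stored weights; whether the forward pass then actually runs in Float16 depends on how each layer handles the converted arrays.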
Hello, first of all thanks for this awesome package - this is very impressive!
We often run into resource constraints when running larger models from huggingface, even just for inference. A common strategy has been to apply quantization to model weights before running them.
Do you know if quantization has been successfully applied at all in the Julia ML ecosystem? Is this something that might potentially be part of this package in the future?