chengchingwen / Transformers.jl

Julia Implementation of Transformer models
MIT License

State of quantization #154

Open RomeoV opened 1 year ago

RomeoV commented 1 year ago

Hello, first of all thanks for this awesome package - this is very impressive!

We often run into resource constraints when running larger models from HuggingFace, even just for inference. A common strategy has been to apply quantization to model weights before running them.
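
For illustration, here is a minimal sketch of what we have in mind (symmetric per-tensor Int8 quantization; all names here are just for the example, nothing Transformers.jl-specific):

```julia
# Symmetric per-tensor Int8 quantization of a single weight matrix.
# `W` stands in for any Float32 weight array.
W = randn(Float32, 4, 4)

scale = maximum(abs, W) / typemax(Int8)               # map the largest weight to ±127
Wq    = round.(Int8, clamp.(W ./ scale, -127, 127))   # quantized storage
Ŵ     = Float32.(Wq) .* scale                         # dequantized values used at inference

maximum(abs, W .- Ŵ)   # approximation error stays on the order of scale / 2
```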

Do you know if quantization has been successfully applied at all in the Julia ML ecosystem? Is this something that might potentially be part of this package in the future?

chengchingwen commented 1 year ago

It's widely used in some transformer implementations like llama.cpp, but supporting it in this package is currently low on my priority list.

RomeoV commented 1 year ago

Small update: it seems that CUDA.jl supports Float16 natively, but Float8 and Float4 are not even datatypes in Base Julia -- although there is an issue in CUDA.jl for supporting Float8, it doesn't seem to be active.
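
For example, a plain Float16 array on the GPU seems to work fine (just a quick sanity check, assuming a CUDA-capable device; nothing Transformers.jl-specific):

```julia
using CUDA

x = CuArray(rand(Float16, 8, 8))   # Float16 arrays load onto the GPU without issue
y = CuArray(rand(Float16, 8, 8))
x * y                              # Float16 matmul works on recent GPUs / CUDA.jl versions
```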

However, without looking into the exact specifics, it seems that supporting Float16 should be possible by loading the weights from a Float16-quantized model.
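
Roughly something like the following sketch, assuming the loaded model is a Functors-compatible tree of Float32 arrays (`model` here is just a placeholder, not the package's actual loading path):

```julia
using Functors

# `model` is a placeholder for an already-loaded Float32 model.
# fmap walks the parameter tree and converts every Float32 array to Float16,
# leaving everything else (config fields, integer indices, ...) untouched.
to_f16(x) = x isa AbstractArray{Float32} ? Float16.(x) : x
model_f16 = fmap(to_f16, model)
```

If that works, moving `model_f16` to the GPU should then look the same as in the Float32 case.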

RomeoV commented 1 year ago

Grepping for Float32 in this repo seems to suggest one would need to change