boun-tabi-LMG / turkish-lm-tuner

Turkish LM Tuner
https://boun-tabi-lmg.github.io/turkish-lm-tuner/
MIT License

Feature Request: GGUF format support #69

Open orkutmuratyilmaz opened 4 months ago

orkutmuratyilmaz commented 4 months ago

Hello and thanks for this beautiful repo,

Do you have plans to provide a GGUF file? It would be great if we could have one.

Best, Orkut

onurgu commented 4 months ago

Hi, thanks for the interest. We're working on it 👍🏼

helizac commented 3 weeks ago

Hello, you can find GGUF support at helizac/TURNA_GGUF and a usage example in TURNA_GGUF_USAGE.ipynb.

Currently only CPU usage is supported; CUDA support will be added once huggingface/candle supports it. For more information, see this related issue.

llama.cpp does not support quantized T5 models at the moment, but support will be added if that changes.

I recommend using Q8_1 or Q8K models for efficiency. At the moment, these models generate 5-6 tokens per second.

gokceuludogan commented 3 weeks ago

That's great news! Thank you for your contribution. We look forward to the implementation of CUDA support.

onurgu commented 3 weeks ago

Thank you @helizac! How did you do this? The llama.cpp repo did not support T5 models; I see there were some developments yesterday:

https://github.com/ggerganov/llama.cpp/issues/5763

Did you do it yourself? If so, where is the code?

helizac commented 3 weeks ago

Hello, unfortunately I did not make the development in the llama.cpp issue mentioned above, but I will try that branch and report back here. I implemented it in Rust with the huggingface/candle framework. I saw that CUDA support could be provided in some of the framework's examples, but I ran into problems in the implementation. I think CUDA support could be added with a few changes to: https://github.com/huggingface/candle/blob/main/candle-examples/examples/quantized-t5/main.rs

Related issue: https://github.com/huggingface/candle/issues/2266
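For reference, here is a rough, CPU-only sketch of what the usage looks like with candle's quantized-t5 pieces. The file names (turna-q8.gguf, config.json, tokenizer.json) are placeholders, and the exact API (t5::VarBuilder::from_gguf, encode/decode, the Config fields) follows my reading of the candle example and may differ between candle versions, so treat the notebook below as the reference. Cargo dependencies would be candle-core (aliased as candle), candle-transformers, tokenizers, serde_json and anyhow.

```rust
// Rough sketch, CPU only, modelled on candle-examples/examples/quantized-t5.
// File names are placeholders; the exact signatures (VarBuilder::from_gguf,
// encode/decode, Config fields) may differ between candle versions.
use candle::{Device, Tensor};
use candle_transformers::models::quantized_t5 as t5;
use tokenizers::Tokenizer;

fn main() -> anyhow::Result<()> {
    let device = Device::Cpu;

    // config.json and tokenizer.json come from the original HF repo,
    // turna-q8.gguf is the quantized export (placeholder names).
    let config: t5::Config =
        serde_json::from_str(&std::fs::read_to_string("config.json")?)?;
    let vb = t5::VarBuilder::from_gguf("turna-q8.gguf", &device)?;
    let mut model = t5::T5ForConditionalGeneration::load(vb, &config)?;

    let tokenizer = Tokenizer::from_file("tokenizer.json").map_err(anyhow::Error::msg)?;
    let prompt_ids: Vec<u32> = tokenizer
        .encode("Bir varmış, bir yokmuş", true)
        .map_err(anyhow::Error::msg)?
        .get_ids()
        .to_vec();
    let input_ids = Tensor::new(&prompt_ids[..], &device)?.unsqueeze(0)?;

    // Encode the prompt once, then decode greedily token by token.
    let encoder_output = model.encode(&input_ids)?;
    let mut output_ids: Vec<u32> = vec![config.decoder_start_token_id.unwrap_or(0) as u32];
    for _ in 0..64 {
        let decoder_ids = Tensor::new(&output_ids[..], &device)?.unsqueeze(0)?;
        let logits = model.decode(&decoder_ids, &encoder_output)?.squeeze(0)?;
        // Keep only the logits of the last position if all positions are returned.
        let logits = if logits.rank() == 2 {
            logits.get(logits.dim(0)? - 1)?
        } else {
            logits
        };
        let next = logits.argmax(0)?.to_scalar::<u32>()?;
        if next as usize == config.eos_token_id {
            break;
        }
        output_ids.push(next);
    }
    println!(
        "{}",
        tokenizer.decode(&output_ids, true).map_err(anyhow::Error::msg)?
    );
    Ok(())
}
```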

The conversion process to .gguf (currently CPU-only) is below.

RUST_GGUF_CONVERT: https://colab.research.google.com/drive/1s97zTs8hfT0wyGTDHvs8cVOm9mVgXd9G?usp=sharing

With the methods in this notebook, TURNA can be used in .gguf format.

helizac commented 3 weeks ago

So, I tried the new t5 branch -> https://github.com/fairydreaming/llama.cpp/tree/t5, but it is not suitable for TURNA at the moment.

First, the t5 branch expects a spiece.model file, but TURNA uses Hugging Face tokenizers, so I adapted the code to use the HF tokenizer. Then I hit a second problem: in TENSOR_MODELS the tensors are defined as MODEL_TENSOR.DEC_FFN_UP: "decoder.block.{bid}.layer.2.DenseReluDense.wi" and MODEL_TENSOR.ENC_FFN_UP: "encoder.block.{bid}.layer.1.DenseReluDense.wi", which does not work because TURNA expects:

INFO:hf-to-gguf:dec.blk.0.ffn_up.weight, torch.float32 --> F32, shape = {1024, 2816}
INFO:hf-to-gguf:dec.blk.0.dense_relu_dense.wi_1.weight, torch.float32 --> F32, shape = {1024, 2816}
INFO:hf-to-gguf:enc.blk.0.ffn_up.weight, torch.float32 --> F32, shape = {1024, 2816}
INFO:hf-to-gguf:enc.blk.0.dense_relu_dense.wi_1.weight, torch.float32 --> F32, shape = {1024, 2816}

I defined the tensor mappings on my own and could export a .gguf file, but llama.cpp will not load it: "error loading model vocabulary: Index out of array bounds in XCDA array!". Fixing this would require examining the loading code in detail and rewriting the relevant functions in llama.cpp.
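As a side note, when debugging this kind of naming mismatch it helps to list what an exported .gguf actually contains and compare it with the names the loader expects. A small sketch using candle's GGUF reader (candle::quantized::gguf_file from candle-core; the file name is a placeholder and field names may differ between versions):

```rust
// Sketch: dump the metadata keys and tensor names stored in a GGUF file so
// they can be compared with the names the loader expects (e.g.
// dec.blk.0.ffn_up.weight vs. dec.blk.0.dense_relu_dense.wi_1.weight).
// "turna.gguf" is a placeholder; the reader lives in candle-core (aliased
// here as `candle`).
use candle::quantized::gguf_file;

fn main() -> anyhow::Result<()> {
    let mut file = std::fs::File::open("turna.gguf")?;
    let content = gguf_file::Content::read(&mut file)?;

    // Metadata entries (architecture name, tokenizer settings, ...).
    for key in content.metadata.keys() {
        println!("metadata: {key}");
    }
    // Tensor names with their shapes and quantization types.
    for (name, info) in content.tensor_infos.iter() {
        println!("{name}: shape {:?}, dtype {:?}", info.shape, info.ggml_dtype);
    }
    Ok(())
}
```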

For now, the huggingface/candle Rust implementation described above is more comfortable to use. If GPU support arrives, the model can easily be used this way: https://colab.research.google.com/drive/1s97zTs8hfT0wyGTDHvs8cVOm9mVgXd9G?usp=sharing (RUST_GGUF_CONVERT)