Closed: klosax closed this issue 1 year ago
Also, trying to use offloading with any F32 model:
GGML_ASSERT: ggml-cuda.cu:5261: false
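(For context, `GGML_ASSERT(false)` is the pattern ggml uses to mark a dispatch case that simply isn't implemented, so this message points at a missing F32 offload path rather than bad data. A minimal sketch of the macro, paraphrased from ggml.h and possibly differing slightly between versions:)

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

// Paraphrased from ggml.h (exact formatting may differ between versions):
// on failure, print file, line, and the failing expression, then abort.
#define GGML_ASSERT(x)                                      \
    do {                                                    \
        if (!(x)) {                                         \
            fprintf(stderr, "GGML_ASSERT: %s:%d: %s\n",     \
                    __FILE__, __LINE__, #x);                \
            abort();                                        \
        }                                                   \
    } while (0)

int main(void) {
    // An unhandled case in the CUDA dispatch code ends in GGML_ASSERT(false),
    // producing messages like "GGML_ASSERT: ggml-cuda.cu:5261: false".
    GGML_ASSERT(false);
    return 0;
}
```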
The model has shapes that are not divisible by 32, which is a hard requirement for quantized models (see the Q8_0 block sketch after the tensor listing):
llama_model_loader: loaded meta data with 19 key-value pairs and 48 tensors from /Users/ggerganov/development/huggingface/tinyllamas-stories-gguf/tinyllamas-stories-260k-q8_0.gguf (version GGUF V1 (latest))
llama_model_loader: - tensor 0: token_embd.weight q8_0 [ 64, 512, 1, 1 ]
llama_model_loader: - tensor 1: blk.0.attn_q.weight q8_0 [ 64, 64, 1, 1 ]
llama_model_loader: - tensor 2: blk.0.attn_k.weight q8_0 [ 64, 32, 1, 1 ]
llama_model_loader: - tensor 3: blk.0.attn_v.weight q8_0 [ 64, 32, 1, 1 ]
llama_model_loader: - tensor 4: blk.0.attn_output.weight q8_0 [ 64, 64, 1, 1 ]
llama_model_loader: - tensor 5: blk.0.ffn_gate.weight q8_0 [ 64, 172, 1, 1 ]
llama_model_loader: - tensor 6: blk.0.ffn_down.weight q8_0 [ 172, 64, 1, 1 ]
llama_model_loader: - tensor 7: blk.0.ffn_up.weight q8_0 [ 64, 172, 1, 1 ]
llama_model_loader: - tensor 8: blk.0.attn_norm.weight f32 [ 64, 1, 1, 1 ]
llama_model_loader: - tensor 9: blk.0.ffn_norm.weight f32 [ 64, 1, 1, 1 ]
llama_model_loader: - tensor 10: blk.1.attn_q.weight q8_0 [ 64, 64, 1, 1 ]
llama_model_loader: - tensor 11: blk.1.attn_k.weight q8_0 [ 64, 32, 1, 1 ]
llama_model_loader: - tensor 12: blk.1.attn_v.weight q8_0 [ 64, 32, 1, 1 ]
llama_model_loader: - tensor 13: blk.1.attn_output.weight q8_0 [ 64, 64, 1, 1 ]
llama_model_loader: - tensor 14: blk.1.ffn_gate.weight q8_0 [ 64, 172, 1, 1 ]
llama_model_loader: - tensor 15: blk.1.ffn_down.weight q8_0 [ 172, 64, 1, 1 ]
llama_model_loader: - tensor 16: blk.1.ffn_up.weight q8_0 [ 64, 172, 1, 1 ]
llama_model_loader: - tensor 17: blk.1.attn_norm.weight f32 [ 64, 1, 1, 1 ]
llama_model_loader: - tensor 18: blk.1.ffn_norm.weight f32 [ 64, 1, 1, 1 ]
llama_model_loader: - tensor 19: blk.2.attn_q.weight q8_0 [ 64, 64, 1, 1 ]
llama_model_loader: - tensor 20: blk.2.attn_k.weight q8_0 [ 64, 32, 1, 1 ]
llama_model_loader: - tensor 21: blk.2.attn_v.weight q8_0 [ 64, 32, 1, 1 ]
llama_model_loader: - tensor 22: blk.2.attn_output.weight q8_0 [ 64, 64, 1, 1 ]
llama_model_loader: - tensor 23: blk.2.ffn_gate.weight q8_0 [ 64, 172, 1, 1 ]
llama_model_loader: - tensor 24: blk.2.ffn_down.weight q8_0 [ 172, 64, 1, 1 ]
llama_model_loader: - tensor 25: blk.2.ffn_up.weight q8_0 [ 64, 172, 1, 1 ]
llama_model_loader: - tensor 26: blk.2.attn_norm.weight f32 [ 64, 1, 1, 1 ]
llama_model_loader: - tensor 27: blk.2.ffn_norm.weight f32 [ 64, 1, 1, 1 ]
llama_model_loader: - tensor 28: blk.3.attn_q.weight q8_0 [ 64, 64, 1, 1 ]
llama_model_loader: - tensor 29: blk.3.attn_k.weight q8_0 [ 64, 32, 1, 1 ]
llama_model_loader: - tensor 30: blk.3.attn_v.weight q8_0 [ 64, 32, 1, 1 ]
llama_model_loader: - tensor 31: blk.3.attn_output.weight q8_0 [ 64, 64, 1, 1 ]
llama_model_loader: - tensor 32: blk.3.ffn_gate.weight q8_0 [ 64, 172, 1, 1 ]
llama_model_loader: - tensor 33: blk.3.ffn_down.weight q8_0 [ 172, 64, 1, 1 ]
llama_model_loader: - tensor 34: blk.3.ffn_up.weight q8_0 [ 64, 172, 1, 1 ]
llama_model_loader: - tensor 35: blk.3.attn_norm.weight f32 [ 64, 1, 1, 1 ]
llama_model_loader: - tensor 36: blk.3.ffn_norm.weight f32 [ 64, 1, 1, 1 ]
llama_model_loader: - tensor 37: blk.4.attn_q.weight q8_0 [ 64, 64, 1, 1 ]
llama_model_loader: - tensor 38: blk.4.attn_k.weight q8_0 [ 64, 32, 1, 1 ]
llama_model_loader: - tensor 39: blk.4.attn_v.weight q8_0 [ 64, 32, 1, 1 ]
llama_model_loader: - tensor 40: blk.4.attn_output.weight q8_0 [ 64, 64, 1, 1 ]
llama_model_loader: - tensor 41: blk.4.ffn_gate.weight q8_0 [ 64, 172, 1, 1 ]
llama_model_loader: - tensor 42: blk.4.ffn_down.weight q8_0 [ 172, 64, 1, 1 ]
llama_model_loader: - tensor 43: blk.4.ffn_up.weight q8_0 [ 64, 172, 1, 1 ]
llama_model_loader: - tensor 44: blk.4.attn_norm.weight f32 [ 64, 1, 1, 1 ]
llama_model_loader: - tensor 45: blk.4.ffn_norm.weight f32 [ 64, 1, 1, 1 ]
llama_model_loader: - tensor 46: output_norm.weight f32 [ 64, 1, 1, 1 ]
llama_model_loader: - tensor 47: output.weight q8_0 [ 64, 512, 1, 1 ]
This model is so small (~1MB in F32) that quantization most likely won't have any practical use here anyway.
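To make the divisibility requirement concrete, here is a minimal sketch of the Q8_0 layout (the QK8_0 constant and block_q8_0 struct follow ggml.c, though exact definitions may vary between versions; row_quantizable_q8_0 is a hypothetical helper added for illustration):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

// Sketch of ggml's Q8_0 layout (names follow ggml.c; exact definitions
// may vary between versions): each block stores QK8_0 = 32 consecutive
// weights with one shared fp16 scale.
#define QK8_0 32

typedef struct {
    uint16_t d;         // fp16 scale shared by the whole block
    int8_t   qs[QK8_0]; // 32 quantized weights
} block_q8_0;           // 2 + 32 = 34 bytes per 32 weights

// Hypothetical helper: a row of ne0 elements is quantizable only if it
// splits into whole blocks, i.e. ne0 is a multiple of 32.
static bool row_quantizable_q8_0(int64_t ne0) {
    return ne0 % QK8_0 == 0;
}

int main(void) {
    // Row lengths from the tensor listing above: the 64-wide rows are
    // fine, but ffn_down's 172-wide rows are not (172 = 5*32 + 12).
    printf("ne0 =  64: %s\n", row_quantizable_q8_0(64)  ? "ok" : "not divisible by 32");
    printf("ne0 = 172: %s\n", row_quantizable_q8_0(172) ? "ok" : "not divisible by 32");
    printf("bytes per block: %zu\n", sizeof(block_q8_0));
    return 0;
}
```

Quantization walks each row in whole blocks, so any tensor whose first dimension is not a multiple of the block size trips the assert.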
> The model has shapes that are not divisible by 32, which is a hard requirement for quantized models. This model is so small (~1MB in F32) that quantization most likely won't have any practical use here anyway.
No, there is no use for quantization on a small model like this.
What about CUDA offloading of F32 models?
GGML_ASSERT: ggml-cuda.cu:5261: false
Trying to use a quantized version of the ultra-small (260k) tinyllamas model from here: https://huggingface.co/klosax/tinyllamas-stories-gguf/blob/main/tinyllamas-stories-260k-f32.gguf
F32 and F16 work, so it's no big deal on a small model like this, but this might be a bug or be easy to fix?
Using cuBLAS w/o offloading:
CUDA error 12 at ggml-cuda.cu:5811: invalid pitch argument
Using CUDA offloading:
GGML_ASSERT: ggml-cuda.cu:4981: false
w/o BLAS: ppl > 30k
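For reference on that last number: perplexity is the exponentiated mean negative log-likelihood, and a uniform guess over this model's 512-token vocabulary (readable from token_embd.weight [64, 512] in the listing above) would already give PPL = 512, so ppl > 30k suggests a broken compute path rather than just a weak model:

```latex
\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(x_i \mid x_{<i})\right),
\qquad
\mathrm{PPL}_{\text{uniform}} = \exp(\log 512) = 512.
```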