bartowski1182 opened this issue 3 months ago
Happens with the 128k variants as well. I tried both!
woops, thanks, forgot and got lazy with pasting links lol
Try all models using https://github.com/ggerganov/llama.cpp/pull/7225 and report any issues
Building now, will report any updates
Also will try running created quants with those changes just to see if it works (need these changes for imatrix though of course)
Will llama.cpp work with blocksparse attention? These models seem to implement it.
Do we also get vision support for Phi-3-Vision? I don't know how much this diverges from other archs like LLaVA.
Will llama.cpp work with blocksparse attention? These models seem to implement it.
Could you remind me what blocksparse attention is?
Edit: No, there is no API for that atm.
@qnixsynapse Is this technique actually used in practice? I don't see how one would choose the attention mask in a reasonable way without the LLM "forgetting" important bits from the context
Normally I would not prefer it; however, I came across this, which caught my attention.
I tried it with #7225 using the 128k variants:
./llama-server --chat-template phi3 -m ../../models/Phi-3-medium-128k-instruct-iMat-GGUF/phi-3-medium-128k-instruct-bf16.gguf &
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is 2+2 ?"
}
]
}'
{"choices":[{"finish_reason":"stop","index":0,"message":{"content":" The result of 2+2 is 4.","role":"assistant"}}],"created":1716312775,"model":"model_name","object":"chat.completion","usage":{"completion_tokens":12,"prompt_tokens":18,"total_tokens":30},"id":"chatcmpl-KvOpfd64IzSt8DR7h3smbyP7PyVc6xPG"}
bf16 gguf creation still fails with:
INFO:hf-to-gguf:Loading model: Phi-3-small-128k-instruct
Traceback (most recent call last):
File "/home/tristand/ai/tools/llama.cpp/convert-hf-to-gguf.py", line 2585, in <module>
main()
File "/home/tristand/ai/tools/llama.cpp/convert-hf-to-gguf.py", line 2563, in main
model_class = Model.from_model_architecture(hparams["architectures"][0])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tristand/ai/tools/llama.cpp/convert-hf-to-gguf.py", line 370, in from_model_architecture
raise NotImplementedError(f'Architecture {arch!r} not supported!') from None
NotImplementedError: Architecture 'Phi3SmallForCausalLM' not supported!
I tried the dumb fix for the Phi3SmallForCausalLM not supported:
diff --git a/convert-hf-to-gguf.py b/convert-hf-to-gguf.py
index 06c89e23..9d6f861a 100755
--- a/convert-hf-to-gguf.py
+++ b/convert-hf-to-gguf.py
@@ -1685,7 +1685,7 @@ class Phi2Model(Model):
self.gguf_writer.add_add_bos_token(False)
-@Model.register("Phi3ForCausalLM")
+@Model.register("Phi3ForCausalLM", "Phi3SmallForCausalLM")
class Phi3MiniModel(Model):
model_arch = gguf.MODEL_ARCH.PHI3
And now it fails with:
INFO:hf-to-gguf:Set model parameters
Traceback (most recent call last):
File "/home/tristand/ai/tools/llama.cpp-fix-phi/convert-hf-to-gguf.py", line 2585, in <module>
main()
File "/home/tristand/ai/tools/llama.cpp-fix-phi/convert-hf-to-gguf.py", line 2567, in main
model_instance.set_gguf_parameters()
File "/home/tristand/ai/tools/llama.cpp-fix-phi/convert-hf-to-gguf.py", line 1791, in set_gguf_parameters
rms_eps = self.find_hparam(["rms_norm_eps"])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tristand/ai/tools/llama.cpp-fix-phi/convert-hf-to-gguf.py", line 113, in find_hparam
raise KeyError(f"could not find any of: {keys}")
KeyError: "could not find any of: ['rms_norm_eps']"
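For what it's worth, a minimal sketch of the kind of tweak that gets past this particular KeyError, assuming the small config exposes its norm epsilon under a different key such as layer_norm_epsilon (Phi-3-small uses regular LayerNorm rather than RMSNorm, so this alone would not produce a correct conversion):

# Hypothetical tweak inside Phi3MiniModel.set_gguf_parameters(); the
# "layer_norm_epsilon" key name is an assumption about the Phi-3-small config,
# and this only silences the KeyError, it does not make the converted model correct.
rms_eps = self.find_hparam(["rms_norm_eps", "layer_norm_epsilon"])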
Using that PR, imatrix works, which likely implies that generation will work. Quants created without the PR don't work, so any that are floating around out there will be broken.
It seems to kinda work for Phi3-medium-128k with #7225, but it breaks along the way with partial offloading (see last line)
Try increasing the context, you are using only 512 so there are probably shifts happening.
That did the trick
Tried with https://github.com/ggerganov/llama.cpp/pull/7225 and it worked, but only with that version (PR 7225).
But if you use main from the latest version to run gguf files created with that branch, it will show this error:
llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 245, got 243
with what model did this work for you? with convert-hf-to-gguf.py? probably medium? or if small, how did you get around tokenizer.model?
Can anyone post working f16?
So far so good on the 4k GGUF, it's able to respond to queries which is good enough for me lol
uploaded here:
https://huggingface.co/bartowski/Phi-3-medium-4k-instruct-GGUF
It's loading now and working great with bartowski/Phi-3-medium-4k-instruct-GGUF/Phi-3-medium-4k-instruct-Q4_K_S.gguf on LM-Studio. Thanks
Looks great with medium, but small seems to need more work. In a first test I can override some gguf_parameters and just use self._set_vocab_qwen(), and it will convert and quantize, but then it won't run and instead throws "llama_model_load: error loading model: check_tensor_dims: tensor 'output.weight' not found".
https://0.0g.gg/?8b6aa2a822f73b75#6dSFckfnxCttPKUX7rX4b35WEdt6woLdK65DTpSWSZ4w
Here is an issue I've been running into. The link above is a paste of the model just completely imploding in on itself from a basic word problem.
Btw I think if you're using something like LM studio you aren't getting the right performance
It fails the tokenizer test of 3333+7777, but using the PR ./main gets it right
Likely need to wait for merge and version bump
I am using a bleeding edge llama.cpp commit and it's doing that, which is odd...
Something is wrong with Phi-3-medium-4k-instruct output, I am often getting weird "B:" out of the blue:
launched via:
server -v -ngl 99 -m Phi-3-medium-4k-instruct-Q6_K.gguf -c 4096 --chat-template chatml
configuration:
using current master (9b3d83318931aa98c487baaa977626931d059e6a).
@MoonRide303
Something is wrong with Phi-3-medium-4k-instruct output, I am often getting weird "B:" out of the blue:
AFAIK the --chat-template parameter is not used for the server web GUI, as it uses the /completions endpoint internally; --chat-template only applies to the /v1/chat/completions endpoint. You need to set the right template in the Prompt template section of the form manually.
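In other words, the web UI builds the prompt text itself and posts it to the plain completion endpoint, so the template has to be baked into the prompt. A rough Python sketch of doing the same thing by hand (assuming the server is on localhost:8080, that the non-OpenAI completion endpoint takes a raw prompt plus n_predict and returns a content field, and that the usual Phi-3 <|user|>/<|end|>/<|assistant|> tags are what the model expects):

import requests

# Hand-built Phi-3 style prompt; the tags are an assumption based on the model
# card, and this is what the web UI's "Prompt template" form has to produce.
prompt = "<|user|>\nWhat is 2+2 ?<|end|>\n<|assistant|>\n"

# Plain completion endpoint: --chat-template is not applied here,
# unlike /v1/chat/completions.
resp = requests.post(
    "http://localhost:8080/completion",
    json={"prompt": prompt, "n_predict": 64},
)
print(resp.json()["content"])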
@MoonRide303 @tristandruyen Just FYI - it looks like that --chat-template issue is being worked on in https://github.com/ggerganov/llama.cpp/issues/7432 - which is great, because I ran into that same problem with Phi 3 mini! The --chat-template workaround worked in my case, but I'm looking forward to the real fix in https://github.com/ggerganov/llama.cpp/pull/7449 being merged.
I'm actually the one working on the --chat-template issue in #7449. However, it seems like @MoonRide303's issue is more related to the web ui not using any model specific templates, not that it's using the wrong one.
The fix I'm working on in #7449 aims to improve the auto-detection of the phi3 template, so users won't have to explicitly specify it using the --chat-template flag. This fix will ensure that llama.cpp automatically detects and uses the appropriate template for the model.
It's important to note that the behavior of the endpoints and the web UI will remain unchanged after my fix is merged. The web UI will still not use any model-specific template, just the auto-detection process will be more reliable.
Will llama.cpp work with blocksparse attention? These models seem to implement it.
Could you remind me what blocksparse attention is?
Edit: No, there is no API for that atm.
@ggerganov thanks for your interest in supporting phi-3-small.
I am the author of the blocksparse attention in phi-3-small. I am not very familiar with ollama, but I can help explain the details.
The kernel is implemented in Triton, but you can find the code that generates the dense version of the attention mask here: https://github.com/linxihui/vllm/blob/eb16d9a382f273c3ed62e4264a42a24f6ba53568/vllm/attention/ops/blocksparse_attention/utils.py#L187C1-L210C57
There is also a vllm paged attention version: https://github.com/linxihui/vllm/blob/main/csrc/attention/attention_kernels.cu#L224-L236
I tested other models with ollama on my mac; it is super responsive and cool. I hope I can have our phi-3-small model on my mac as well!
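For anyone who doesn't want to dig through the Triton code, a rough numpy sketch of what such a dense block-level mask looks like: local blocks along the diagonal plus a vertical stride of globally attended blocks, with a per-head offset standing in for the head sliding mentioned later in the thread. Parameter names and the offset scheme are illustrative, not the exact ones phi-3-small uses:

import numpy as np

def dense_blocksparse_mask(n_blocks, num_local_blocks, vert_stride, head_offset=0):
    # True means "query block i attends to key block j".
    mask = np.zeros((n_blocks, n_blocks), dtype=bool)
    for i in range(n_blocks):
        for j in range(i + 1):  # causal: only current and past blocks
            local = (i - j) < num_local_blocks               # sliding local window
            strided = (j + head_offset) % vert_stride == 0   # periodic "global" columns
            mask[i, j] = local or strided
    return mask

# e.g. 16 blocks, 4 local blocks per query block, every 8th block kept globally
print(dense_blocksparse_mask(16, num_local_blocks=4, vert_stride=8).astype(int))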
@linxihui Thanks for the information. Is my understanding correct that the vllm implementation skips non-attended blocks (i.e. full of -inf), which makes the computation faster? Do you have an estimate of the performance gain if that is the case? This does not lead to lower memory usage, correct?
If my understanding is correct, then I think we can easily support this in the Metal backend as the block-skipping logic is already there
@ggerganov
The current implementation in vllm doesn't improve memory, that's correct. This is because the block table in vllm isn't per head (it is per token). Our attention may attend to all tokens (smaller vertical stride), but on different heads. To save memory, we'd need to change a very big portion of the core vllm, which we don't have the time for. But theoretically, you could save memory with a proper implementation.
Latency/throughput wise, it is faster, as it skips blocks both in prefilling and decoding. The theoretical flops is 1/vert_stride of dense at very large lengths. In the vllm paged attn, we skip the compute of qk and attn*v; the filling of -inf is only for the normalization of the logits. So it should be faster as well. The end-to-end benefit can only be observed at large context lengths, when attention occupies a big proportion of the time compared to other ops and the loading of model weights. E.g., if the model is full of blocksparse attention, it could be more than 4x faster end-to-end in prefilling with 100k context length.
Yes, the logic should be easy to implement in the existing prefilling and decoding code. But make sure you pay close attention to the head sliding part.
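As a rough illustration of that skipping (a toy numpy sketch, not vllm's actual kernel), here is a blocked attention loop that simply never touches key/value blocks whose mask entry is false, instead of materializing -inf scores for them:

import numpy as np

def blocksparse_attention(q, k, v, block_mask, block_size):
    # q: (n_q, d), k/v: (n_kv, d); block_mask: bool (n_q_blocks, n_kv_blocks).
    # Assumes every query block attends to at least one key block (the local
    # window guarantees this); token-level causal masking inside a block is
    # omitted for brevity.
    n_q, d = q.shape
    out = np.zeros_like(q)
    n_q_blocks = (n_q + block_size - 1) // block_size
    n_kv_blocks = (k.shape[0] + block_size - 1) // block_size
    for qi in range(n_q_blocks):
        qs = slice(qi * block_size, min((qi + 1) * block_size, n_q))
        scores, vals = [], []
        for kj in range(n_kv_blocks):
            if not block_mask[qi, kj]:
                continue  # skipped block: no qk and no attn*v compute at all
            ks = slice(kj * block_size, min((kj + 1) * block_size, k.shape[0]))
            scores.append(q[qs] @ k[ks].T / np.sqrt(d))
            vals.append(v[ks])
        s = np.concatenate(scores, axis=1)
        p = np.exp(s - s.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)            # softmax over kept blocks only
        out[qs] = p @ np.concatenate(vals, axis=0)
    return out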
still can't convert phi-3-small :( Phi3SmallForCausalLM unsupported :(
perhaps this can help? https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/phi3_small.py
This issue was closed because it has been inactive for 14 days since being marked as stale.
Please, can you reopen this, we need phi-3 small.
I agree on that, 4k context is simply not enough.
FYI: Microsoft has just released Phi3.5 models, with mini version having 128k context. See https://huggingface.co/collections/microsoft/phi-3-6626e15e9585a200d2d761e3. It doesn't have GGUF quants yet, because... because of this issue. Let's get to it! 💪
Edit: Just tested https://huggingface.co/bartowski/Phi-3.5-mini-instruct-GGUF with a context size of 8192, works well, which fits my use case.
2 new models released from Microsoft:
https://huggingface.co/microsoft/Phi-3-medium-4k-instruct/
https://huggingface.co/microsoft/Phi-3-small-8k-instruct/
Medium uses Phi3ForCausalLM and converts without issue, but when trying to generate it fails with an invalid tensor shape:
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.attn_qkv.weight' has wrong shape; expected 5120, 15360, got 5120, 7680, 1, 1
And then Small uses a new architecture tag, 'Phi3SmallForCausalLM'.
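For what it's worth, the two numbers in that shape error line up with a plain multi-head assumption versus grouped-query attention. Assuming the published Phi-3-medium config (hidden size 5120, 40 query heads, 10 KV heads, head_dim 128), a quick sanity check:

# Assumed Phi-3-medium hyperparameters (from the published config.json)
hidden_size = 5120
n_head = 40
n_head_kv = 10
head_dim = hidden_size // n_head                      # 128

qkv_mha = 3 * hidden_size                             # 15360 -> the "expected" width
qkv_gqa = hidden_size + 2 * n_head_kv * head_dim      # 5120 + 2560 = 7680 -> the "got" width
print(qkv_mha, qkv_gqa)

So the fused QKV tensor in the checkpoint is GQA-sized, while the loader at that point still expected the full 3 x hidden_size layout, which appears to be what #7225 sorts out.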