Open NightMachinery opened 3 months ago
It's certainly a possibility, thanks for raising this. One issue is that it isn't super well supported by the commercial companies, which are focused on chat. But I'll take a look and see how feasible it is.
@ahyatt The legacy completion API also supports fill-in-the-middle completions, using the `suffix` argument to supply the text that comes after the completion.
Here is an example:
```python
from openai import OpenAI

client = OpenAI()

prompt = "def say_hello("
suffix = """):
    print('hi', name)"""

response = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt,
    suffix=suffix,
    max_tokens=10,
)
# response.choices[0].text should be "name"
```
This API is great for an autocomplete model. Unfortunately, OpenAI is the only cloud API for FIM completion that I could find. OpenRouter does not seem to support the `suffix` argument, and OpenAI doesn't support it for GPT-4.
I did some research on FIM support. Some open models also support FIM, though each uses its own prompt format to do so. E.g.:

Code Llama 2:

```
<PRE> {prompt_prefix} <SUF>{prompt_suffix} <MID>
```

DeepSeek:

```
<|fim▁begin|>def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[0]
    left = []
    right = []
<|fim▁hole|>
        if arr[i] < pivot:
            left.append(arr[i])
        else:
            right.append(arr[i])
    return quick_sort(left) + [pivot] + quick_sort(right)<|fim▁end|>
```

StarCoder:

```
<fim_prefix>before <fim_suffix> after<fim_middle>
```
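To illustrate how these templates differ, here is a minimal sketch that renders a prefix/suffix pair into each of the three formats quoted above. The function name and `style` keys are made up for illustration; only the template strings come from the model documentation.

```python
def fim_prompt(style, prefix, suffix):
    """Render a fill-in-the-middle prompt for a few known model families.

    Illustrative helper; `style` keys are not part of any real API.
    """
    if style == "codellama":
        return f"<PRE> {prefix} <SUF>{suffix} <MID>"
    if style == "deepseek":
        # DeepSeek uses U+2581 (lower one-eighth block) in its tokens.
        return f"<|fim\u2581begin|>{prefix}<|fim\u2581hole|>{suffix}<|fim\u2581end|>"
    if style == "starcoder":
        return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"
    raise ValueError(f"unknown FIM style: {style}")
```

Note that the sentinel tokens must match the model's tokenizer exactly, which is why abstracting over them per model family is fiddly.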
Right now, `gpt-3.5-turbo-instruct` still seems to be the best FIM model available. I did not look for unofficial FIM finetunes of the stronger open models such as Llama 3 or Mixtral.

I suggest adding support for `gpt-3.5-turbo-instruct`, as it is straightforward and strong. For the other models, I am unsure whether Emacs is a good place to maintain their FIM prompt templates so that the end user sees a single FIM API. Perhaps this should be done by Ollama.
Note that normal completion (without `suffix`) is supported by tons of models on OpenAI-compatible servers such as Together, Fireworks, etc. So that should be supported by this package, regardless of FIM.
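For concreteness, a completion request to such an OpenAI-compatible endpoint is just a POST body to `/v1/completions` with `model`, `prompt`, and optionally `suffix`; a sketch of building it (the helper name is made up):

```python
def build_completion_request(model, prompt, suffix=None, max_tokens=64):
    # Body for POST /v1/completions on an OpenAI-compatible server.
    # `suffix` is the FIM extension only some backends accept;
    # omitting it gives the widely supported plain prefix completion.
    body = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    if suffix is not None:
        body["suffix"] = suffix
    return body
```

Servers that don't implement `suffix` typically ignore it or reject the request, which is the per-backend variance discussed in this thread.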
I found an open issue on Ollama about abstracting the FIM API:
This is great info, thanks. Having to do per-model work is a bit tricky because in GPT4All, we don't even know what model is being run. I'm not sure how we can handle that at the moment. Barring that problem, it seems possible but messy. I think supporting normal completion seems more likely, but that's also very model-dependent, and so has the same issues as GPT4All.
@ahyatt Thanks. Why are normal completions model-dependent? Even chat-tuned models are internally run as next-token predictors with a hardcoded prompt template, so the completion API should be possible for all local models and some cloud models. The only model dependency is whether the cloud supports this API, but if it doesn't, it will return a 404 error anyway, so we could effectively treat all models as supporting the API.
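To illustrate the point that chat is just completion plus a template, here is a sketch that renders chat messages into a raw prompt, assuming a ChatML-style template (the exact tokens vary by model; this format is an assumption for illustration):

```python
def chat_to_completion_prompt(messages):
    # Render (role, content) pairs the way a ChatML-tuned model sees
    # them, then leave the assistant turn open so a raw completion
    # endpoint generates the reply. Real chat templates differ per model.
    parts = [f"<|im_start|>{role}\n{content}<|im_end|>\n"
             for role, content in messages]
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)
```

Sending the rendered string to a raw completion endpoint reproduces what the chat endpoint does internally, which is why exposing the lower-level API loses nothing.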
For FIM, the Ollama API above uses the underlying FIM API from llama.cpp, so that also might not be model-dependent. Of course, `gpt-3.5-turbo-instruct` does not run on llama.cpp, so the FIM API might need some model dependency.
Does this elisp package use GPT4All under the hood?
My understanding, perhaps flawed: some models are raw foundation models, which basically operate as completion models. But many models (most of the popular ones) are chat models, tuned to give complete responses and not designed to be used for completion. The results would be strange; I don't think you'd get an actual completion. So to use a completion API, you have to use it only with models designed for that.
And about GPT4All, yes, it's one of the options in the `llm` package, so we should, if possible, design things to support it as well. But it's the only provider we support that gives the caller of the `llm` package no control over which model is being used.
@ahyatt Yes, chat-tuned models probably aren't that good for completion, but they can be used anyway. I don't think exposing the lower-level completion API (which can be used with an appropriate template to produce the higher-level chat responses) has any downsides at all. It allows more tinkering and extensibility, if anything.
Given a sufficiently long context, these chat-tuned models might even work as good completers. The chat RLHF lobotomizes the models, but as the context grows, the model reverts to its next-token-prediction roots.
I think we'll have to see how it works in practice. I'll try to look into it in the next few weeks, thanks for bringing it up!
@ahyatt Mistral’s new 22B code model also supports infilling:
This is usable both locally and via the API.
Thanks for the information. I've been taking a look at this, and the support is pretty spotty. I think the basic problem is that although the ability to do completions is how the foundation models work, companies like OpenAI won't release the completion functionality due to the lack of trust/safety filtering on the outputs. It is available in a deprecated API, but I'm not sure that's something we want to depend on.
AFAICT, here's the current state of completion functionality:

- OpenAI: Prefix & infill completion, but with the legacy API.
- Claude: Prefix completion API, not legacy.
- Gemini: Nothing.
- Ollama: Prefix completion API, not legacy. Actual ability to do completions as completions varies by model.
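That summary is roughly the capability table a dispatch layer in the library would need; a sketch (provider keys and flag names are invented for illustration):

```python
# Completion support per provider, as summarized above.
SUPPORT = {
    "openai": {"prefix": True, "infill": True, "legacy_api": True},
    "claude": {"prefix": True, "infill": False, "legacy_api": False},
    "gemini": {"prefix": False, "infill": False, "legacy_api": False},
    "ollama": {"prefix": True, "infill": False, "legacy_api": False},
}

def can_infill(provider):
    # Unknown providers default to no infill support.
    return SUPPORT.get(provider, {}).get("infill", False)
```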
That's not bad; I think we could conceivably implement this. However, I still wonder if it is necessary. You can get the same functionality by using the existing chat functionality. For example:

```elisp
(llm-chat ash/llm-openai
          (llm-make-chat-prompt
           "Return the contents of <INFILL>. When I got home that morning, I noticed something was missing. <INFILL> That's when I knew I had been visited by clowns. INFILL="))
⇒ "my collection of colorful balloons had vanished, and there was confetti scattered all over the floor."
```
I'd first want to know if it really is unworkable to use it via the chat interface this way. It doesn't seem particularly clunky to do this.
@ahyatt I actually have a chat-based implementation of infilling. It's slow, complex, heuristic-based, expensive, error-prone, and nowhere near as good as GH Copilot. I have both of them bound to different hotkeys, and I almost always use GH Copilot. And GH Copilot is shit on my internet connection and often doesn't return any results. In those cases, I just give up. The chat-based solution is just not worth it; it's easier to type the code manually.

I am not saying it's impossible to work around the problem with more prompt engineering, but it's fighting an uphill battle for no real gain. We also need strong models for the chat-based solution to work at all, while even a weak autoregressive model should still be useful with the completion API.
ClosedAI et al. are indeed not likely to keep providing us with a completion API, and their current offerings aren't that good anyway. But open models should already be good enough. Specifically, the new Mistral Codestral should be quite good. DeepSeek Coder was also pretty good when I checked it out, though I only used the chat version manually.
I have heard that CodeGemma 2B is also a good autocomplete model. This model is so small that it doesn’t even have a chat-tuned version.
I've started a `completion` branch and made a basic Ollama implementation. I've tried the CodeGemma model, but the quality wasn't very good.
I'm trying the following:

```elisp
(llm-complete-async (make-llm-ollama :completion-model "codegemma:latest")
                    "(defun fibonacci (n) "
                    (lambda (x)
                      (save-excursion
                        (goto-char (point-max))
                        (insert x)))
                    #'ignore)
```
Like I feared, the result is more chat-like than completion-like. It may be possible to fix this by choosing just the right model. Also, the completion API in Ollama has no way to add a suffix, except possibly by playing around with the `template`, which is something I don't know enough about.
Please feel free to try it out and let me know if you can get a good quality answer out of it with various ollama models.
Thanks.
Ollama's infilling API has not been merged.
I implemented Mistral's FIM API myself. Testing it, it's not too bad, but it's not very good. So I guess you're right for now. Mistral's FIM seems even worse than my hacked-together chat-based solution using GPT-4o.
My biggest problem with my hacky chat-based solution is that it doesn't return exactly the part asked for. E.g., when asking to complete:

prefix: `print("Hel`
suffix: `)`

it might return `world")`, while the correct answer is `lo World"`.

Unfortunately, the Codestral FIM API also has this problem.
I guess the Codestral API might be worth it because of cost, as they offer both pay-as-you-go and a monthly subscription for unlimited API usage (with per-minute rate limits).
I use Copilot, and I notice it also has this exact problem: it's always trying to add a closing quote or paren when none is needed. It might be useful to have a multi-step system that can ask the LLM for follow-up, and that ability is something I am working on (slowly).
I think we can implement postprocessing like https://blog.jetbrains.com/blog/2024/04/04/full-line-code-completion-in-jetbrains-ides-all-you-need-to-know/ describes. E.g., we can delete some quotes and check whether the tree-sitter parse tree is correct.
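One cheap postprocessing pass along those lines is to trim any trailing part of the completion that duplicates the start of the text after the cursor; a sketch (function name is made up):

```python
def trim_suffix_overlap(completion, doc_suffix):
    # Drop the longest trailing chunk of the completion that already
    # appears at the start of the document's suffix, so a model that
    # repeats a closing quote or paren doesn't duplicate it.
    for k in range(min(len(completion), len(doc_suffix)), 0, -1):
        if completion.endswith(doc_suffix[:k]):
            return completion[:-k]
    return completion
```

With the `print("Hel` example earlier in the thread, this trims the duplicated `)` from `world")` but leaves a non-overlapping completion untouched. It can't fix a semantically wrong completion, only the duplicated-delimiter symptom.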
Also, the new DeepSeek Coder V2 Lite supports FIM too.
Another problem with Codestral FIM that I forgot to mention: if one doesn't stop at the first newline character, the model can return a lot of stuff without paying attention to how it all fits together with the suffix. They themselves suggest using `\n\n` as the stop sequence, but that wasn't usable for me. I ended up using `\n`, which severely limits the completions. In general, their model is nowhere near as good as GH Copilot (i.e., it's just much less smart).
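When a backend doesn't support stop sequences natively, they can be applied client-side by truncating at the earliest stop match; a minimal sketch:

```python
def apply_stops(text, stops):
    # Truncate generated text at the earliest occurrence of any stop
    # sequence, e.g. stops=["\n"] to keep FIM output to a single line.
    cut = len(text)
    for stop in stops:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]
```

Server-side stops are still preferable when available, since they also stop generation early and save tokens.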
I bet Codestral is a very good model, but GH Copilot has much better context in its requests.
@s-kostyaev Unlikely IMO. I use GH Copilot through Emacs, which doesn't seem to do much magic. I also know exactly what context I am sending to Codestral, and the context is enough for a good completion, as models such as GPT-4o show. Besides, Codestral is just bad at doing FIM irrespective of the code itself; Copilot really fills in the code and (mostly) respects parentheses etc., but Codestral feels like an autoregressive model that doesn't see the suffix and hasn't been trained to offer bite-sized completions. As I said, I have to stop Codestral on newline. If I allow it to generate multiple lines, it just outputs a big chunk of junk that doesn't integrate well with the given suffix.
OpenAI wants to prop up Copilot and is not releasing FIM, but Anthropic really should. Their Sonnet 3.5 is great.
@NightMachinery have you tried DeepSeek Coder V2 Lite for FIM?
It's fast and helpful, but I haven't tried it with FIM yet. It helped me create https://github.com/s-kostyaev/reranker in one day.
@s-kostyaev No. But I just tried Sonnet 3.5 with this template, and it works great:

User: Fill in at `<COMPLETE_HERE>` in the following snippet. Only output A SINGLE CODE BLOCK without any other explanations and START your output FROM THE BEGINNING OF THE LINE MARKED FOR COMPLETION. Ensure correct indentation.

Query:

```elisp
;;; night-openai.el ---  -*- lexical-binding: t; -*-
;;;
(defvar night/openai-key (z var-get "openai_api_key"))
(defun night/openai-key-get () night/openai-key)

(defvar night/openrouter-key (z var-get "openrouter_api_key"))
(defun night/openrouter-key-get () night/openrouter-key)

(defvar night/groq<COMPLETE_HERE>
;;;
(provide 'night-openai)
;;; night-openai.el ends here
```

Answer (with the correct indentation, in a code block, started from the beginning of the line marked for completion):

```
[2024-06-26 12:57:29] [Open AI --> Emacs]:
-key (z var-get "groq_api_key"))
(defun night/groq-key-get ()
  night/groq-key)
```
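A chat-based wrapper like this needs to pull the completion back out of the model's reply; a regex-based sketch of extracting the first fenced code block (the function name is made up):

```python
import re

def extract_code_block(reply):
    # Return the contents of the first ``` fenced block, falling back
    # to the stripped reply when the model ignored the instruction.
    match = re.search(r"```[^\n]*\n(.*?)```", reply, re.DOTALL)
    return match.group(1).rstrip("\n") if match else reply.strip()
```

This is exactly the kind of heuristic glue the chat-based approach requires and the FIM APIs avoid.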
Nice. Unfortunately I can't use it from my current location.
@s-kostyaev I use it via OpenRouter, it works in any country.
https://openrouter.ai/models/anthropic/claude-3.5-sonnet:beta
On the subject of building up a reasonable context, you both may be interested in my latest branch, which I intend to integrate in soon (maybe tomorrow). You can see the documentation here:
https://github.com/ahyatt/llm/tree/prompt#advanced-prompt-creation
And the code is in the branch. This should make coding completions better by providing a framework to define prompts which can be filled with generators; in the coding case, these would be generators that add segments of the code in order of most relevant to least relevant.
Feedback is welcome!
Nice! I need to integrate it into Elisa and probably Ellama. How will token counts be calculated? I know paid API providers have this functionality, but Ollama doesn't. If it's based on word count, it won't work on code.
Also, I don't understand whether I can use this new facility to fill the top-K chunks. If I give 5 tickets to one generator, will it get 5 values from that generator if the context size isn't exceeded? If so, how will it be formatted? If not, I need this for RAG functionality, and we need to think about how to add it.
We mostly calculate token count with an approximation: 3/4 of the word count (see `llm-count-tokens`). We do this quite a bit in this algorithm, so we want something cheap and fast. For Vertex, I implemented a way to calculate based on an API call, but I think I should probably remove that, since it would be too expensive for this algorithm. Plus, it's just hard to calculate tokens ahead of time, which is one of the reasons we use only 50% of the context window by default: we know the estimate might be inaccurate. I'll mention this in the docs.
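The approximation described here is trivial to sketch; the 3/4 factor and the 50% budget are taken from the message above, while the function names are made up and are not the actual `llm-count-tokens` code:

```python
def approx_tokens(text):
    # Cheap token estimate: 3/4 of the whitespace-separated word count,
    # as described above. Known to be rough, especially on code.
    return int(len(text.split()) * 0.75)

def prompt_budget(context_window):
    # Use only half the context window by default, since the token
    # estimate may be inaccurate.
    return context_window // 2
```

As the earlier comment notes, whitespace-based estimates break down on code, where a single dense line can tokenize into many tokens.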
About filling the top-K chunks: tickets don't represent the number of drawings, but the chance of winning a particular drawing and thus getting to fill a variable slot in the template, once per drawing, over a repeated set of drawings. So if you have just 1 variable with 5 tickets, it doesn't matter how many tickets you have, because it will win every drawing and just keep drawing until the desired prompt size is reached or you run out of values. If you want to supply only 5 values, that's something you'd have to enforce on your side via the list or generators you pass in.
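The drawing scheme described above can be sketched as a weighted lottery over generators that stops at a token budget. All names here are illustrative (not the `llm-prompt` implementation), and word count stands in for a token estimate:

```python
import random

def fill_prompt(variables, budget, seed=0):
    # variables: {name: (tickets, iterable_of_values)}.
    # Each round, one variable wins a drawing with probability
    # proportional to its tickets and contributes its next value;
    # exhausted generators drop out, and filling stops at the budget.
    rng = random.Random(seed)
    gens = {name: iter(values) for name, (_, values) in variables.items()}
    tickets = {name: t for name, (t, _) in variables.items()}
    filled = {name: [] for name in variables}
    used = 0
    while gens and used < budget:
        names = list(gens)
        winner = rng.choices(names, weights=[tickets[n] for n in names])[0]
        try:
            value = next(gens[winner])
        except StopIteration:
            del gens[winner]
            continue
        filled[winner].append(value)
        used += len(value.split())  # crude token proxy
    return filled
```

With a single variable, the tickets are irrelevant and its generator simply drains until the budget is hit, matching the explanation above.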
I like these context builders. I believe they should be optimized for interactive use for now; i.e., when I manually chat with a model (e.g., on the Claude.ai website), I still need to provide some context to the chat.

Using them for automated IDE-like functionality such as FIM will probably not be worth it; the models have problems with basic syntax.
So tickets only represent probabilities. And in this implementation I would need something like `{{top1:10}} {{top2:5}} {{top3:1}} {{top4:1}}`?
Yes, tickets are just probabilities (or, more accurately, numbers that are transformed into probabilities given the context of all the other variables and their tickets). I'd need to see a complete prompt template with those variables to make sense of your intention.
If you wanted to have the top-5 results for a query, that's just one variable, and it would just be filled until the tokens exceed the limit. For example, if your prompt template was "Answer the user's query given these surrounding code segments: {{segments}}", then when filling this template you could pass a generator that yields each top-k segment in order. It can stop at 5 elements, or whatever number you want, or not stop at all, in which case the `llm-prompt` library will stop when it exceeds the token limit.
https://github.com/ollama/ollama/pull/5207 has been merged. The next Ollama release will include FIM support.
Is there a way to do so-called legacy completions? This API is lower-level than the ChatML API and is usable with base models, which can be quite strong as autocompleters.
Related: