Open NightMachinery opened 3 months ago
It's certainly a possibility, thanks for raising this. One issue is that it isn't super well supported by the commercial companies, which are focused on chat. But I'll take a look and see how feasible it is.
@ahyatt The legacy completion API also supports fill-in-the-middle completions, using the `suffix` argument to supply the text that comes after the completion.
Here is an example:
```python
from openai import OpenAI

client = OpenAI()

prompt = "def say_hello("
suffix = """):
    print('hi', name)"""

response = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt,
    suffix=suffix,
    max_tokens=10,
)
# response.choices[0].text should be "name"
```
This API is great for an autocomplete model. Unfortunately, OpenAI is the only cloud API for FIM completion that I could find. OpenRouter does not seem to support the `suffix` argument, and OpenAI doesn't support it for GPT-4.
I did some research on FIM support. Some open models also support FIM, though each uses its own prompt format to do so. E.g.:

Code Llama 2:

```
<PRE> {prompt_prefix} <SUF>{prompt_suffix} <MID>
```

DeepSeek:

```
<|fim▁begin|>def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[0]
    left = []
    right = []
<|fim▁hole|>
        if arr[i] < pivot:
            left.append(arr[i])
        else:
            right.append(arr[i])
    return quick_sort(left) + [pivot] + quick_sort(right)<|fim▁end|>
```

StarCoder:

```
<fim_prefix>before <fim_suffix> after<fim_middle>
```
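To illustrate how these templates differ, here is a minimal sketch that renders a prefix/suffix pair into each of the three formats quoted above. The function name and `style` keys are made up for illustration; only the template strings come from the model documentation.

```python
def fim_prompt(style, prefix, suffix):
    """Render a fill-in-the-middle prompt for a few known model families.

    Illustrative helper; `style` keys are not part of any real API.
    """
    if style == "codellama":
        return f"<PRE> {prefix} <SUF>{suffix} <MID>"
    if style == "deepseek":
        # DeepSeek uses U+2581 (lower one-eighth block) in its tokens.
        return f"<|fim\u2581begin|>{prefix}<|fim\u2581hole|>{suffix}<|fim\u2581end|>"
    if style == "starcoder":
        return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"
    raise ValueError(f"unknown FIM style: {style}")
```

Note that the sentinel tokens must match the model's tokenizer exactly, which is why abstracting over them per model family is fiddly.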
Right now, `gpt-3.5-turbo-instruct` still seems to be the best FIM model available. I did not look for unofficial FIM finetunes of the stronger open models such as Llama 3 or Mixtral.

I suggest adding support for `gpt-3.5-turbo-instruct`, as it is straightforward and strong. For the other models, I am unsure whether Emacs is a good place to maintain their FIM prompt templates so that the end user sees a single FIM API. Perhaps this should be done by Ollama.
Note that normal completion (without `suffix`) is supported by tons of models on OpenAI-compatible servers such as Together, Fireworks, etc. So that should be supported by this package, regardless of FIM.
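For concreteness, a completion request to such an OpenAI-compatible endpoint is just a POST body to `/v1/completions` with `model`, `prompt`, and optionally `suffix`; a sketch of building it (the helper name is made up):

```python
def build_completion_request(model, prompt, suffix=None, max_tokens=64):
    # Body for POST /v1/completions on an OpenAI-compatible server.
    # `suffix` is the FIM extension only some backends accept;
    # omitting it gives the widely supported plain prefix completion.
    body = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    if suffix is not None:
        body["suffix"] = suffix
    return body
```

Servers that don't implement `suffix` typically ignore it or reject the request, which is the per-backend variance discussed in this thread.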
I found an open issue on Ollama about abstracting the FIM API:
This is great info, thanks. Having to do per-model work is a bit tricky because in GPT4All, we don't even know what model is being run. I'm not sure how we can handle that at the moment. Barring that problem, it seems possible but messy. I think supporting normal completion seems more likely, but that's also very model-dependent, and so has the same issues as GPT4All.
@ahyatt Thanks. Why are normal completions model-dependent? Even chat-tuned models are internally run as next-token predictors with a hardcoded prompt template, so the completion API should be possible for all local models and some cloud models. The only model dependency is whether the cloud supports this API, but if it doesn't, it will return a 404 error anyway, so we could effectively treat all models as supporting the API.
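To illustrate the point that chat is just completion plus a template, here is a sketch that renders chat messages into a raw prompt, assuming a ChatML-style template (the exact tokens vary by model; this format is an assumption for illustration):

```python
def chat_to_completion_prompt(messages):
    # Render (role, content) pairs the way a ChatML-tuned model sees
    # them, then leave the assistant turn open so a raw completion
    # endpoint generates the reply. Real chat templates differ per model.
    parts = [f"<|im_start|>{role}\n{content}<|im_end|>\n"
             for role, content in messages]
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)
```

Sending the rendered string to a raw completion endpoint reproduces what the chat endpoint does internally, which is why exposing the lower-level API loses nothing.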
For FIM, the Ollama API above uses the underlying FIM API from llama.cpp, so that also might not be model-dependent. Of course, `gpt-3.5-turbo-instruct` does not run on llama.cpp, so the FIM API might need some model dependency.
Does this elisp package use GPT4All under the hood?
My understanding, perhaps flawed: some models are raw foundation models, which basically operate as completion models. But many models (most of the popular ones) are chat models, tuned to give complete responses and not designed to be used for completion. The results would be strange; I don't think you'd get an actual completion. So to use a completion API, you have to use it only with models designed for that.
And about GPT4All, yes, it's one of the options in the `llm` package, so we should, if possible, design things to support it as well. But it's the only provider we support that gives the caller of the `llm` package no control over which model is being used.
@ahyatt Yes, chat-tuned models probably aren't that good for completion, but they can be used anyway. I don't think exposing the lower-level completion API (which can be used with an appropriate template to produce the higher-level chat responses) has any downsides at all. It allows more tinkering and extensibility, if anything.
Given a sufficiently long context, these chat-tuned models might even work as good completers. The chat RLHF lobotomizes the models, but as the context grows, the model reverts to its next-token-prediction roots.
I think we'll have to see how it works in practice. I'll try to look into it in the next few weeks, thanks for bringing it up!
@ahyatt Mistral’s new 22B code model also supports infilling:
This is usable both locally and via the API.
Thanks for the information. I've been taking a look at this, and the support is pretty spotty. I think the basic problem is that although the ability to do completions is how the foundation models work, companies like OpenAI won't release the completion functionality due to the lack of trust/safety filtering on the outputs. It is available in a deprecated API, but I'm not sure that's something we want to depend on.
AFAICT, here's the current state of completion functionality:

- OpenAI: Prefix & infill completion, but with the legacy API.
- Claude: Prefix completion API, not legacy.
- Gemini: Nothing.
- Ollama: Prefix completion API, not legacy. Actual ability to do completions as completions varies by model.
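That summary is roughly the capability table a dispatch layer in the library would need; a sketch (provider keys and flag names are invented for illustration):

```python
# Completion support per provider, as summarized above.
SUPPORT = {
    "openai": {"prefix": True, "infill": True, "legacy_api": True},
    "claude": {"prefix": True, "infill": False, "legacy_api": False},
    "gemini": {"prefix": False, "infill": False, "legacy_api": False},
    "ollama": {"prefix": True, "infill": False, "legacy_api": False},
}

def can_infill(provider):
    # Unknown providers default to no infill support.
    return SUPPORT.get(provider, {}).get("infill", False)
```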
That's not bad; I think we could conceivably implement this. However, I still wonder if it is necessary. You can get the same functionality by using the existing chat functionality. For example:

```elisp
(llm-chat ash/llm-openai
          (llm-make-chat-prompt
           "Return the contents of <INFILL>. When I got home that morning, I noticed something was missing. <INFILL> That's when I knew I had been visited by clowns. INFILL="))
⇒ "my collection of colorful balloons had vanished, and there was confetti scattered all over the floor."
```
I'd first want to know if it really is unworkable to use it via the chat interface this way. It doesn't seem particularly clunky to do this.
@ahyatt I actually have a chat-based implementation of infilling. It's slow, complex, heuristic-based, expensive, error-prone, and nowhere near as good as GH Copilot. I have both of them bound to different hotkeys, and I almost always use GH Copilot. And GH Copilot is shit on my internet connection and often doesn't return any results. In those cases, I just give up. The chat-based solution is just not worth it; it's easier to type the code manually.

I am not saying it's impossible to work around the problem with more prompt engineering, but it's fighting an uphill battle for no real gain. We also need strong models for the chat-based solution to work at all, while even a weak autoregressive model should still be useful with the completion API.
ClosedAI et al. are indeed not likely to keep providing us with a completion API, and their current offerings aren't that good anyway. But open models should already be good enough. Specifically, the new Mistral Codestral should be quite good. DeepSeek Coder was also pretty good when I checked it out, though I only used the chat version manually.
I have heard that CodeGemma 2B is also a good autocomplete model. This model is so small that it doesn’t even have a chat-tuned version.
I've started a `completion` branch and made a basic Ollama implementation. I've tried the CodeGemma model, but the quality wasn't very good.
I'm trying the following:

```elisp
(llm-complete-async (make-llm-ollama :completion-model "codegemma:latest")
                    "(defun fibonacci (n) "
                    (lambda (x)
                      (save-excursion
                        (goto-char (point-max))
                        (insert x)))
                    #'ignore)
```
Like I feared, the result is more chat-like than completion-like. It may be possible to fix this by choosing just the right model. Also, the completion API in Ollama has no way to add a suffix, except possibly by playing around with the `template`, which is something I don't know enough about.
Please feel free to try it out and let me know if you can get a good quality answer out of it with various ollama models.
Thanks.
Ollama's infilling API has not been merged.
I implemented Mistral's FIM API myself. Testing it, it's not too bad, but it's not very good. So I guess you're right for now. Mistral's FIM seems even worse than my hacked-together chat-based solution using GPT-4o.
My biggest problem with my hacky chat-based solution is that it doesn't return exactly the part asked for. E.g., when asking to complete:

prefix: `print("Hel`
suffix: `)`

it might return `world")`, while the correct answer is `lo World"`.

Unfortunately, the Codestral FIM API also has this problem.
I guess the Codestral API might be worth it because of cost, as they offer both pay-as-you-go and a monthly subscription for unlimited API usage (with per-minute rate limits).
I use Copilot, and I notice it also has this exact problem: it's always trying to add a closing quote or paren when none is needed. It might be useful to have a multi-step system that can ask the LLM for follow-up, and that ability is something I am working on (slowly).
I think we can implement postprocessing like https://blog.jetbrains.com/blog/2024/04/04/full-line-code-completion-in-jetbrains-ides-all-you-need-to-know/ describes. E.g., we can delete some quotes and check whether the tree-sitter parse tree is correct.
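One cheap postprocessing pass along those lines is to trim any trailing part of the completion that duplicates the start of the text after the cursor; a sketch (function name is made up):

```python
def trim_suffix_overlap(completion, doc_suffix):
    # Drop the longest trailing chunk of the completion that already
    # appears at the start of the document's suffix, so a model that
    # repeats a closing quote or paren doesn't duplicate it.
    for k in range(min(len(completion), len(doc_suffix)), 0, -1):
        if completion.endswith(doc_suffix[:k]):
            return completion[:-k]
    return completion
```

With the `print("Hel` example earlier in the thread, this trims the duplicated `)` from `world")` but leaves a non-overlapping completion untouched. It can't fix a semantically wrong completion, only the duplicated-delimiter symptom.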
Also, the new DeepSeek Coder V2 Lite supports FIM too.
Another problem with Codestral FIM that I forgot to mention: if one doesn't stop at the first newline character, the model can return a lot of stuff without paying attention to how it all fits together with the suffix. They themselves suggest using `\n\n` as the stop sequence, but that wasn't usable for me. I ended up using `\n`, which severely limits the completions. In general, their model is nowhere near as good as GH Copilot (i.e., it's just much less smart).
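When a backend doesn't support stop sequences natively, they can be applied client-side by truncating at the earliest stop match; a minimal sketch:

```python
def apply_stops(text, stops):
    # Truncate generated text at the earliest occurrence of any stop
    # sequence, e.g. stops=["\n"] to keep FIM output to a single line.
    cut = len(text)
    for stop in stops:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]
```

Server-side stops are still preferable when available, since they also stop generation early and save tokens.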
I bet Codestral is a very good model, but GH Copilot has much better context in its requests.
@s-kostyaev Unlikely IMO. I use GH Copilot through Emacs, which doesn't seem to do much magic. I also know exactly what context I am sending to Codestral, and the context is enough for a good completion, as models such as GPT-4o show. Besides, Codestral is just bad at doing FIM irrespective of the code itself; Copilot really fills in the code and (mostly) respects parentheses etc., but Codestral feels like an autoregressive model that doesn't see the suffix and hasn't been trained to offer bite-sized completions. As I said, I have to stop Codestral on newline. If I allow it to generate multiple lines, it just outputs a big chunk of junk that doesn't integrate well with the given suffix.
OpenAI wants to prop up Copilot and is not releasing FIM, but Anthropic really should. Their Sonnet 3.5 is great.
@NightMachinery have you tried DeepSeek Coder V2 Lite for FIM?
It's fast and helpful, but I haven't tried it with FIM yet. It helped me create https://github.com/s-kostyaev/reranker in one day.
@s-kostyaev No. But I just tried Sonnet 3.5 with this template, and it works great:

User: Fill in at `<COMPLETE_HERE>` in the following snippet. Only output A SINGLE CODE BLOCK without any other explanations and START your output FROM THE BEGINNING OF THE LINE MARKED FOR COMPLETION. Ensure correct indentation.

Query:

```elisp
;;; night-openai.el ---  -*- lexical-binding: t; -*-
;;;
(defvar night/openai-key (z var-get "openai_api_key"))
(defun night/openai-key-get () night/openai-key)

(defvar night/openrouter-key (z var-get "openrouter_api_key"))
(defun night/openrouter-key-get () night/openrouter-key)

(defvar night/groq<COMPLETE_HERE>
;;;
(provide 'night-openai)
;;; night-openai.el ends here
```

Answer (with the correct indentation, in a code block, started from the beginning of the line marked for completion):

```
[2024-06-26 12:57:29] [Open AI --> Emacs]:
-key (z var-get "groq_api_key"))
(defun night/groq-key-get ()
  night/groq-key)
```
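A chat-based wrapper like this needs to pull the completion back out of the model's reply; a regex-based sketch of extracting the first fenced code block (the function name is made up):

```python
import re

def extract_code_block(reply):
    # Return the contents of the first ``` fenced block, falling back
    # to the stripped reply when the model ignored the instruction.
    match = re.search(r"```[^\n]*\n(.*?)```", reply, re.DOTALL)
    return match.group(1).rstrip("\n") if match else reply.strip()
```

This is exactly the kind of heuristic glue the chat-based approach requires and the FIM APIs avoid.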
Nice. Unfortunately I can't use it from my current location.
@s-kostyaev I use it via OpenRouter, it works in any country.
https://openrouter.ai/models/anthropic/claude-3.5-sonnet:beta
On the subject of building up a reasonable context, you both may be interested in my latest branch, which I intend to integrate in soon (maybe tomorrow). You can see the documentation here:
https://github.com/ahyatt/llm/tree/prompt#advanced-prompt-creation
And the code is in the branch. This should make coding completions better by providing a framework to define prompts which can be filled with generators; in the coding case, these would be generators that add segments of the code in order of most relevant to least relevant.
Feedback is welcome!
Nice! I need to integrate it into Elisa and probably Ellama. How will token counts be calculated? I know paid API providers have this functionality, but Ollama doesn't. If it's based on word count, it won't work on code.
Also, I don't understand whether I can use this new facility to fill the top-K chunks. If I give 5 tickets to one generator, will it get 5 values from that generator if the context size isn't exceeded? If so, how will it be formatted? If not, I need this for RAG functionality, and we need to think about how to add it.
We mostly calculate token count with an approximation: 3/4 of the word count (see `llm-count-tokens`). We do this quite a bit in this algorithm, so we want something cheap and fast. For Vertex, I implemented a way to calculate based on an API call, but I think I should probably remove that, since it would be too expensive for this algorithm. Plus, it's just hard to calculate tokens ahead of time, which is one of the reasons we use only 50% of the context window by default: we know the estimate might be inaccurate. I'll mention this in the docs.
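The approximation described here is trivial to sketch; the 3/4 factor and the 50% budget are taken from the message above, while the function names are made up and are not the actual `llm-count-tokens` code:

```python
def approx_tokens(text):
    # Cheap token estimate: 3/4 of the whitespace-separated word count,
    # as described above. Known to be rough, especially on code.
    return int(len(text.split()) * 0.75)

def prompt_budget(context_window):
    # Use only half the context window by default, since the token
    # estimate may be inaccurate.
    return context_window // 2
```

As the earlier comment notes, whitespace-based estimates break down on code, where a single dense line can tokenize into many tokens.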
About filling the top-K chunks: tickets don't represent the number of drawings, but the chance of winning a particular drawing and thus getting to fill a variable slot in the template, once per drawing, over a repeated set of drawings. So if you have just 1 variable with 5 tickets, it doesn't matter how many tickets you have, because it will win every drawing and just keep drawing until the desired prompt size is reached or you run out of values. If you want to supply only 5 values, that's something you'd have to enforce on your side via the list or generators you pass in.
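The drawing scheme described above can be sketched as a weighted lottery over generators that stops at a token budget. All names here are illustrative (not the `llm-prompt` implementation), and word count stands in for a token estimate:

```python
import random

def fill_prompt(variables, budget, seed=0):
    # variables: {name: (tickets, iterable_of_values)}.
    # Each round, one variable wins a drawing with probability
    # proportional to its tickets and contributes its next value;
    # exhausted generators drop out, and filling stops at the budget.
    rng = random.Random(seed)
    gens = {name: iter(values) for name, (_, values) in variables.items()}
    tickets = {name: t for name, (t, _) in variables.items()}
    filled = {name: [] for name in variables}
    used = 0
    while gens and used < budget:
        names = list(gens)
        winner = rng.choices(names, weights=[tickets[n] for n in names])[0]
        try:
            value = next(gens[winner])
        except StopIteration:
            del gens[winner]
            continue
        filled[winner].append(value)
        used += len(value.split())  # crude token proxy
    return filled
```

With a single variable, the tickets are irrelevant and its generator simply drains until the budget is hit, matching the explanation above.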
I like these context builders. I believe they should be optimized for interactive use for now; i.e., when I manually chat with a model (e.g., on the Claude.ai website), I still need to provide some context to the chat.

Using them for automated IDE-like functionality such as FIM will probably not be worth it; the models have problems with basic syntax.
So tickets only represent probabilities. And in this implementation I would need something like `{{top1:10}} {{top2:5}} {{top3:1}} {{top4:1}}`?
Yes, tickets are just probabilities (or, more accurately, numbers that are transformed into probabilities given the context of all the other variables and their tickets). I'd need to see a complete prompt template with those variables to make sense of your intention.
If you wanted to have the top-5 results for a query, that's just one variable, and it would just be filled until the tokens exceed the limit. For example, if your prompt template was "Answer the user's query given these surrounding code segments: {{segments}}", then when filling this template you could pass a generator that yields each top-k segment in order. It can stop at 5 elements, or whatever number you want, or not stop at all, in which case the `llm-prompt` library will stop when it exceeds the token limit.
https://github.com/ollama/ollama/pull/5207 has been merged. The next Ollama release will include FIM support.
Is there a way to do so-called legacy completions? This API is lower-level than the ChatML API and is usable with base models, which can be quite strong as autocompleters.
Related: