Unable to get completion with llama cpp

Hello, I am trying to configure lsp-ai to get copilot-like completion in helix. I intend to use only models running locally. I ideally would like to get them served following an OpenAI compatible API, Llama.cpp provides this and so does mlx_lm server that I would like to use.

The problem is that I end up with a 400 error from the server that seems to receive a request under a wrong format. This happens both with llama cpp and mlx server so it doesn't seem to be server-related, but a problem with lsp-ai. Is there a way to monitor the requests sent back/forth?

There's a second option that I have tried, which is to use the direct llama_cpp feature from lsp-ai (how does it work? does it spawn it's own separate instance of the server? What if we have one running already on the same ports, what about memory usage by an additional server if the models are big, compared to just linking to a running one through the openAI-compatible api?) Using the internal llama_cpp feature, it seems to send requests properly, at least the helix logs don't show any error, but there is no completion displayed at all. Instead, here is what I am getting: (example on the right, note that the configuration shows on the left) ai - text does nothing when selected. Has anybody got any luck with this kind of configuration?

Edit Oct 12th I have tried with Ollama following this discussion and got the same issue as with llama.cpp. Besides, I tried to use visual studio with a similar configuration and got the same behavior. It seems to be the way the request is sent that is problematic as it doesn't seem to be built for completion (in that there is not really any next word to predict for the given prompt) but for chat completion, although it follows the completion standard format. For example: enabling verbose mode on llama.cpp, I can see the request sent being:

request: POST /completions 127.0.0.1 200 request: {"echo":false,"frequency_penalty":0.0,"max_tokens":64,"model":"Qwen2.5-Coder-7B-Instruct-Q8_0","n":1,"presence_penalty":0.0,"prompt":"\n\ndef quicksort(arr):\n if len(arr) <= 1:\n return arr\n pivot = arr[len\n\n \n middle = [x for x in arr if x == pivot]\n right = [x for x in arr if x > pivot]\n return quicksort(left) + middle + quicksort(right)\n","temperature":0.10000000149011612,"top_p":0.949999988079071}

and the response (note the empty content)

response: {"content":"","id_slot":0,"stop":true,"model":"../models/gguf/Qwen2.5-Coder-7B-Instruct-Q8_0.gguf","tokens_predicted":1,"tokens_evaluated":77,"generation_settings":{"n_ctx":4096,"n_predict":-1,"model":"../models/gguf/Qwen2.5-Coder-7B-Instruct-Q8_0.gguf","seed":4294967295,"seed_cur":979658403,"temperature":0.10000000149011612,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.949999988079071,"min_p":0.05000000074505806,"tfs_z":1.0,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"penalize_nl":false,"stop":[],"max_tokens":64,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":false,"n_probs":0,"min_keep":0,"grammar":"","samplers":["top_k","tfs_z","typ_p","top_p","min_p","temperature"]},"prompt":"\n\ndef quicksort(arr):\n if len(arr) <= 1:\n return arr\n pivot = arr[len\n\n \n middle = [x for x in arr if x == pivot]\n right = [x for x in arr if x > pivot]\n return quicksort(left) + middle + quicksort(right)\n","truncated":false,"stopped_eos":true,"stopped_word":false,"stopped_limit":false,"stopping_word":"","tokens_cached":77,"timings":{"prompt_n":77,"prompt_ms":378.821,"prompt_per_token_ms":4.919753246753247,"prompt_per_second":203.2622267508929,"predicted_n":1,"predicted_ms":0.009,"predicted_per_token_ms":0.009,"predicted_per_second":111111.11111111112},"index":0}

The same request also returns an empty response content with curl, while, if sending a request (with the same prompt) under the chat completion format with curl:

request: POST /v1/chat/completions 127.0.0.1 200 request: { "messages": [{"role": "user", "content": "\n\ndef quicksort(arr):\n if len(arr) <= 1:\n return arr\n pivot = arr[len\n\n \n middle = [x for x in arr if x == pivot]\n right = [x for x in arr if x > pivot]\n return quicksort(left) + middle + quicksort(right)\n"}], "temperature": 0.7 }

The response seems to be spot on:

response: {"choices":[{"finish_reason":"stop","index":0,"message":{"content":"python\ndef quicksort(arr):\n if len(arr) <= 1:\n return arr\n pivot = arr[len(arr) // 2] # Choose the middle element as the pivot\n left = [x for x in arr if x < pivot] # Elements less than the pivot\n middle = [x for x in arr if x == pivot] # Elements equal to the pivot\n right = [x for x in arr if x > pivot] # Elements greater than the pivot\n return quicksort(left) + middle + quicksort(right)\n","role":"assistant"}}],"created":1728736275,"model":"gpt-3.5-turbo-0613","object":"chat.completion","usage":{"completion_tokens":123,"prompt_tokens":96,"total_tokens":219},"id":"chatcmpl-QXL7OiScMeo2SyWwAMdjR5DrWTXcqGBg","__verbose":{"content":"python\ndef quicksort(arr):\n if len(arr) <= 1:\n return arr\n pivot = arr[len(arr) // 2] # Choose the middle element as the pivot\n left = [x for x in arr if x < pivot] # Elements less than the pivot\n middle = [x for x in arr if x == pivot] # Elements equal to the pivot\n right = [x for x in arr if x > pivot] # Elements greater than the pivot\n return quicksort(left) + middle + quicksort(right)\n","id_slot":0,"stop":true,"model":"gpt-3.5-turbo-0613","tokens_predicted":123,"tokens_evaluated":96,"generation_settings":{"n_ctx":4096,"n_predict":-1,"model":"../models/gguf/Qwen2.5-Coder-7B-Instruct-Q8_0.gguf","seed":4294967295,"seed_cur":2759634986,"temperature":0.699999988079071,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.949999988079071,"min_p":0.05000000074505806,"tfs_z":1.0,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"penalize_nl":false,"stop":[],"max_tokens":-1,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":false,"n_probs":0,"min_keep":0,"grammar":"","samplers":["top_k","tfs_z","typ_p","top_p","min_p","temperature"]},"prompt":"<|im_start|>user\n\n\ndef quicksort(arr):\n if len(arr) <= 1:\n return arr\n pivot = arr[len\n\n \n middle = [x for x in arr if x == pivot]\n right = [x for x in arr if x > pivot]\n return quicksort(left) + middle + quicksort(right)\n<|im_end|>\n<|im_start|>assistant\n","truncated":false,"stopped_eos":true,"stopped_word":false,"stopped_limit":false,"stopping_word":"","tokens_cached":218,"timings":{"prompt_n":96,"prompt_ms":382.361,"prompt_per_token_ms":3.9829270833333332,"prompt_per_second":251.07163125946423,"predicted_n":123,"predicted_ms":3248.555,"predicted_per_token_ms":26.4110162601626,"predicted_per_second":37.86298831326543},"index":0,"oaicompat_token_ctr":123}} srv addwaiting: add task 142 to waiting list. current waiting = 0 (before add)

It could explain why it also fails with mlx and other OpenAI compatible API servers that all follow the same format. Haven't been able to investigate why the direct llama.cpp type fails too, since I cannot control it's logs and launch it in verbose mode, but I suspect the issue to be the same.

SilasMarvin / lsp-ai

Unable to get completion with llama cpp #81