ggerganov / llama.cpp

server : add "token healing" support #5765

CyberShadow opened this issue 4 months ago

CyberShadow commented 4 months ago

Feature Description

Hi! I am experimenting with using llama.cpp as a general-purpose code completion backend, similar to TabNine.

I am encountering a small problem: if the completion prompt ends mid-word, the results are not very accurate. For example, for a prompt such as `Five, Four, Thre`, the model will often ignore the unfinished word and suggest `, Two` (forming `Thre, Two`).

I think the following behavior would be useful as an option to the `/completion` server API:

  1. Tokenize the text
  2. Chop off the last token
  3. Run the prediction with the remaining tokens, but for the first predicted token only consider candidates whose bytes start with the bytes of the removed token (see the sketch below).
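
For illustration, here is a rough C++ sketch of steps 1 and 2 plus the bookkeeping needed for step 3 (`tokenize` and `token_text` are hypothetical stand-ins for the real tokenizer calls, not existing llama.cpp functions):

```cpp
// Sketch of the prompt-side preparation for token healing:
// tokenize the prompt, drop the last token, and remember its bytes so the
// first generated token can later be constrained to continue them.
#include <string>
#include <vector>

std::vector<int> tokenize(const std::string & text);  // assumed helper
std::string      token_text(int token_id);            // assumed helper

struct HealedPrompt {
    std::vector<int> tokens;  // prompt tokens with the last one removed
    std::string      prefix;  // bytes the first generated token must continue
};

HealedPrompt prepare_token_healing(const std::string & prompt) {
    HealedPrompt out;
    out.tokens = tokenize(prompt);
    if (!out.tokens.empty()) {
        out.prefix = token_text(out.tokens.back());  // e.g. " Thre"
        out.tokens.pop_back();
    }
    return out;
}
```

The server would then evaluate `tokens` as the prompt and use `prefix` to constrain the first sampled token, as described in step 3.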

Thanks!

stduhpf commented 4 months ago

The usual name for this feature is "token healing". I agree that it would be nice to have it supported here.

ilyannn commented 4 months ago

@ggerganov I'd like to try working on it as my first issue!

ggerganov commented 4 months ago

Ok. This can be demonstrated in one of the examples. One way would be to add it to `main` or `simple` and extend `llama_sampling_sample` with the necessary functionality.
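
For reference, a minimal self-contained sketch of the constraint itself, independent of the actual `llama_sampling_sample` internals (the `Candidate` struct and function names are illustrative, not part of the llama.cpp API): before the usual samplers run for the first generated token, push the logits of candidates that cannot continue the removed prefix to negative infinity.

```cpp
// Illustrative only: mask out candidates that cannot continue the bytes of
// the token that was chopped off the end of the prompt.
#include <limits>
#include <string>
#include <vector>

struct Candidate {
    int         id;
    float       logit;
    std::string piece;  // decoded bytes of the token
};

static bool starts_with(const std::string & s, const std::string & prefix) {
    return s.compare(0, prefix.size(), prefix) == 0;
}

// Applied only while healing_prefix is non-empty, i.e. for the first
// generated token; afterwards the prefix is cleared and sampling proceeds
// as usual.
void apply_token_healing_mask(std::vector<Candidate> & candidates,
                              const std::string & healing_prefix) {
    if (healing_prefix.empty()) {
        return;
    }
    for (auto & c : candidates) {
        const bool ok = starts_with(c.piece, healing_prefix) ||  // covers the prefix
                        starts_with(healing_prefix, c.piece);    // partial continuation
        if (!ok) {
            c.logit = -std::numeric_limits<float>::infinity();
        }
    }
}
```

The second `starts_with` case keeps tokens that are themselves a prefix of the removed bytes, which allows the healed text to span multiple generated tokens if needed.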

mare5x commented 2 months ago

Hi @ilyannn, do you still want to work on this? I've created a draft PR (#7028) that demonstrates token healing, but I still haven't added it to main or server. We can collaborate on that, if you'd like.

ilyannn commented 2 months ago

@mare5x Sorry, I have not actually started so please don't wait for me. I'll try to take a look at your PR this week though and will be happy to help in any way I can.