TabbyML / tabby

Self-hosted AI coding assistant
https://tabby.tabbyml.com/

feat: Add max_decoding_tokens as an optional field in CompletionRequest #2412

Closed: moqimoqidea closed this 3 months ago

moqimoqidea commented 3 months ago

Allows users to control the maximum number of tokens to decode, thereby controlling the length of the model output. If the user does not provide this field, the default value of 64 is used.
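A minimal sketch of what such an optional field could look like. This is illustrative only: apart from `max_decoding_tokens`, the field names and the plain struct (rather than Tabby's actual serde-annotated request type) are assumptions.

```rust
// Illustrative sketch, not Tabby's actual CompletionRequest definition.
struct CompletionRequest {
    // Other request fields are elided; `language` is a placeholder.
    language: Option<String>,
    // Optional cap on the number of tokens to decode.
    max_decoding_tokens: Option<usize>,
}

impl CompletionRequest {
    // Resolve the effective decoding budget, falling back to the
    // default of 64 when the client omitted the field.
    fn effective_max_decoding_tokens(&self) -> usize {
        self.max_decoding_tokens.unwrap_or(64)
    }
}
```

Because the field is optional with a server-side default, existing clients that never send it would keep the current behavior.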

wsxiaoys commented 3 months ago

Since we do not have a way to control the number of tokens in the extension, how do you plan to fill this field?

moqimoqidea commented 3 months ago

> Since we do not have a way to control the number of tokens in the extension, how do you plan to fill this field?

In my use case, I've extended the agent with `max_decoding_tokens` logic, and the agent also applies other checks to decide whether the user wants a complete method completion, for example by detecting a method comment in a given language. Perhaps the first step could be for the server side to support this parameter? That would make the code completion service more flexible and adaptable, letting it cater to different user needs and scenarios.
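As a concrete illustration of the kind of client-side heuristic described above, here is a hypothetical sketch (an assumption, not the actual agent code): request a larger token budget when the line above the cursor looks like a doc comment, since the user likely wants a full method body. The thresholds 256 and 64 are made up for the example.

```rust
// Hypothetical client-side heuristic, not Tabby's agent implementation.
// Picks a larger decoding budget when the previous line is a doc comment.
fn choose_max_decoding_tokens(prev_line: &str) -> usize {
    let trimmed = prev_line.trim_start();
    if trimmed.starts_with("///")       // Rust doc comment
        || trimmed.starts_with("/**")   // Javadoc/JSDoc-style comment
        || trimmed.starts_with("\"\"\"") // Python docstring
    {
        256 // larger budget: likely a full-method completion
    } else {
        64 // default budget for inline completion
    }
}
```

Only requests matching the heuristic would carry the larger budget; everything else would stay at the default.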

wsxiaoys commented 3 months ago

Gotcha. In this case, I'd prefer to:

  1. introduce an environment variable TABBY_EXPERIMENTAL_MAX_CODE_COMPLETION_LENGTH, defaulting to 64;
  2. use it as max_decoding_length in the route.

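The proposed server-side approach could be sketched as follows. The variable name is the one proposed above; the parsing and fallback details are assumptions, not Tabby's implementation.

```rust
use std::env;

// Sketch: read the experimental environment variable and fall back to
// the default of 64 when it is unset or not a valid number.
fn max_decoding_tokens_from_env() -> usize {
    resolve_max_decoding_tokens(
        env::var("TABBY_EXPERIMENTAL_MAX_CODE_COMPLETION_LENGTH")
            .ok()
            .as_deref(),
    )
}

// Split out so the fallback logic is easy to exercise without
// mutating process-wide environment state.
fn resolve_max_decoding_tokens(raw: Option<&str>) -> usize {
    raw.and_then(|v| v.parse().ok()).unwrap_or(64)
}
```

Reading the value once at route level keeps the knob entirely server-side, with no change to the v1 request schema.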
moqimoqidea commented 3 months ago

Regarding the first solution, I understand this is purely server-side configuration, which means it cannot account for dynamic strategies on the client side. If there is indeed no need for a dynamic max_decoding_tokens for the time being, I believe it can meet the requirements.

As for the second solution, I don't quite understand how it works operationally. Could you explain the difference between the route and the current implementation?

wsxiaoys commented 3 months ago

The major thing is that I'm still hesitant about whether we want to move the decision about max_decoding_tokens to the client side. Adding it to CompletionRequest would make it part of the v1 API, leaving us the burden of maintaining it in the future.

Making it a static environment variable is much lighter from a maintenance point of view.

moqimoqidea commented 3 months ago

I completely understand your point.

Indeed, static environment variables are the easiest to maintain.

If we decide to shift this to the client side, the benefit would be enabling strategies that some users can take advantage of. As in the example I gave, when a user writes a method comment, it usually means they expect a more comprehensive completion. If such strategies are few or cover only a small fraction of user cases, the change wouldn't be worthwhile from any perspective; conversely, if they cover enough cases, it can be considered. Data could therefore support the decision, for example by identifying user cases or measuring coverage rates.

This is not urgent and can be decided after thorough discussion.

moqimoqidea commented 3 months ago

One more point I would like to add:

Another factor influencing the decision is whether users incur additional GPU cost when the value is configured solely on the server side. For example, if I have a very good GPU and set this parameter to 1024, every request from the agent will use that max_decoding_tokens value.

If the agent configures this parameter instead, a good strategy might let only 10% of requests carry a max_decoding_tokens of 1024. In some cases, users could then increase the parallelism by one when starting the Tabby server to improve service concurrency.

I'm not sure if I understand parallelism correctly this way. :)