What client are you using?
@aarnphm I am sending a POST request to the /v1/generate endpoint.
What do you expect the result to be? I don't think we support eos_token_id yet.
Not really a bug I would say.
/v1/generate will just return the JSON. I think for both PyTorch and vLLM we can support parsing eos_token_id.
Btw, /v1/generate doesn't use any default prompt template. Users have full control over how the prompt looks.
Right now, the generation logic is relatively heuristic. We support greedy decoding atm.
Can you send me what you did for model.generate?
Our strategy right now for PyTorch is similar to greedy decoding.
In the Huggingface implementation, it looks like this:

```python
# model and tokenizer are Huggingface Transformers objects loaded elsewhere
tokenized = tokenizer(
    input_query,
    return_tensors="pt",
    return_token_type_ids=False,
).to("cuda:0")
response = model.generate(
    **tokenized,
    min_new_tokens=20,
    max_new_tokens=300,
    top_p=0.98,
    temperature=0.9,
    eos_token_id=6,  # custom eos token of the fine-tuned model
)
# Strip the prompt from the decoded output and trim surrounding '#'
return tokenizer.decode(response[0])[len(input_query):].strip().strip("#")
```
It is very interesting. I wanted the output to be similar to the one from the above script. For instance, the model was fine-tuned for title generation, and with the above script the response is concise and good even with do_sample set to True. But with OpenLLM, I get a very long response, and decreasing max_new_tokens produces a response that is not good. For other generation tasks like body generation, the response quality is not the same as with Huggingface; the response includes some special characters from the input prompt, like '*#', which deteriorates it.
model.generate here is a very magical function; it does a lot of different generation strategies under the hood. I think by default it should be similar.
I will think about this a bit more, since vLLM seems to also support beam search now. With regard to generation in the OpenLLM PyTorch backend, we just iterate until max_new_tokens is reached; optionally, when stop is passed, we check the detokenized string against it to stop early.
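(Not from the thread, but to picture that loop: here is a minimal sketch of greedy decoding with stop-string checking, assuming a Huggingface-style causal LM. The function name and structure are illustrative, not OpenLLM's actual runner code.)

```python
import torch

def greedy_generate(model, tokenizer, prompt, max_new_tokens=128, stop=None):
    """Greedy decoding that optionally halts when a stop string appears.

    Illustrative sketch only; the real OpenLLM PyTorch runner differs in detail.
    """
    stop = stop or []
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    generated = input_ids

    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(generated).logits
        # Greedy: pick the single most likely next token.
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=-1)

        # Detokenize the continuation and check every stop string against it.
        text = tokenizer.decode(generated[0, input_ids.shape[-1]:])
        if any(s in text for s in stop):
            # Trim the stop string itself before returning.
            for s in stop:
                text = text.split(s)[0]
            return text
    return tokenizer.decode(generated[0, input_ids.shape[-1]:])
```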
I think I can add eos_token_id support back to generation for the OpenLLM PyTorch backend.
I see! Thank you @aarnphm for your comments. I didn't know that eos_token_id parsing was not supported. After adding the corresponding token to stop, it performs well and the output is similar to the one from model.generate. Thank you for your help! :)
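(For readers hitting the same issue: the fix amounts to sending the eos token's text as a stop sequence. The request below is a hypothetical sketch; the exact /v1/generate JSON schema, the placement of stop, the server address, and the model id vary by OpenLLM version and deployment, so treat them all as assumptions.)

```python
import requests
from transformers import AutoTokenizer

# Hypothetical values: adjust the model id, server address, and field names
# to match your deployment and OpenLLM version.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/polyglot-ko-1.3b")
eos_text = tokenizer.decode([6])  # text form of the custom eos token (id 6 above)

payload = {
    "prompt": "...your fully formatted prompt...",  # sent verbatim, no template applied
    "llm_config": {
        "min_new_tokens": 20,
        "max_new_tokens": 300,
        "top_p": 0.98,
        "temperature": 0.9,
    },
    "stop": [eos_text],  # generation halts once this string appears in the output
}

resp = requests.post("http://localhost:3000/v1/generate", json=payload)
print(resp.json())
```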
Btw, I added support for eos_token_id in https://github.com/bentoml/OpenLLM/pull/714. cc @bibekyess
Hello, I am using OpenLLM to serve Korean Polyglot models. I want to use the hot-swapping feature of OpenLLM so that I can load multiple LoRA adapters based on the request, but I am facing an output-quality issue. Specifically, the output I get from Huggingface Transformers and the output from OpenLLM are different. From your experience, can you tell what the issue might be? I changed a lot of things inside the openllm package, and I think the issue may be with custom template usage. I also went to configuration_gpt_neox.py and changed it to not use the default template, but the output quality is still not good. Can you give some information on how to use a custom template, if that is the reason for the different output?

Note: I converted the prompts/outputs to English for easier understanding. I want to use a prompt template like this:
The input looks like this:
The output from Huggingface-Transformers:
The output from OpenLLM:
My command to start the server is like this:
I checked multiple inputs on other adapters too, and in all cases the outputs differ from the vanilla Huggingface implementation.
I would really appreciate your help. Thank you! :)