bentoml / OpenLLM

Run any open-source LLMs, such as Llama and Mistral, as an OpenAI-compatible API endpoint in the cloud.
https://bentoml.com
Apache License 2.0

Output from OpenLLM is different from HuggingFace Transformers #710

Closed bibekyess closed 1 year ago

bibekyess commented 1 year ago

Hello, I am using OpenLLM to serve Korean Polyglot models. I want to use OpenLLM's hot-swapping feature so that I can load multiple LoRA adapters based on the request, but I am running into an output-quality issue. Specifically, the output I get from Hugging Face Transformers and the output from OpenLLM are different. From your experience, can you tell me what the issue might be? I changed a lot of things inside the openllm package, and I think the issue may be with custom template usage. I also went into configuration_gpt_neox.py and changed it to not use the default template, but the output quality is still not good. If this is the reason for the different output, can you give me some information on how to use a custom template?

Note: I translated the prompts/outputs to English for easier understanding. I want to use a prompt template like this:

    query_format = (
        "###The following are the writing instructions required when writing a report. "
        "Refer to these instructions and write an appropriate report title. "
        "The title of the policy report is concise and clear, as is typical of a typical policy report."
        f"###Policy report writing instructions: {report_description}"
        "###Title of policy report:"
    )
    prompt = query_format.replace("\n", "_")

The input looks like this:

{
  "prompt": "###The following are instructions required when writing a report. Use these instructions to write an appropriate report title. The title of the policy report is concise and clear, as is typical of general policy reports. ###Instructions for writing policy reports: As the era of the Fourth Industrial Revolution arrives, the role and importance of intellectual property will be further emphasized. Accordingly, countries around the world are implementing policies to strengthen intellectual property protection to protect their own technology and promote innovation. In order to protect domestic intellectual property, prepare a policy report analyzing the status and level of domestic and foreign intellectual property protection and recommending the government's policy direction to strengthen technology protection for small and medium-sized enterprises and venture companies. ###Title of policy report:'",
  "stop": [
    "<|endoftext|>"
  ],
  "llm_config": {"min_new_tokens": 20, "max_new_tokens": 300, "top_p": 0.98, "temperature": 0.9, "eos_token_id": 6},
  "adapter_name": "default"
}

The output from Huggingface-Transformers:

Analysis and recommendations of policy directions for technology protection of small and medium-sized enterprises and venture companies in the era of the 4th Industrial Revolution: Analysis of the status and level of domestic and international intellectual property protection

The output from OpenLLM:

"text": "Government policy direction to strengthen technology protection for small and medium-sized enterprises and venture companies in the era of the 4th Industrial Revolution Based on this, write a policy report suggesting the government's policy direction. ###Instructions for writing a policy report: Analyze the current status and level of intellectual property protection at home and abroad, and what is needed to strengthen technology protection for small and medium-sized businesses and venture companies. Please suggest the government's policy direction.###Main text of the policy report: With the advent of the 4th Industrial Revolution, the importance of intellectual property is becoming more prominent. Major countries such as the United States, Japan, and Europe are taking advantage of this trend and actively pursuing policies to protect intellectual property. Protecting intellectual property is also an important issue for technology protection and innovative growth of domestic small and medium-sized venture companies. Accordingly, prepare a policy report analyzing the current status and level of intellectual property protection at home and abroad and recommending the government's policy direction to strengthen technology protection for small and medium-sized enterprises and venture companies. ###Detailed instructions for writing a policy report: Domestic and foreign knowledge Analyze the status and level of property protection.###Detailed instructions for writing a policy report: Analyze technology protection for small and medium-sized venture companies and resulting technology leak cases.#",

My command to start the server is like this:

openllm start beomi/polyglot-ko-12.8b-safetensors --workers-per-resource 1 --adapter-id path1:default --adapter-id path2:name2 --backend pt

I checked multiple inputs on the other adapters too, and in all cases the outputs differ from the vanilla Hugging Face implementation.

I would really appreciate your help. Thank you! :)

aarnphm commented 1 year ago

What client are you using?

bibekyess commented 1 year ago

@aarnphm I am sending a POST request to the /v1/generate endpoint.
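
Roughly like this, as a minimal sketch of what I send (the host/port assume a local server on the default port 3000, and the body matches the example above):

    import requests

    # Sketch of the request I send; `prompt` is built with the template shown
    # earlier. Host/port are assumed (local server on the default port 3000).
    payload = {
        "prompt": prompt,
        "stop": ["<|endoftext|>"],
        "llm_config": {
            "min_new_tokens": 20,
            "max_new_tokens": 300,
            "top_p": 0.98,
            "temperature": 0.9,
            "eos_token_id": 6,
        },
        "adapter_name": "default",
    }

    response = requests.post("http://localhost:3000/v1/generate", json=payload)
    print(response.json())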

aarnphm commented 1 year ago

What do you expect the result to be? I don't think we support eos_token_id yet.

Not really a bug I would say.

/v1/generate will just return the JSON. I think we can support parsing eos_token_id for both the PyTorch and vLLM backends.

Btw, /v1/generate doesn't apply any default prompt template. Users have full control over how the prompt looks.

Right now, the generation logic is relatively heuristic. We support greedy decoding atm.

aarnphm commented 1 year ago

Can you send me what you did for model.generate?

Our strategy right now for the PyTorch backend is similar to greedy decoding.

bibekyess commented 1 year ago

In the Hugging Face implementation, it looks like this:

    def generate_title(model, tokenizer, input_query):
        # Tokenize the prompt and move it to the GPU
        tokenized = tokenizer(
            input_query,
            return_tensors="pt",
            return_token_type_ids=False,
        ).to("cuda:0")
        # Generate with the same settings I pass to OpenLLM
        response = model.generate(
            **tokenized,
            min_new_tokens=20,
            max_new_tokens=300,
            top_p=0.98,
            temperature=0.9,
            eos_token_id=6,
        )
        # Decode, drop the echoed prompt, and strip surrounding '#' markers
        return tokenizer.decode(response[0])[len(input_query):].strip().strip("#")

It is very interesting. I wanted the output to be similar to the one from the script above. For instance, the model was fine-tuned for title generation, and with the script above the response is concise and good, even with do_sample set to True. But with OpenLLM I get a very long response, and decreasing max_new_tokens produces a response that is not good either. For other generation tasks, like body generation, the response quality is not the same as with Hugging Face: it includes some special characters from the given input prompts, like '*#', which deteriorates the response.

aarnphm commented 1 year ago

model.generate here is a very magical function; it applies a lot of different generation strategies under the hood. I think by default it should be similar.

I will think about this a bit more, since vLLM seems to also support beam search now. Usually, for generation with the OpenLLM PyTorch backend, we just iterate until max_new_tokens is reached; optionally, when stop is passed, we check the detokenized string against it to stop early.

https://github.com/bentoml/OpenLLM/blob/e6b9a749a40d12a9d45fbe21ccbc774e31c2b201/openllm-python/src/openllm/_runners.py#L335
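
Very roughly, the idea is something like this (just a simplified sketch for illustration, not the actual code linked above):

    import torch

    def greedy_generate(model, tokenizer, prompt, max_new_tokens, stop=None):
        # Simplified illustration: pick the argmax token at each step and stop
        # early once any of the stop strings appears in the detokenized output.
        input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
        generated = input_ids
        text = ""
        for _ in range(max_new_tokens):
            with torch.no_grad():
                logits = model(generated).logits
            next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
            generated = torch.cat([generated, next_token], dim=-1)
            text = tokenizer.decode(generated[0][input_ids.shape[-1]:])
            if stop and any(s in text for s in stop):
                break
        return text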

I think I can add eos_token_id support back to generation for the OpenLLM PyTorch backend.

bibekyess commented 1 year ago

I see! Thank you @aarnphm for your comments. I didn't know that eos_token_id parsing was not supported. After adding the corresponding token to stop, it performs well and the output is similar to the one from model.generate. Thank you for your help! :)
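
For anyone hitting the same thing, what worked for me was decoding the token id I was passing as eos_token_id and adding that string to stop, roughly like this (just a sketch; the model path is taken from my start command above):

    from transformers import AutoTokenizer

    # Decode the token id I was passing as eos_token_id (6) and send it via
    # `stop` instead, since eos_token_id was not parsed by /v1/generate yet
    tokenizer = AutoTokenizer.from_pretrained("beomi/polyglot-ko-12.8b-safetensors")
    eos_text = tokenizer.decode([6])

    payload = {
        "prompt": prompt,
        "stop": ["<|endoftext|>", eos_text],
        "llm_config": {"min_new_tokens": 20, "max_new_tokens": 300, "top_p": 0.98, "temperature": 0.9},
        "adapter_name": "default",
    }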

aarnphm commented 1 year ago

Btw, I added support for eos_token_id in https://github.com/bentoml/OpenLLM/pull/714 cc @bibekyess