fe1ixxu / ALMA

State-of-the-art LLM-based translation models.

Incomplete Translation from English to Chinese although `max_tokens` is enough #24

Closed. DeyangKong closed this issue 7 months ago.

DeyangKong commented 8 months ago

I set up this model and I am running it as a server using vLLM. Here is the command:

python -m vllm.entrypoints.openai.api_server --model "/root/autodl-tmp/kdy/models/ALMA-13B-R" --served-model-name "ALMA" --tensor-parallel-size 2 --port 8000
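As a quick sanity check (a minimal sketch, assuming the default OpenAI-compatible routes), the served model name can be listed before sending any translation requests:

import requests

# List the models exposed by the vLLM OpenAI-compatible server; the response
# should contain an entry whose id is "ALMA", matching --served-model-name.
print(requests.get("http://localhost:8000/v1/models").json())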

Then I send a request to the server, using beam search. Here is my code:

import requests
import json

url = "http://localhost:8000/v1/completions"

sentence = "The medical conditions that are targeted by the ongoing clinical trials of peptide-based drugs reflect the combined immunostimulatory and regulatory properties of this class of compound. These conditions include bacterial infections, such as infections with antibiotic-resistant pathogens, and inflammatory disorders, such as endotoxaemia and sepsis."

data = {
    "model": "ALMA",
    "prompt": "Translate this from English to Chinese:\nEnglish: " + sentence + "\nChinese:",
    "use_beam_search": True,
    "best_of": 5,
    "temperature": 0,
    "top_p": 1,
    "max_tokens": 2048,
}

response = requests.post(url, headers={"Content-Type": "application/json"}, json=data)
print(json.loads(response.text)["choices"][0]["text"])

The output is as follows, and it is incomplete, even though max_tokens is 2048, which is definitely enough.

(kdy) root@autodl-container-a814498486-f8ca8ba7:~# /root/miniconda3/envs/kdy/bin/python /root/autodl-tmp/kdy/translate_llm/use_request.py 目前正在进行的肽类药物

fe1ixxu commented 8 months ago

Thanks for testing our model!

I tested the model on my local machine with the following code:

import torch
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

# Load base model and LoRA weights
model = AutoModelForCausalLM.from_pretrained("haoranxu/ALMA-13B-R", torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("haoranxu/ALMA-13B-R", padding_side='left')

# Add the source sentence into the prompt template
prompt="Translate this from English to Chinese:\nEnglish: The medical conditions that are targeted by the ongoing clinical trials of peptide-based drugs reflect the combined immunostimulatory and regulatory properties of this class of compound. These conditions include bacterial infections, such as infections with antibiotic-resistant pathogens, and inflammatory disorders, such as endotoxaemia and sepsis.\nChinese:"
input_ids = tokenizer(prompt, return_tensors="pt", padding=True, max_length=256, truncation=True).input_ids.cuda()

# Translation
with torch.no_grad():
    generated_ids = model.generate(input_ids=input_ids, num_beams=5, max_new_tokens=400, do_sample=True, temperature=0.6, top_p=0.9)
outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(outputs)

I got the complete translation output:

['Translate this from English to Chinese:\nEnglish: The medical conditions that are targeted by the ongoing clinical trials of peptide-based drugs reflect the combined immunostimulatory and regulatory properties of this class of compound. These conditions include bacterial infections, such as infections with antibiotic-resistant pathogens, and inflammatory disorders, such as endotoxaemia and sepsis.\nChinese: 目前正在进行的肽类药物临床试验针对的疾病类型反映了这类化合物具有联合免疫调节和抑制作用的特点。这些疾病包括抗生素耐药性细菌感染、炎症性疾病如细菌毒素血症和败血症等。']

Is there any possibility that the bug is on the vLLM side? May I know which version of ALMA you used? I only see model="ALMA". (FYI, I am not familiar with vLLM. Happy to hear if you have any insights.)

Another difference is that our generation configs differ (mine uses temperature=0.6 and top_p=0.9), but I don't think that is the problem.
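One way to rule out the config difference is to resend the request with the same sampling settings and without beam search; a minimal sketch, assuming the server accepts the standard completion parameters:

import requests

# Sketch: same endpoint as the first post, but with the generation settings
# used in the transformers test (temperature=0.6, top_p=0.9) and no beam
# search, to see whether the truncation is tied to beam search.
url = "http://localhost:8000/v1/completions"
sentence = "The medical conditions that are targeted by ..."  # same source sentence as above
data = {
    "model": "ALMA",
    "prompt": "Translate this from English to Chinese:\nEnglish: " + sentence + "\nChinese:",
    "temperature": 0.6,
    "top_p": 0.9,
    "max_tokens": 2048,
}
print(requests.post(url, json=data).json()["choices"][0]["text"])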

DeyangKong commented 8 months ago

Thank you very much. The version of the model I used is ALMA-13B-R. I tested the model with your code and it gave a complete output, the same as yours, so the bug comes from vLLM. I then looked into it and found that serving the model with vLLM can lower its inference quality, which is the source of the problem.
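To help narrow this down, the same prompt can also be run through vLLM's offline LLM API and compared against the transformers output; a minimal sketch, assuming a vLLM version that provides LLM and SamplingParams:

from vllm import LLM, SamplingParams

# Sketch: generate the translation with vLLM's offline API, using the same
# sampling settings as the transformers test, and compare the two outputs.
llm = LLM(model="/root/autodl-tmp/kdy/models/ALMA-13B-R", tensor_parallel_size=2)
prompt = "Translate this from English to Chinese:\nEnglish: ... \nChinese:"  # same prompt as above
params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=2048)
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)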

However, inference with transformers and the model.generate method is too slow, so I have to use a toolkit or framework for deploying and serving large language models to speed it up, such as vLLM or TGI. Do you have any suggestions? I would appreciate any help you could give me.
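For reference, one transformers-only speed-up is to batch several sentences into a single generate call instead of translating one sentence at a time; a rough sketch along the lines of the snippet above (placeholder inputs, not a tuned serving setup):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch: batched generation so the GPU is not invoked once per sentence.
# Left padding matches the single-sentence example above.
model = AutoModelForCausalLM.from_pretrained("haoranxu/ALMA-13B-R", torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("haoranxu/ALMA-13B-R", padding_side='left')
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # LLaMA-style tokenizers often lack a pad token

sentences = ["First English sentence.", "Second English sentence."]  # placeholder inputs
prompts = ["Translate this from English to Chinese:\nEnglish: " + s + "\nChinese:" for s in sentences]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    generated = model.generate(**inputs, num_beams=5, max_new_tokens=400, do_sample=True,
                               temperature=0.6, top_p=0.9, pad_token_id=tokenizer.pad_token_id)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))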

My English is not good, so please forgive me. Today is the Chinese XiaoNian in the lunar calendar; wishing you a Happy New Year.

fe1ixxu commented 7 months ago

Happy New Year to you too!

A few things come to mind that may help you speed up inference:

Hope this information is helpful!