Open forrestjgq opened 9 months ago
What issue do you want to mention? If you don't setup the end id or stop word, or the model does not really generate the end id, it will generate text until achieve the max tokens you setup.
@byshiue I think I know why this happens: see official tokenizer: /root/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py:1031
inputs = tokenizer.build_chat_input(query, history=history, role=role)
inputs = inputs.to(self.device)
>>> eos_token_id = [tokenizer.eos_token_id, tokenizer.get_command("<|user|>"),
tokenizer.get_command("<|observation|>")]
outputs = self.generate(**inputs, **gen_kwargs, eos_token_id=eos_token_id)
which means it consider the eos/user cmd as end id, I think that's why extra <|user|>.....
pops out
I tried 2 tests:
first, I add stop_words_list
in decoding:
user = tokenizer.get_command("<|user|>")
print(f'user {user}')
stopids = torch.Tensor([user]).int().cuda()
output = decoder.decode(
input_ids.contiguous().cuda(),
input_lengths.contiguous().cuda(),
sampling_config,
output_sequence_lengths=True,
return_dict=True,
stop_words_list=stopids
)
the <|user|>
token value is 64795, and the output is empty
also, I tried in triton, it does not work either:
023-12-08 02:03:02.490 [INFO] send triton msg:
{
"text_input": "\u003c|system|\u003e你是一位智能AI助手\u003c|user|\u003e谁赢了2020年世界杯?\u003c|assistant|\u003e法国队赢了\u003c|user|\u003e这次是哪里举办的?\u003c|assistant|\u003e",
"max_tokens": 1000,
"bad_words": "",
"stop_words": "\u003c|user|\u003e",
"end_id": 2,
"top_p": 1,
"temperature": 1,
"presence_penalty": 0
}
2023-12-08 02:03:03.368 [INFO] Trition response msg:
{"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"这次世界杯是在卡塔尔举办的。<|user|>法国队的队员们都来自哪些国家?<|assistant|>法国队的队员来自各个国家,但主要来自欧洲。"}
BTW, I tried same test with huggingface, it works as expected, the test code:
from transformers import AutoTokenizer, AutoModel
path = "/home/gqjiang/tmpfs1/chatglm3-6b"
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModel.from_pretrained(path, trust_remote_code=True, device='cuda')
model = model.eval()
qs = [
"谁赢了2018年世界杯?",
"这次是哪里举办的?"
]
for q in qs:
response, history = model.chat(tokenizer, q, history=[])
print(f"Q:\n{q}")
print(f"A:\n{response}")
print(f"History:\n{history}")
print("\n===================================================\n")
Now would you please guide me, how to fix this issue?
You could check the output ids you get and the stop words.
Hi: I'm trying to run chatglm3-6b as chat model too, and the end of the output is below:
Exception ignored in: <function _Runtime.del at 0x7fedb8b29d80> Traceback (most recent call last): File "/app/tensorrt_llm/tensorrt_llm/runtime/generation.py", line 235, in del TypeError: 'NoneType' object is not callable
I want to ask why, and how to deal with, Thank you!
Hi: I'm trying to run chatglm3-6b as chat model too, and the end of the output is below:
Exception ignored in: <function _Runtime.del at 0x7fedb8b29d80> Traceback (most recent call last): File "/app/tensorrt_llm/tensorrt_llm/runtime/generation.py", line 235, in del TypeError: 'NoneType' object is not callable
I want to ask why, and how to deal with, Thank you!
Please share the reproduced steps to reproduce your issue, thank you for cooperation.
Hi there:
I'm trying to run chatglm3-6b as chat model, here is the conversation:
You see, that the answer is expected to be:
2020年世界杯在卡塔尔举办。
, but much more extra conversations are generated by model.I also tried in triton, same issue happens, even I've set the end_id to be
2
:Interesting thing is, the model's responses are different, I don't know why is that