alipay / PainlessInferenceAcceleration

Creative Commons Attribution 4.0 International
283 stars 18 forks

Speed does improve, but the generated output has quality problems #5

Closed dafen12 closed 8 months ago

dafen12 commented 8 months ago

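    import time

    # `model` and `tokenizer` are assumed to be loaded already; the loading code is not part of this report.
    # Prompt (English gloss): "Xiao Ming's father has three sons. Xiao Ming was born in 1994,
    # the eldest son is called Da Mao and the second son is called Er Mao. It is now 2023.
    # How old is the third son this year?"
    prompt = "小明的爸爸有三个儿子,小明1994年出生,大儿子叫大毛,二儿子叫二毛,今年是2023年,请问三儿子今年多大了"
    device = 'cuda:0'

    # Two baseline runs followed by two lookahead runs.
    for use_lookahead in [False, False, True, True]:
        debug_lookahead = False
        decoding_length = 64
        branch_length = 12
        max_new_tokens = 256
        decoding_kwargs = {"use_lookahead": use_lookahead,
                           "debug_lookahead": debug_lookahead,
                           "decoding_mode": 'hier',
                           "decoding_length": decoding_length,
                           "branch_length": branch_length}
        model.generation_config.decoding_kwargs = decoding_kwargs
        model.generation_config.temperature = 0.1
        model.generation_config.top_p = 0.1
        model.generation_config.repetition_penalty = 1.0
        ts = time.time()
        response, history = model.chat(tokenizer, prompt, history=None)
        te = time.time()
        # Elapsed time divided by the response length in characters.
        print(f'lookahead:{use_lookahead} time:{(te - ts)/len(response):.3f}s/t response:{response}')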

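    lookahead:False time:0.040s/t response:小明的爸爸有三个儿子,其中小明是老三,所以三儿子今年是2023-1994=29岁。
    lookahead:False time:0.040s/t response:小明的爸爸有三个儿子,其中小明是老三,所以三儿子今年是2023-1994=29岁。
    lookahead:True time:0.026s/t response:小明的爸爸有三个儿子,其中两个儿子的名字分别是大毛和是哥哥,二毛也是同父异母的哥哥小明以及弟弟三毛是同年同月同日出生不同命我想问一下你,这个有啥问题
    lookahead:True time:0.017s/t response:小明的爸爸有三个儿子,大毛。

[English gloss of the responses: both baseline runs answer "Xiao Ming's father has three sons, and Xiao Ming is the third, so the third son is 2023-1994=29 years old this year." The first lookahead run is incoherent, roughly "Xiao Ming's father has three sons, two of whom are named Da Mao and is the older brother, Er Mao is also a half-brother Xiao Ming and the younger brother San Mao were born on the same day of the same month and year with different fates I'd like to ask you, what is the problem with this". The second lookahead run stops after "Xiao Ming's father has three sons, Da Mao."]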
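To make a mismatch like the one above easier to spot, the position where a lookahead response first departs from the baseline can be located automatically. The sketch below is only an illustration: `first_divergence` is a hypothetical helper, not part of the repository, and the two strings are copied from the log in this report.

```python
def first_divergence(baseline: str, candidate: str) -> int:
    """Return the index of the first differing character, or -1 if the strings are identical."""
    for i, (a, b) in enumerate(zip(baseline, candidate)):
        if a != b:
            return i
    # One string is a strict prefix of the other: they diverge where the shorter one ends.
    if len(baseline) != len(candidate):
        return min(len(baseline), len(candidate))
    return -1


# Responses copied from the log above (baseline run and the last lookahead run).
baseline = "小明的爸爸有三个儿子,其中小明是老三,所以三儿子今年是2023-1994=29岁。"
lookahead = "小明的爸爸有三个儿子,大毛。"

idx = first_divergence(baseline, lookahead)
if idx == -1:
    print("responses are identical")
else:
    print(f"shared prefix: {baseline[:idx]!r}")
    print(f"baseline continues with:  {baseline[idx:idx + 10]!r}")
    print(f"lookahead continues with: {lookahead[idx:idx + 10]!r}")
```

A check like this is only meaningful once sampling is disabled, since otherwise even two baseline runs are not guaranteed to match.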

dafen12 commented 8 months ago

The model is the non-quantized Qwen 14B chat model.

zheyishine commented 8 months ago

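You need to set do_sample=False (Qwen enables sampling by default, which makes the result differ on every run) and configure eos_token_id (with Qwen's default EOS, extra content is generated and then truncated). In addition, Qwen's RMSNorm is not precise enough, which increases the variation in results; in the latest version we replaced it with a high-precision RMSNorm. Finally, lookahead does not fully support the repetition_penalty parameter, so we recommend setting it to None.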
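The settings suggested in this reply could be applied roughly as follows. This is a minimal sketch, not the maintainers' code: `apply_suggested_settings` is a hypothetical helper, a `SimpleNamespace` stands in for `model.generation_config` so the snippet runs without a model, and the EOS id is a placeholder because the thread does not say which token id to use.

```python
from types import SimpleNamespace


def apply_suggested_settings(generation_config, eos_token_id):
    """Apply the three generation settings recommended in the reply above."""
    generation_config.do_sample = False          # greedy decoding, repeatable output
    generation_config.eos_token_id = eos_token_id  # explicit end-of-sequence token
    generation_config.repetition_penalty = None  # not fully supported by lookahead
    return generation_config


# Stand-in for model.generation_config, starting from the values used in the report.
config = SimpleNamespace(do_sample=True, eos_token_id=None,
                         repetition_penalty=1.0, temperature=0.1, top_p=0.1)

# Placeholder only: the correct id depends on the tokenizer and is not given in the thread.
PLACEHOLDER_EOS_ID = 0

apply_suggested_settings(config, PLACEHOLDER_EOS_ID)
print(config)
```

With a real model, the same helper would presumably be called on `model.generation_config` before `model.chat(...)`. Once `do_sample` is False, the `temperature` and `top_p` values from the original script are not expected to affect the output.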