BUG for chatglm3-6b and Qwen-14B-Int4

alipay / PainlessInferenceAcceleration

Creative Commons Attribution 4.0 International


Issue #7, closed by AGI-Jarvis 8 months ago

AGI-Jarvis commented 8 months ago

chenliangjyj commented 8 months ago

For chatglm3-6b, we have just uploaded a test script: https://github.com/alipay/PainlessInferenceAcceleration/blob/main/pia/lookahead/examples/chatglm3_example.py
Please turn off sampling when you test; your first two results without lookahead already seem to differ from each other. We only support greedy search for now.
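
To illustrate why sampling has to be off for a consistency check, here is a minimal toy sketch (not code from this repository; `greedy_pick` and `sample_pick` are hypothetical helper names). Greedy search always returns the same token for the same logits, while sampling can return different tokens from one run to the next, so outputs with and without lookahead are only comparable under greedy search.

```python
import math
import random
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert logits to probabilities (shifted by the max for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits: List[float]) -> int:
    """Greedy search: always take the index of the highest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_pick(logits: List[float], rng: random.Random) -> int:
    """Sampling: draw a token index at random according to the softmax probabilities."""
    return rng.choices(range(len(logits)), weights=softmax(logits))[0]


# Toy logits for a 3-token vocabulary (illustrative values only).
logits = [0.1, 2.0, 1.9]
greedy_tokens = {greedy_pick(logits) for _ in range(5)}
rng = random.Random(0)
sampled_tokens = {sample_pick(logits, rng) for _ in range(200)}
```

With these toy logits, `greedy_tokens` holds a single token id, whereas `sampled_tokens` will almost certainly hold more than one, which is the kind of run-to-run variation seen when sampling is left on.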

AGI-Jarvis commented 8 months ago

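Qwen-14B-Int4 is already set to greedy decoding. I ran it four times with lookahead=True; judging from the results in the screenshot, could the differences be because repetition penalty is not supported yet? The token/s figures with lookahead=True also look a bit odd.

After the update, chatglm3 runs normally, but the output format still seems to have minor problems: the output sometimes mixes Chinese and English, and sometimes stops halfway. From my tests, lookahead is fairly slow on the first run, right? The speedup only shows up from the second run onward? That said, the acceleration is really impressive! Looking forward to the results once Qwen supports repetition penalty!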

AGI-Jarvis commented 8 months ago

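However, in retrieval-augmented generation scenarios, a tiny change to the prompt brings the speed back down to the slow level. For example, “编一个200字左右的儿童故事” ("Make up a children's story of about 200 characters") versus “写一个200字左右的儿童故事” ("Write a children's story of about 200 characters").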

AGI-Jarvis commented 8 months ago

I also noticed that the repetition penalty for chatglm3 is set to 1.0 here, while the official setting is 1.1. Is 1.0 a better fit for this framework?

chenliangjyj commented 8 months ago

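1. repetition_penalty is not supported yet; we recommend setting it to None.
2. The first lookahead run is effectively adjusting the generation probabilities, so the speedup is low. As for small changes to the prompt: in online use the service normally stays running, so the probabilities can be updated in memory in real time, and generating an answer similar to that of an earlier query should be fast.
3. Is there a limit set on max_new_token for chatglm?

Thank you very much for trying our framework.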

AGI-Jarvis commented 8 months ago

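1. I saw in the qwen example that repetition penalty is not supported yet, and in the glm3 example that it defaults to 1.0.
2. As the screenshot shows, even a one-character difference breaks the speedup. Perhaps adding a BERT model could help with this? After all, in most retrieval-augmented scenarios the inputs and outputs differ a lot from one request to the next. Also, if the service stays running, won't memory blow up after many questions and answers? I tested a few dozen and nothing has gone wrong so far, though.
3. Solved. I overlooked it and thought it was a feature.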

zheyishine commented 8 months ago

Repetition penalty is supported in the latest version. Setting this parameter to 1.0 is equivalent to setting it to None.
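
A minimal sketch of why 1.0 and None behave the same, assuming the commonly used multiplicative rule (the one Hugging Face transformers applies in `RepetitionPenaltyLogitsProcessor`): positive logits of already generated tokens are divided by the penalty and negative ones are multiplied by it. Whether this framework uses exactly that rule is an assumption, and `apply_repetition_penalty` is a hypothetical helper, not part of its API.

```python
from typing import Iterable, List, Optional


def apply_repetition_penalty(
    logits: List[float],
    seen_token_ids: Iterable[int],
    penalty: Optional[float],
) -> List[float]:
    """Return a copy of `logits` with already generated tokens penalized."""
    if penalty is None:
        # No penalty configured: logits pass through unchanged.
        return list(logits)
    out = list(logits)
    for token_id in set(seen_token_ids):
        score = out[token_id]
        # Push the score toward "less likely" in both sign cases.
        out[token_id] = score * penalty if score < 0 else score / penalty
    return out


logits = [2.0, -1.0, 0.5]   # toy logits for a 3-token vocabulary
seen = [0, 1]               # token ids that were already generated
same_as_none = apply_repetition_penalty(logits, seen, 1.0)
penalized = apply_repetition_penalty(logits, seen, 1.1)
```

Multiplying or dividing by 1.0 leaves every score untouched, so `same_as_none` equals the input logits, while a penalty of 1.1 lowers the scores of tokens 0 and 1 only.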
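qwen_example is mainly meant for testing output consistency and cannot accurately reflect performance differences. For performance testing, please refer to the test scripts in the benchmarks directory, which measure with a large number of non-repeating requests. Memory will not blow up as the number of questions and answers grows: what is cached internally is not the requests but token-based n-grams, and there is an eviction mechanism.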
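
The idea of a bounded, token-based n-gram cache can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the framework's actual data structure: the class name `NGramCache`, its parameters, and the least-recently-used eviction policy are all hypothetical choices made for the example.

```python
from collections import OrderedDict
from typing import List, Sequence, Tuple


class NGramCache:
    """Toy bounded cache: token prefix -> tokens that followed it last time."""

    def __init__(self, prefix_len: int = 2, branch_len: int = 4, capacity: int = 1000) -> None:
        self.prefix_len = prefix_len    # number of tokens used as the lookup key
        self.branch_len = branch_len    # number of follow-up tokens stored per key
        self.capacity = capacity        # maximum number of keys kept in memory
        self._store: "OrderedDict[Tuple[int, ...], List[int]]" = OrderedDict()

    def update(self, token_ids: Sequence[int]) -> None:
        """Record every prefix -> continuation n-gram found in a token sequence."""
        n = self.prefix_len
        for i in range(len(token_ids) - n):
            key = tuple(token_ids[i:i + n])
            self._store[key] = list(token_ids[i + n:i + n + self.branch_len])
            self._store.move_to_end(key)         # mark as most recently used
            while len(self._store) > self.capacity:
                self._store.popitem(last=False)  # evict the least recently used key

    def draft(self, context: Sequence[int]) -> List[int]:
        """Propose draft tokens for the last `prefix_len` tokens of the context."""
        key = tuple(context[-self.prefix_len:])
        continuation = self._store.get(key)
        if continuation is None:
            return []                            # cold cache: nothing to propose
        self._store.move_to_end(key)
        return list(continuation)

    def __len__(self) -> int:
        return len(self._store)
```

Because the keys are short token windows rather than whole requests, the number of entries is capped by `capacity` no matter how many requests are served, and a draft can be proposed whenever the most recent tokens match something seen before, even if the full prompt differs.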