We are also evaluating this, and we likewise see that the output from the vLLM endpoint is worse, until we read https://docs.vllm.ai/en/latest/models/vlm.html and found that a specific prompt template needs to be followed on the vLLM side. Not sure if this is one factor you could check.
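For illustration, here is a minimal sketch (not the issue author's setup) of querying a vLLM OpenAI-compatible server through the chat endpoint, so the server applies the model's own chat/prompt template instead of receiving a raw string. The server URL, model name, and image URL are placeholders.

```python
# Sketch: let the vLLM server apply the model's chat template by using the
# OpenAI-compatible chat completions endpoint. All names below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2-VL-7B-Instruct",  # placeholder model name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
            ],
        }
    ],
    temperature=0.0,
)
print(response.choices[0].message.content)
```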
起始日期 | Start Date
9/3/2024
实现PR | Implementation PR
No response
相关Issues | Reference Issues
No response
摘要 | Summary
When using vLLM to make better use of GPU memory and speed up inference and generation, there is a noticeable degradation in output quality compared to running the original model directly. This issue aims to address the quality drop and find ways to match the original model's output quality while keeping the speed improvements.
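One way to narrow down the cause is an A/B comparison: run the same prompt through the original Hugging Face model and through vLLM with identical decoding settings, so that any remaining difference points at the serving path rather than at sampling. The sketch below assumes a placeholder model name and prompt; it is not taken from the issue, and the two models may need to be loaded in separate processes if GPU memory is tight.

```python
# Hedged sketch: compare outputs from the original model and vLLM under
# matched greedy decoding. Model name and prompt are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder
prompt = "Explain the difference between a list and a tuple in Python."

# Reference generation with the original Hugging Face model (greedy decoding).
tokenizer = AutoTokenizer.from_pretrained(model_id)
hf_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
inputs = tokenizer(prompt, return_tensors="pt").to(hf_model.device)
hf_out = hf_model.generate(**inputs, max_new_tokens=256, do_sample=False)
print("HF:", tokenizer.decode(hf_out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

# Same prompt through vLLM with matching greedy settings.
vllm_model = LLM(model=model_id)
params = SamplingParams(temperature=0.0, max_tokens=256)
vllm_out = vllm_model.generate([prompt], params)
print("vLLM:", vllm_out[0].outputs[0].text)
```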
基本示例 | Basic Example
No complete example is provided yet.
缺陷 | Drawbacks
- The current optimization leads to decreased output quality.
- Users may have to choose between speed and quality, which is not ideal.
- Potential increased complexity in configuration to balance speed and quality.
未解决问题 | Unresolved questions