wahaha22 opened this issue 5 months ago
Hello, thank you for your interest in our project. Currently, the open-source code of PowerInfer is designed for scenarios where the model exceeds the capacity of the GPU memory. For scenarios where the model can entirely fit within the GPU memory, calculations would take place on the GPU. In theory, the PowerInfer framework should also provide a 1.5x to 2x acceleration, which should align with the results in the DejaVu paper. However, at present, we have not designed the computational graph or implemented operators for scenarios where the computation is completely offloaded to the GPU. Therefore, my suggestion is for you to try running the non-quantized versions of the Falcon-40B model or the llama-70B model. In the future, we plan to introduce acceleration support even when the model can entirely fit within the GPU.
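To make the "model exceeds GPU memory" distinction concrete, here is a quick back-of-the-envelope sketch (my own illustration, not part of PowerInfer) estimating whether a model's FP16 weights alone would fit in a given VRAM budget. Real usage also needs room for the KV cache and activations, so this is only a lower bound.

```python
def weight_vram_gb(n_params_b: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GiB for n_params_b billion parameters."""
    return n_params_b * 1e9 * bytes_per_param / 2**30

# FP16 (2 bytes/param) examples from the models mentioned above,
# against a hypothetical 24 GiB consumer GPU:
for name, n_b in [("LLaMA-7B", 7), ("Falcon-40B", 40), ("LLaMA-70B", 70)]:
    need = weight_vram_gb(n_b, 2)
    print(f"{name}: ~{need:.0f} GiB of weights, fits in 24 GiB: {need <= 24}")
```

By this estimate, an unquantized 7B model fits comfortably on a 24 GiB card (the all-on-GPU case discussed here), while Falcon-40B and LLaMA-70B in FP16 do not, which is the scenario the current PowerInfer code targets.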
Thanks for the reply! Regarding "In theory, the PowerInfer framework should also provide a 1.5x to 2x speedup (which should align with the results in the DejaVu paper), but we have not yet designed the computation graph or implemented the operators for the fully-GPU-offloaded scenario": my understanding is that this theoretical 1.5x to 2x speedup should come from the predictor, but it did not show up in my experiments. I would like to know whether the cause is one of the following:
Your second guess is correct. The current limitation stems from our hybrid inference implementation, where there are still numerous unnecessary CPU-GPU synchronization points, even when weights can be fully offloaded to the GPU. We are planning to resolve this issue in the near future to achieve pure GPU inference. Please stay tuned for our updates 💪
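To illustrate why stray synchronization points matter, here is a simple cost model (my own sketch with made-up numbers, not PowerInfer code): each CPU-GPU sync stalls the pipeline for a fixed overhead, paid once per layer per token, so removing them directly raises decoding throughput.

```python
def tokens_per_second(n_layers: int, layer_ms: float, syncs_per_layer: int,
                      sync_overhead_ms: float) -> float:
    # Per-token latency = per-layer compute plus per-layer sync stalls.
    per_token_ms = n_layers * (layer_ms + syncs_per_layer * sync_overhead_ms)
    return 1000.0 / per_token_ms

# Hypothetical numbers for a 32-layer model:
hybrid = tokens_per_second(32, 0.4, 2, 0.3)    # two sync points per layer
pure_gpu = tokens_per_second(32, 0.4, 0, 0.3)  # fast path: no sync points
print(f"with syncs: {hybrid:.1f} tok/s, without: {pure_gpu:.1f} tok/s")
```

With these illustrative numbers, eliminating two 0.3 ms syncs per layer more than doubles throughput, which is consistent in spirit with the gap the maintainers describe.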
Hello authors, thank you for your patient replies. I have a few questions I would like to discuss further:
The primary modifications needed for complete GPU offload are in the FFN's computation graph. We need to provide a fast path that removes the elements related to CPU-GPU hybrid computation, such as the GPU index and GPU buckets, which are unnecessary in this context. This part of the code is in the llm_build_ffn_sparse function in llama.cpp. We will start working on this soon. Additionally, when CPU-GPU hybrid computation is not a consideration, we can similarly provide a fast path at the level of the low-level GPU operators, located in ggml-cuda.cu.
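The fast-path idea described above can be sketched as follows. This is hypothetical Python pseudocode, not the actual C++ in llm_build_ffn_sparse; the "gpu_index" split below is a stand-in for the real weight-placement bookkeeping. The point is that when all FFN weights are resident on the GPU, the GPU-index/bucket partitioning and the final CPU-GPU merge can be skipped entirely.

```python
def build_ffn(x, weights, predictor_mask, fully_offloaded: bool):
    # Predictor selects which neurons (weight rows) are active this token.
    active = [i for i, keep in enumerate(predictor_mask) if keep]
    if fully_offloaded:
        # Fast path: plain sparse FFN on the GPU, no hybrid bookkeeping.
        return sum(weights[i] * x for i in active)
    # Hybrid path: partition active neurons into GPU-resident rows (tracked
    # by a gpu_index) and a CPU remainder, compute both, then merge.
    gpu_index = {i for i in range(len(weights)) if i % 2 == 0}  # stand-in split
    gpu_part = sum(weights[i] * x for i in active if i in gpu_index)
    cpu_part = sum(weights[i] * x for i in active if i not in gpu_index)
    return gpu_part + cpu_part  # the merge implies a sync in the real system
```

Both paths compute the same result; the fast path simply avoids the partition-and-merge machinery (and its synchronization) when it has nothing to contribute.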
The sparsity of the Attention layer varies significantly across models and has not proven significant in the models we currently support, so we do not plan to support it in this open-source code. You can refer to the discussion in issue #111.
Hello, may I ask when the full-GPU inference development is expected to be completed? @hodlen
We have only just started on this; we expect it will take one to two weeks to complete the implementation and performance testing.
@hodlen Sorry to bother you, but how is the full-GPU inference development going? Is there an expected release date?
Also waiting for this.
Hello authors, I am reproducing PowerInfer and comparing the performance of llama.cpp and PowerInfer. During benchmarking I ran into some unexpected results.
Environment
Code
Operating system
Hardware environment
Software environment (from cmake)
Build
PowerInfer:
cmake -S . -B build -DLLAMA_CUBLAS=ON
cmake --build build --config Release

llama.cpp:
cmake -S . -B build -DLLAMA_CUBLAS=ON
cmake --build build --config Release
Models
Model used by PowerInfer: https://huggingface.co/PowerInfer/ReluLLaMA-7B-PowerInfer-GGUF
Model used by llama.cpp: https://huggingface.co/SparseLLM/ReluLLaMA-7B
Conversion command: python3 convert.py ./ReluLLaMA-7B --outtype f16, which produces ggml-model-f16.gguf
Run & Results
PowerInfer: ./build/bin/main -m ../models/PowerInfer_ReluLLaMA-7B-PowerInfer-GGUF/llama-7b-relu.powerinfer.gguf -n 128 -t 8 -p "Once upon a time" --vram-budget 1000
llama.cpp: ./build/bin/main -m ../models/ggml-model-f16.gguf -n 128 -t 8 -p "Once upon a time" -ngl 100
Question
At eval time, PowerInfer runs at 17.15 tokens per second while llama.cpp runs at 63.86 tokens per second. Could this be caused by an incorrect PowerInfer configuration on my side? Any suggestions would be much appreciated.