Required prerequisites

Questions

Hello, a question about this sentence in the README: "Building on the several optimization techniques above, we achieved a throughput of 182 TFLOPS for the 7B model on a thousand A800 GPUs, with peak GPU compute utilization as high as 58.3%."

Does the 182 TFLOPS refer to model TFLOPS or hardware TFLOPS? For reference, the MLSys 2023 paper below defines the two terms as follows:
https://proceedings.mlsys.org/paper_files/paper/2023/file/e851ca7b43815718fbbac8afb2246bf8-Paper-mlsys2023.pdf
"We define the model FLOPs utilization (MFU) and hardware FLOPs utilization (HFU) similar to Chowdhery, et al. [6]. Model FLOPs are the floating point operations required to perform a single forward and backward pass (single iteration) regardless of the implementations and hardware limi- tations. As a result, model FLOPs are hardware and implementation independent and only depend on the underlying model. On the other hand, the hardware FLOPs represent the floating point op- erations that are actually performed on the hardware per iteration. Therefore, if an implementation requires activation recomputation (for example ours), then the hardware FLOPs are going to be larger than model FLOPs. We provide a tight lower bound formula for the model and hardware FLOPs in Appendix A. For our method, the hardware to model FLOPs ratio is approximately 1 + s/6h."
Checklist
[X] I have provided all relevant and necessary information above.
[X] I have chosen a suitable title for this issue.