Open mengllm opened 1 month ago
In the npuExe.run function, the QNN graph is built and then executed, and the graph-building stage takes most of the time. The loading and construction time, as well as the computing speed, are still being improved.
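To see how the run time splits between the two phases, each can be timed separately. A minimal sketch of this measurement, where `build_graph` and `execute_graph` are hypothetical stand-ins for the real mllm/QNN calls, not the actual API:

```python
import time

def profile_run(build_graph, execute_graph, num_tokens):
    """Time a QNN-style two-phase run: build the graph, then execute it.

    build_graph / execute_graph are placeholder callables standing in
    for the real mllm/QNN operations; they are NOT the mllm API.
    """
    t0 = time.perf_counter()
    graph = build_graph()      # reported to dominate in the current release
    t1 = time.perf_counter()
    execute_graph(graph)       # the actual inference pass
    t2 = time.perf_counter()
    build_s, exec_s = t1 - t0, t2 - t1
    return {
        "build_s": build_s,
        "exec_s": exec_s,
        "tokens_per_s": num_tokens / (build_s + exec_s),
    }

# Dummy stand-ins so the sketch runs on its own:
stats = profile_run(lambda: object(), lambda g: None, num_tokens=64)
print(stats)
```

Comparing `build_s` against `exec_s` for a real run would show how much of the per-prompt cost is graph construction rather than compute.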
I fully understand the situation; however, I am curious about how to obtain the test result of 1000 tokens per second.
@oreomaker Could you update your test results for the prefilling stage?
The currently released code is a very preliminary version of our NPU support (as noted in the README). Many of the techniques in our paper are not integrated yet, and there are a few performance issues that need more engineering effort to fix. We are still working on delivering the promised prefill speed; please stay tuned.
It's great work [1000 t/s prefill speed] on the Hexagon NPU, but is there a roadmap indicating when the prefill speed mentioned in the paper will be available?
Hi, mllm-qnn works on my device, an OPPO Find X7 Ultra (Snapdragon 8 Gen 3, 16 GB RAM). However, the prefill speed for Qwen1.5-1.8B is approximately 4-6 tokens per second, which diverges significantly from the 1000 tokens per second claimed in the paper. Based on our tests, npuExe.run takes approximately 15 seconds to process 64 tokens.
Could you provide some suggestions?
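As a sanity check, the reported wall-clock time is consistent with the observed rate (plain arithmetic, no mllm code involved):

```python
# Reported measurement: ~15 s to process a 64-token prompt.
tokens, seconds = 64, 15.0
prefill_speed = tokens / seconds
print(round(prefill_speed, 2))  # 4.27, within the observed 4-6 tokens/s range
```

This confirms the ~15 s / 64-token figure and the 4-6 t/s observation describe the same measurement, both far from the 1000 t/s paper claim.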