UbiquitousLearning / mllm

Fast Multimodal LLM on Mobile Devices
https://ubiquitouslearning.github.io/mllm_website
MIT License

Prefill speed is approximately 4~6 tokens/s for Qwen1.5-1.8B #116

Open mengllm opened 1 month ago

mengllm commented 1 month ago

Hi, mllm-qnn works on my device, an OPPO Find X7 Ultra (Snapdragon 8 Gen 3, 16 GB RAM). However, the prefill speed for Qwen1.5-1.8B is approximately 4-6 tokens per second, which diverges significantly from the 1000 tokens per second claimed in the paper. Based on our tests, npuExe.run takes approximately 15 seconds to process 64 tokens:

        auto startTime = currentMs();

        // 1: Prefill stage using NPU chunk execute
        npuExe.run(npu_ctx, &npuNet, {input});
        auto result = npuExe.result();

        int duration = (int) (currentMs() - startTime);
        std::cout << "input_tensor.sequence()=" << input_tensor.sequence() << std::endl;
        std::cout << "prefill cost: " << duration << "ms prefill speed: "
                  << input_tensor.sequence() * 1000 / duration << " token/s" << std::endl;

Could you provide some suggestions?

oreomaker commented 1 month ago

In the npuExe.run function, the QNN graph is built and then executed, and the graph-building stage takes most of the time. Both the loading/construction time and the computing speed are still being improved.

mengllm commented 1 month ago

> In the npuExe.run function, the QNN graph is built and then executed, and the graph-building stage takes most of the time. Both the loading/construction time and the computing speed are still being improved.

I fully understand the situation; however, I am curious how the 1000 tokens per second test result was obtained.

mengllm commented 1 month ago

> In the npuExe.run function, the QNN graph is built and then executed, and the graph-building stage takes most of the time. Both the loading/construction time and the computing speed are still being improved.
>
> I fully understand the situation; however, I am curious how the 1000 tokens per second test result was obtained.

@oreomaker Could you update your test results for the prefill stage?

oreomaker commented 1 month ago

> In the npuExe.run function, the QNN graph is built and then executed, and the graph-building stage takes most of the time. Both the loading/construction time and the computing speed are still being improved.
>
> I fully understand the situation; however, I am curious how the 1000 tokens per second test result was obtained.

The currently released code is a very preliminary version of our NPU support (as noted in the README). Many of the techniques in our paper are not yet integrated, and there are a few performance issues that need more engineering effort to fix. We are still working to deliver the promised prefill speed, so please stay tuned.

liangzelang commented 3 weeks ago

> In the npuExe.run function, the QNN graph is built and then executed, and the graph-building stage takes most of the time. Both the loading/construction time and the computing speed are still being improved.
>
> I fully understand the situation; however, I am curious how the 1000 tokens per second test result was obtained.
>
> The currently released code is a very preliminary version of our NPU support (as noted in the README). Many of the techniques in our paper are not yet integrated, and there are a few performance issues that need more engineering effort to fix. We are still working to deliver the promised prefill speed, so please stay tuned.

It's great work [1000 t/s prefill speed] on the Hexagon NPU, but is there a roadmap indicating when the prefill speed mentioned in the paper will be available?