MollySophia / rwkv-qualcomm

Run RWKV v5 or RWKV v6 inference with the Qualcomm AI Engine Direct SDK

Run RWKV inference on the Qualcomm HTP (Hexagon Tensor Processor) using the QNN SDK.

Features

Prerequisites

Usage

1. Convert the model weights to a QNN model library file.

Converting an FP16 model

Converting an A16W8 model

Converting an A16W4 model

The outputs are written to the lib/ directory. The model library contains the weights as well as the functions that prepare the graph. It can either be loaded on the device using the libraries in lib/aarch64-android/, or be prepared on the x86 host machine using lib/x86_64-linux-clang/ to generate an HTP context cache. Qualcomm HTP limits the size of a model library file, so the model is split into multiple chunks.
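The chunk naming can be sketched in a few lines of Python. This is only an illustration: the `<name>_chunk<i>of<n>.bin` pattern is inferred from the file names in the example log in section 3.1, and `sibling_chunks` is a hypothetical helper, not a function from this repository.

```python
from __future__ import annotations

import re
from pathlib import Path

# Assumed naming scheme, inferred from the example log: <name>_chunk<i>of<n>.<ext>
CHUNK_RE = re.compile(r"^(?P<stem>.+)_chunk(?P<index>\d+)of(?P<total>\d+)(?P<ext>\.\w+)$")


def sibling_chunks(chunk_path: str) -> list[str]:
    """Return the paths of all chunks implied by the name of one chunk."""
    path = Path(chunk_path)
    match = CHUNK_RE.match(path.name)
    if match is None:
        # The name carries no chunk suffix, so treat it as a single-file model.
        return [chunk_path]
    total = int(match["total"])
    names = [
        f"{match['stem']}_chunk{i}of{total}{match['ext']}"
        for i in range(1, total + 1)
    ]
    # Keep every chunk in the same directory as the one that was passed in.
    return [str(path.with_name(name)) for name in names]


if __name__ == "__main__":
    first = "RWKV-x060-World-1B6-v2.1-20240328-ctx4096_chunk1of2.bin"
    for name in sibling_chunks(first):
        print(name)
```

Passing only the first chunk is enough to list the rest, which matches how the demo in section 3.1 is invoked with `..._chunk1of2.bin` and then reads `..._chunk2of2.bin`.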

2. Generate the HTP context cache

3. Run inference on the device

3.1. Running on Qualcomm Snapdragon SM8650 with HTP v75 (Xiaomi Mi 14)

Example output:

RWKV v6 1B6 A16W4

130|houji:/data/local/tmp/rwkv $ ./rwkv-qualcomm-demo b_rwkv_vocab_v20230424.txt RWKV-x060-World-1B6-v2.1-20240328-ctx4096_chunk1of2.bin
Loading model context binary from RWKV-x060-World-1B6-v2.1-20240328-ctx4096_chunk1of2.bin
Reading chunk: RWKV-x060-World-1B6-v2.1-20240328-ctx4096_chunk1of2.bin
Buffer size: 719802320
Reading chunk: RWKV-x060-World-1B6-v2.1-20240328-ctx4096_chunk2of2.bin
Buffer size: 586727640
User: 请为我写一首诗。

Assistant: 当然,请告诉我你喜欢什么类型的诗歌。

User: 请写一首描写秋天景色的诗。

Assistant: 秋意渐浓,寒意渐深,
大地已是金黄如火,
落英纷飞,树影绰约,
人心也随之变得清静。
夜空中的繁星在闪闪,
思念似要被所有握住,
但又像是永不消散的孤注,
在这个秋天里如此特别。

请问这首诗符合您需求吗?

Average time per token: 0.0235644s
Average tokens per second: 42.4368
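As a quick sanity check on the log above, the two reported averages are reciprocals of each other, and the two buffer sizes add up to the total context binary size. A minimal sketch, using only values copied from the log (the variable names are illustrative):

```python
# Values copied from the example log above.
avg_seconds_per_token = 0.0235644
chunk_sizes_bytes = [719802320, 586727640]

# Throughput is the reciprocal of the average time per token.
tokens_per_second = 1.0 / avg_seconds_per_token

# Total size of the two context binary chunks, in GiB (2**30 bytes).
total_gib = sum(chunk_sizes_bytes) / 2**30

print(f"{tokens_per_second:.2f} tokens/s")
print(f"{total_gib:.2f} GiB of context binary loaded")
```

This gives roughly 42.44 tokens/s, in line with the reported 42.4368 (the small difference comes from the rounded time value printed in the log), and roughly 1.22 GiB across both chunks.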

Performance

Running on the Qualcomm Snapdragon SM8650 with HTP v75 (Xiaomi Mi 14):

| Model | Precision | Generation tokens per second | LAMBADA ppl, acc |
| --- | --- | --- | --- |
| RWKV v6 1.6B | att-a16w8 + ffn-a16w4 | 42.4368 | TODO |
| RWKV v6 1.6B | a16w8 | 31.6564 | 4.75009, 66.3497% |
| RWKV v6 1.6B | fp16 | 15.0434 | 4.63598, 67.2618% |
| RWKV v6 3B | att-a16w8 + ffn-a16w4 | 21.3172 | TODO |
| RWKV v6 3B | a16w8 | 16.2146 | TODO |
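To read the RWKV v6 1.6B rows as speedups over the fp16 baseline, the generation speeds can be divided directly. A minimal sketch using only the numbers reported above (the dictionary and its keys are illustrative):

```python
# Generation speed in tokens per second for RWKV v6 1.6B, copied from the table above.
tokens_per_second = {
    "att-a16w8 + ffn-a16w4": 42.4368,
    "a16w8": 31.6564,
    "fp16": 15.0434,
}

baseline = tokens_per_second["fp16"]

# Speedup of each precision relative to the fp16 baseline.
speedup_vs_fp16 = {
    precision: speed / baseline for precision, speed in tokens_per_second.items()
}

for precision, speedup in speedup_vs_fp16.items():
    print(f"{precision}: {speedup:.2f}x")
```

On these figures the mixed a16w8/a16w4 configuration generates about 2.8 times as fast as fp16, and a16w8 about 2.1 times as fast.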

Obsolete data from previous versions, for comparison:

| Model | Precision | Generation tokens per second | LAMBADA ppl, acc |
| --- | --- | --- | --- |
| RWKV v6 1.6B | att-a16w8 + ffn-a16w4 | 32.6703 | 4.65837, 66.7378% |
| RWKV v6 1.6B | a16w8 | 26.0707 | 4.59243, 67.3006% |
| RWKV v6 1.6B | fp16 | 15.0434 | 4.63598, 67.2618% |
| RWKV v6 3B | att-a16w8 + ffn-a16w4 | 17.3968 | 4.46606, 68.8725% |
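The change between the obsolete and the current measurements can be expressed as a ratio per configuration. A minimal sketch, using only rows that appear in both sets of results (the tuple keys are illustrative):

```python
# (model, precision) -> (previous tokens/s, current tokens/s), copied from the two result sets.
generation_speed = {
    ("RWKV v6 1.6B", "att-a16w8 + ffn-a16w4"): (32.6703, 42.4368),
    ("RWKV v6 1.6B", "a16w8"): (26.0707, 31.6564),
    ("RWKV v6 3B", "att-a16w8 + ffn-a16w4"): (17.3968, 21.3172),
}

# Ratio of current to previous generation speed for each configuration.
improvement = {
    key: current / previous for key, (previous, current) in generation_speed.items()
}

for (model, precision), ratio in improvement.items():
    print(f"{model} {precision}: {ratio:.2f}x the previous speed")
```

On these figures the current version generates roughly 1.2 to 1.3 times as fast as the previous one. The fp16 row is identical in both sets, so it is left out.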

TODO