### Convert the model

Usage of `convert_model.py`:

```
usage: convert_model.py [-h] [--chunks CHUNKS] [--use_qnn_quant] [--act_bitwidth ACT_BITWIDTH] [--weights_bitwidth WEIGHTS_BITWIDTH] [--ext_embedding] model
```

Example:

```
python convert_model.py ../models/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth --chunks 4
```
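The usage string also exposes an `--ext_embedding` flag for keeping the embedding table outside the compiled graph; the `.emb` file pushed in the on-device step later in this guide appears to come from this mode. A minimal sketch, assuming the flag simply combines with a normal chunked conversion:

```
# Sketch: chunked conversion with the embedding kept external.
# The resulting .emb file must be pushed to the device alongside
# the context binaries (see the on-device step below).
python convert_model.py ../models/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth \
    --chunks 2 --ext_embedding
```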
#### (Optional) Convert with quantization

Make calibration samples with `make_calibration_samples.py`:

```
usage: make_calibration_samples.py [-h] [--ext_embedding] model output chunks
```

Example:

```
python make_calibration_samples.py ../models/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth ./samples_1b6 2
```

Then convert the model with quantization enabled:

```
python convert_model.py ../models/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth --chunks 2 --use_qnn_quant --calib_data_path ./samples_1b6
```
Note: please keep the `chunks` parameter the same in both scripts.
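`convert_model.py` also exposes `--act_bitwidth` and `--weights_bitwidth` (see the usage string above), which the a16w8 rows in the benchmark tables below presumably correspond to. A hedged sketch of an explicit a16w8 conversion; the default bitwidths are not stated here, so check the script's `--help` output:

```
# Sketch: quantized conversion with explicit bitwidths
# (16-bit activations, 8-bit weights).
python convert_model.py ../models/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth \
    --chunks 2 --use_qnn_quant --calib_data_path ./samples_1b6 \
    --act_bitwidth 16 --weights_bitwidth 8
```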
#### (Optional) Convert with AIMET w4 quantization

Make calibration samples the same way:

```
python make_calibration_samples.py ../models/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth ./samples_1b6 2
```

Then convert the model, adding the linear-layer encodings to the quantized conversion command above:

```
python convert_model.py ../models/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth --chunks 2 --use_qnn_quant --calib_data_path ./samples_1b6 --linear_param_encodings quant_encodings/RWKV-x060-World-1B6-v2.1-20240328-ctx4096_mse_rwkv_gptq_exceptions_asym_torch_w4.encodings
```

(The quantization encodings are either the pre-calculated ones from GDrive, or generated using AIMET; refer to AIMET_quant.md.) Again, keep the `chunks` parameter the same in both scripts.

The outputs will be in the `lib/` directory. A model library contains the weights, as well as the functions to prepare the graph. It can either be loaded on device using the libraries in `lib/aarch64-android/`, or prepared on an x86 host machine using `lib/x86_64-linux-clang/` to generate an HTP context cache. Qualcomm HTP has a limit on the size of a model library file, so the model is split into multiple chunks.
### Generate the HTP context cache

Usage of `make_context_cache_binary.py`:

```
usage: make_context_cache_binary.py [-h] model_lib output_path {SM8650,SM8550,SC8380}
```

Example:

```
$ python make_context_cache_binary.py ./lib/x86_64-linux-clang/libRWKV-x060-World-1B6-v2.1-20240328-ctx4096_chunk1of2.so output/ SM8650
```

The outputs will be `output/RWKV-x060-World-1B6-v2.1-20240328-ctx4096_chunk1of2.bin` and `output/RWKV-x060-World-1B6-v2.1-20240328-ctx4096_chunk2of2.bin`.
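The last positional argument selects the target SoC; `{SM8650,SM8550,SC8380}` in the usage string are the supported choices. A sketch of the same conversion for a Snapdragon 8 Gen 2 (SM8550) device, paths unchanged from the example above:

```
# Same model library, built for the SM8550 target instead of SM8650.
python make_context_cache_binary.py \
    ./lib/x86_64-linux-clang/libRWKV-x060-World-1B6-v2.1-20240328-ctx4096_chunk1of2.so \
    output/ SM8550
```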
### Build the demo

```
make -C librwkv-qualcomm
```
### Run on device

Push the demo binary, the model chunks, and the tokenizer vocabulary:

```
adb push librwkv-qualcomm/obj/local/arm64-v8a/rwkv-qualcomm-demo /data/local/tmp/
adb push output/RWKV-x060-World-1B6-v2.1-20240328-ctx4096_chunk1of2.bin /data/local/tmp/
adb push output/RWKV-x060-World-1B6-v2.1-20240328-ctx4096_chunk2of2.bin /data/local/tmp/
adb push assets/brwkv_vocab_v20230424.txt /data/local/tmp/
```
Push these QNN libraries to `/data/local/tmp/` (please change the HTP V75 version to the one your device has):

```
/opt/qcom/aistack/qairt/2.22.6.240515/lib/aarch64-android/libQnnHtpNetRunExtensions.so
/opt/qcom/aistack/qairt/2.22.6.240515/lib/aarch64-android/libQnnSystem.so
/opt/qcom/aistack/qairt/2.22.6.240515/lib/aarch64-android/libQnnHtpV75Stub.so
/opt/qcom/aistack/qairt/2.22.6.240515/lib/hexagon-v75/unsigned/libQnnHtpV75Skel.so
```
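A hedged convenience sketch for pushing those files in one loop; the `QNN_SDK_ROOT` variable is an assumption, so substitute your own install path and HTP version:

```
# Push the QNN runtime libraries listed above.
QNN_SDK_ROOT=/opt/qcom/aistack/qairt/2.22.6.240515   # adjust to your install
for f in \
    "$QNN_SDK_ROOT"/lib/aarch64-android/libQnnHtpNetRunExtensions.so \
    "$QNN_SDK_ROOT"/lib/aarch64-android/libQnnSystem.so \
    "$QNN_SDK_ROOT"/lib/aarch64-android/libQnnHtpV75Stub.so \
    "$QNN_SDK_ROOT"/lib/hexagon-v75/unsigned/libQnnHtpV75Skel.so; do
  adb push "$f" /data/local/tmp/
done
```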
If the model was converted with `--ext_embedding`, push `onnx/RWKV-x060-World-1B6-v2.1-20240328-ctx4096_chunk1of2.emb` to `/data/local/tmp/rwkv/` too.
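A sketch of that push; the `mkdir -p` just ensures the target directory from the note above exists:

```
# Create the target directory, then push the external embedding file.
adb shell mkdir -p /data/local/tmp/rwkv
adb push onnx/RWKV-x060-World-1B6-v2.1-20240328-ctx4096_chunk1of2.emb /data/local/tmp/rwkv/
```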
Then run the demo:

```
adb shell
$ cd /data/local/tmp
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/data/local/tmp
$ # Specify the path to the first model chunk. The second chunk will be loaded automatically.
$ ./rwkv-qualcomm-demo brwkv_vocab_v20230424.txt RWKV-x060-World-1B6-v2.1-20240328-ctx4096_chunk1of2.bin
```

#### Example output (RWKV v6 1.6B A16W4)

```
130|houji:/data/local/tmp/rwkv $ ./rwkv-qualcomm-demo brwkv_vocab_v20230424.txt RWKV-x060-World-1B6-v2.1-20240328-ctx4096_chunk1of2.bin
Loading model context binary from RWKV-x060-World-1B6-v2.1-20240328-ctx4096_chunk1of2.bin
Reading chunk: RWKV-x060-World-1B6-v2.1-20240328-ctx4096_chunk1of2.bin
Buffer size: 719802320
Reading chunk: RWKV-x060-World-1B6-v2.1-20240328-ctx4096_chunk2of2.bin
Buffer size: 586727640
User: 请为我写一首诗。
Assistant: 当然,请告诉我你喜欢什么类型的诗歌。
User: 请写一首描写秋天景色的诗。
Assistant: 秋意渐浓,寒意渐深,
大地已是金黄如火,
落英纷飞,树影绰约,
人心也随之变得清静。
夜空中的繁星在闪闪,
思念似要被所有握住,
但又像是永不消散的孤注,
在这个秋天里如此特别。
请问这首诗符合您需求吗?
Average time per token: 0.0235644s
Average tokens per second: 42.4368
```
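(As a quick sanity check, the two summary lines are reciprocals: 1 / 0.0235644 s ≈ 42.4368 tokens per second, matching the reported throughput.)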
**Running on the Qualcomm Snapdragon SM8650 with HTP v75 (Xiaomi Mi 14):**

| Model | Precision | Generation Tokens per second | LAMBADA ppl, acc |
|---|---|---|---|
| RWKV v6 1.6B | att-a16w8 + ffn-a16w4 | 42.4368 | TODO |
| RWKV v6 1.6B | a16w8 | 31.6564 | 4.75009, 66.3497% |
| RWKV v6 1.6B | fp16 | 15.0434 | 4.63598, 67.2618% |
| RWKV v6 3B | att-a16w8 + ffn-a16w4 | 21.3172 | TODO |
| RWKV v6 3B | a16w8 | 16.2146 | TODO |
| Model | Precision | Generation Tokens per second | LAMBADA ppl, acc |
|---|---|---|---|
| RWKV v6 1.6B | att-a16w8 + ffn-a16w4 | 32.6703 | 4.65837, 66.7378% |
| RWKV v6 1.6B | a16w8 | 26.0707 | 4.59243, 67.3006% |
| RWKV v6 1.6B | fp16 | 15.0434 | 4.63598, 67.2618% |
| RWKV v6 3B | att-a16w8 + ffn-a16w4 | 17.3968 | 4.46606, 68.8725% |
```
Average tokens per second: 50.7313
Average tokens per second: 142.286
```