RWKV / rwkv.cpp

INT4/INT5/INT8 and FP16 inference on CPU for RWKV language model
MIT License

Add wkv v5 custom operator #148

Closed by saharNooby 10 months ago

saharNooby commented 10 months ago

Benchmarks before and after the change (model: RWKV-5-World-3B-v2-OnlyForTest_86%_trained-20231108-ctx4096-Q5_1.bin, CPU, 24 threads):

| sequence_length | Allocated before | Allocated after | ms/token before | ms/token after |
|---|---|---|---|---|
| 1 | 180 MB | 102 MB | 89 | 70 |
| 2 | 290 MB | 112 MB | 59 | 37 |
| 8 | 946 MB | 170 MB | 41 | 17 |
| 32 | 3568 MB | 399 MB | 45 | 14 |
| 64 | 7064 MB | 706 MB | 49 | 14 |
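For context, the operator being fused here is the RWKV v5 "wkv" recurrence. Based on the published RWKV-5 formulation (matrix-valued state per head, per-channel decay `w` and current-token bonus `u`), the computation is roughly the sketch below. This is an illustrative NumPy reference for a single head, not rwkv.cpp's actual C implementation; the function name and argument layout are assumptions.

```python
import numpy as np

def wkv5_naive(r, k, v, w, u, state):
    """Naive per-token RWKV v5 wkv recurrence for one head (illustrative).

    r, k, v : (T, H) receptance/key/value rows for T tokens
    w       : (H,) per-channel decay, 0 < w < 1
    u       : (H,) per-channel bonus applied to the current token only
    state   : (H, H) running key-value state carried between tokens
    Returns (out, state) with out of shape (T, H).
    """
    T, H = r.shape
    out = np.empty((T, H))
    for t in range(T):
        kv = np.outer(k[t], v[t])                 # rank-1 update, (H, H)
        out[t] = r[t] @ (u[:, None] * kv + state)  # current token sees bonus u
        state = w[:, None] * state + kv            # decay old state, add new
    return out, state
```

A fused custom operator can run this loop in place with one fixed-size state buffer, which is consistent with the allocation numbers above growing much more slowly with `sequence_length` after the change: the per-token intermediates of a graph built from many small ops are no longer materialized.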