Before the change (model is RWKV-5-World-3B-v2-OnlyForTest_86%25_trained-20231108-ctx4096-Q5_1.bin):
Will allocate 180 MB
CPU, 24 threads, sequence of 1: 89 ms per token
Will allocate 290 MB (sequence_length = 2)
CPU, 24 threads, sequence of 2: 59 ms per token
Will allocate 946 MB (sequence_length = 8)
CPU, 24 threads, sequence of 8: 41 ms per token
Will allocate 3568 MB (sequence_length = 32)
CPU, 24 threads, sequence of 32: 45 ms per token
Will allocate 7064 MB (sequence_length = 64)
CPU, 24 threads, sequence of 64: 49 ms per token
After the change:
Will allocate 102 MB
CPU, 24 threads, sequence of 1: 70 ms per token
Will allocate 112 MB (sequence_length = 2)
CPU, 24 threads, sequence of 2: 37 ms per token
Will allocate 170 MB (sequence_length = 8)
CPU, 24 threads, sequence of 8: 17 ms per token
Will allocate 399 MB (sequence_length = 32)
CPU, 24 threads, sequence of 32: 14 ms per token
Will allocate 706 MB (sequence_length = 64)
CPU, 24 threads, sequence of 64: 14 ms per token
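
To make the comparison easier to read, the logged figures can be reduced to ratios. The following is a minimal sketch, not part of the change itself: the dictionaries are transcribed by hand from the logs above, and the `compare` helper is a hypothetical name used only for illustration.

```python
# Figures transcribed from the logs above:
# sequence_length -> (allocated MB, ms per token)
before = {1: (180, 89), 2: (290, 59), 8: (946, 41), 32: (3568, 45), 64: (7064, 49)}
after = {1: (102, 70), 2: (112, 37), 8: (170, 17), 32: (399, 14), 64: (706, 14)}


def compare(before, after):
    """Return sequence_length -> (memory ratio, per-token speedup), before / after."""
    rows = {}
    for seq_len, (mem_before, ms_before) in before.items():
        mem_after, ms_after = after[seq_len]
        rows[seq_len] = (mem_before / mem_after, ms_before / ms_after)
    return rows


ratios = compare(before, after)
for seq_len, (mem_ratio, speedup) in ratios.items():
    print(f"sequence of {seq_len}: {mem_ratio:.1f}x less memory, "
          f"{speedup:.1f}x faster per token")
```

On these numbers, both ratios grow with sequence length: the gain is smallest for a sequence of 1 and largest for sequences of 32 and 64.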