RWKV / rwkv.cpp

INT4/INT5/INT8 and FP16 inference on CPU for RWKV language model
MIT License

Add wkv v5 custom operator #148

Closed by saharNooby 10 months ago

saharNooby commented 10 months ago

Benchmarks before and after the change (model: RWKV-5-World-3B-v2-OnlyForTest_86%_trained-20231108-ctx4096-Q5_1.bin, CPU, 24 threads):

| sequence_length | Allocated before | Allocated after | ms/token before | ms/token after |
|---|---|---|---|---|
| 1 | 180 MB | 102 MB | 89 | 70 |
| 2 | 290 MB | 112 MB | 59 | 37 |
| 8 | 946 MB | 170 MB | 41 | 17 |
| 32 | 3568 MB | 399 MB | 45 | 14 |
| 64 | 7064 MB | 706 MB | 49 | 14 |
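For context, the operator being fused here is the RWKV v5 "wkv" recurrence. Based on the published RWKV-5 formulation (matrix-valued state per head, per-channel decay `w` and current-token bonus `u`), the computation is roughly the sketch below. This is an illustrative NumPy reference for a single head, not rwkv.cpp's actual C implementation; the function name and argument layout are assumptions.

```python
import numpy as np

def wkv5_naive(r, k, v, w, u, state):
    """Naive per-token RWKV v5 wkv recurrence for one head (illustrative).

    r, k, v : (T, H) receptance/key/value rows for T tokens
    w       : (H,) per-channel decay, 0 < w < 1
    u       : (H,) per-channel bonus applied to the current token only
    state   : (H, H) running key-value state carried between tokens
    Returns (out, state) with out of shape (T, H).
    """
    T, H = r.shape
    out = np.empty((T, H))
    for t in range(T):
        kv = np.outer(k[t], v[t])                 # rank-1 update, (H, H)
        out[t] = r[t] @ (u[:, None] * kv + state)  # current token sees bonus u
        state = w[:, None] * state + kv            # decay old state, add new
    return out, state
```

A fused custom operator can run this loop in place with one fixed-size state buffer, which is consistent with the allocation numbers above growing much more slowly with `sequence_length` after the change: the per-token intermediates of a graph built from many small ops are no longer materialized.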