RWKV / rwkv.cpp

INT4/INT5/INT8 and FP16 inference on CPU for RWKV language model
MIT License

Basic Samplers? #61

Closed ArEnSc closed 1 year ago

ArEnSc commented 1 year ago

Hey, I noticed this doesn't seem to contain samplers in C. Would it be difficult to implement them? Why not just copy the llama samplers? Likely a stupid question! I'm not a C++ or ggml pro, sorry.

saharNooby commented 1 year ago

The issue with having complete inference on the C side is implementing the tokenizer. I tried to implement it, but got stuck on Unicode NFC normalization and proper regexes with Unicode support. The C/C++ ecosystem is not pretty for such tasks if the goal is minimal, single-file code.

A tokenizer in C can be approximated: we can ignore non-Latin characters and normalization, for example. But I would rather have no tokenizer at all than a half-working one.

Without a properly working tokenizer, I see no value in having sampling code in C. What's the point of sampling tokens if you still need to go to Python for decoding? Better to do the sampling in Python too...

BTW, if someone wants to take a shot at implementing a proper BPE tokenizer, here is a pure Python implementation that can be ported: gist

LoganDark commented 1 year ago

> BTW, if someone wants to take a shot at implementing a proper BPE tokenizer, here is a pure Python implementation that can be ported: gist

thank you so much for this <3 the test cases are awesome!!

Image: https://github.com/saharNooby/rwkv.cpp/assets/4723091/4051fb3d-a131-41d1-ae15-806da7b01161

LoganDark commented 1 year ago

> The issue with having complete inference on the C side is implementing the tokenizer. I tried to implement it, but got stuck on Unicode NFC normalization and proper regexes with Unicode support. The C/C++ ecosystem is not pretty for such tasks if the goal is minimal, single-file code.
>
> A tokenizer in C can be approximated: we can ignore non-Latin characters and normalization, for example. But I would rather have no tokenizer at all than a half-working one.
>
> Without a properly working tokenizer, I see no value in having sampling code in C. What's the point of sampling tokens if you still need to go to Python for decoding? Better to do the sampling in Python too...
>
> BTW, if someone wants to take a shot at implementing a proper BPE tokenizer, here is a pure Python implementation that can be ported: gist

Now that the World tokenizer exists, it is feasible to have a tokenizer implementation in rwkv.cpp (it is only around 100 lines). I just finished a proof of concept here, but I swear you do not want to look at the header files it includes. It is one stage of madness away from a deterministic finite automaton (which would run in linear time but probably use gigabytes of memory).

I'm still in love with the end result, which requires no runtime parsing and is probably an order of magnitude faster than the Python trie implementation, but maybe I sacrificed a lot to get there.
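For readers who want a feel for what a trie-based tokenizer involves, here is a minimal sketch of greedy longest-match tokenization over a byte-level trie, which is my understanding of how the World tokenizer works. It is not the proof of concept mentioned above and not rwkv.cpp code: `TrieNode`, `GreedyTokenizer`, and the toy vocabulary are all hypothetical names and data, and this version builds the trie at runtime rather than generating it ahead of time.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch: one trie node per byte prefix of the vocabulary.
struct TrieNode {
    int token_id = -1;  // -1 means no token ends at this node
    std::map<unsigned char, std::unique_ptr<TrieNode>> children;
};

class GreedyTokenizer {
public:
    // vocab holds (token bytes, token id) pairs. A real vocabulary would be
    // loaded from the tokenizer's vocabulary file; this sketch takes it as-is.
    explicit GreedyTokenizer(const std::vector<std::pair<std::string, int>>& vocab) {
        for (const auto& entry : vocab) {
            assert(entry.second >= 0);  // -1 is reserved as the "no token" marker
            TrieNode* node = &root_;
            for (unsigned char byte : entry.first) {
                auto& child = node->children[byte];
                if (!child) child = std::make_unique<TrieNode>();
                node = child.get();
            }
            node->token_id = entry.second;
        }
    }

    // At each position, append the id of the longest vocabulary entry that
    // matches. Returns false if no entry matches at some position, which
    // cannot happen when the vocabulary contains every single byte.
    bool encode(const std::string& text, std::vector<int>& out) const {
        std::size_t pos = 0;
        while (pos < text.size()) {
            const TrieNode* node = &root_;
            int best_id = -1;
            std::size_t best_len = 0;
            for (std::size_t i = pos; i < text.size(); ++i) {
                auto it = node->children.find(static_cast<unsigned char>(text[i]));
                if (it == node->children.end()) break;
                node = it->second.get();
                if (node->token_id >= 0) {  // remember the longest match so far
                    best_id = node->token_id;
                    best_len = i - pos + 1;
                }
            }
            if (best_id < 0) return false;
            out.push_back(best_id);
            pos += best_len;
        }
        return true;
    }

private:
    TrieNode root_;
};
```

With a toy vocabulary of `a`, `b`, `ab`, `abc`, and `c`, encoding `abcabb` picks `abc`, then `ab`, then `b`. Because matching works on raw bytes, this approach needs neither Unicode normalization nor regexes, which is what makes it so much easier to port than BPE.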

Both top-p and top-k token sampling would be dead simple from there, and would make it possible to cut out Python entirely, given already-preprocessed model files.
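To give an idea of how little code that is, here is a rough sketch of temperature, top-k, and top-p (nucleus) sampling over a vector of logits. `sample_token` is a hypothetical helper, not an rwkv.cpp or llama.cpp function, and it assumes the logits have already been copied out of the model into a `std::vector<float>`.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper: picks a token id from raw logits.
// temperature <= 0 means greedy decoding; top_k == 0 disables top-k;
// top_p >= 1 disables top-p.
std::size_t sample_token(const std::vector<float>& logits, float temperature,
                         std::size_t top_k, float top_p, std::mt19937& rng) {
    assert(!logits.empty());

    // Token ids ordered by descending logit (ties keep the lower id first).
    std::vector<std::size_t> ids(logits.size());
    std::iota(ids.begin(), ids.end(), std::size_t{0});
    std::stable_sort(ids.begin(), ids.end(), [&](std::size_t a, std::size_t b) {
        return logits[a] > logits[b];
    });

    if (temperature <= 0.0f) return ids[0];

    // Top-k: keep only the k most likely candidates.
    if (top_k > 0 && top_k < ids.size()) ids.resize(top_k);

    // Softmax over the survivors, shifted by the max logit for stability.
    const float max_logit = logits[ids[0]];
    std::vector<double> probs(ids.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < ids.size(); ++i) {
        probs[i] = std::exp((logits[ids[i]] - max_logit) / temperature);
        sum += probs[i];
    }
    for (double& p : probs) p /= sum;

    // Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    if (top_p < 1.0f) {
        double cumulative = 0.0;
        std::size_t keep = 0;
        while (keep < probs.size()) {
            cumulative += probs[keep++];
            if (cumulative >= top_p) break;
        }
        probs.resize(keep);
    }

    // discrete_distribution normalizes the remaining weights itself.
    std::discrete_distribution<std::size_t> dist(probs.begin(), probs.end());
    return ids[dist(rng)];
}
```

Sorting the whole vocabulary on every token is the simplest thing that works; a partial sort or a heap would likely be faster when top-k is small, but that is an optimization on top of the same idea.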

saharNooby commented 1 year ago

Duplicated by "Can we have an example of pure C ++?" (#112); pure C++ inference implies tokenizing and basic sampling.