Closed ArEnSc closed 1 year ago
The issue with having complete inference on the C side is implementing the tokenizer. I tried to implement it, but got stuck on Unicode NFC normalization and proper regexes with Unicode support. The C/C++ ecosystem is not pretty for such tasks if the goal is to have minimal, single-file code.
A tokenizer implementation in C could be approximate: we could ignore non-Latin characters and normalization, for example. But I would rather have no tokenizer at all than a half-working one.
Without a properly working tokenizer, I see no value in having sampling code in C. What's the point of sampling tokens if you still need to go to Python for decoding? Better then to do the sampling in Python too...
BTW, if someone wants to take a shot at implementing a proper BPE tokenizer, here is a pure Python impl that can be ported: gist
thank you so much for this <3 the test cases are awesome!!
https://github.com/saharNooby/rwkv.cpp/assets/4723091/4051fb3d-a131-41d1-ae15-806da7b01161
Now that the world tokenizer exists, it is feasible to have a tokenizer implementation in rwkv.cpp (it is only around 100 lines). I just finished a proof of concept here, but I swear you do not want to look at the header files it includes. It is one stage of madness away from deterministic finite automata (which would take linear time but probably use gigabytes of memory).
I'm still in love with the end result, which requires no runtime parsing and is probably an order of magnitude faster than the Python trie implementation, but maybe I sacrificed a lot to get there.
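For anyone who wants a readable starting point, here is a rough sketch of a trie-based tokenizer in C. It assumes the tokenizer works by greedy longest match over raw bytes, which is my understanding of the trie approach mentioned above, not something checked against the reference code. The names `trie_node`, `trie_new`, `trie_insert`, `trie_encode` and `trie_free` are hypothetical, the token ids in the usage note are made up, and this is not the proof of concept linked above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical byte-level trie node: one child slot per possible byte value. */
typedef struct trie_node {
    struct trie_node *next[256];
    int token_id; /* -1 when no vocabulary entry ends at this node */
} trie_node;

trie_node *trie_new(void) {
    trie_node *n = malloc(sizeof *n);
    if (!n) return NULL;
    for (int i = 0; i < 256; i++) n->next[i] = NULL;
    n->token_id = -1;
    return n;
}

void trie_free(trie_node *n) {
    if (!n) return;
    for (int i = 0; i < 256; i++) trie_free(n->next[i]);
    free(n);
}

/* Adds one vocabulary entry. Returns 1 on success, 0 on allocation failure. */
int trie_insert(trie_node *root, const unsigned char *bytes, size_t len, int token_id) {
    assert(root != NULL && token_id >= 0);
    trie_node *n = root;
    for (size_t i = 0; i < len; i++) {
        if (!n->next[bytes[i]]) {
            n->next[bytes[i]] = trie_new();
            if (!n->next[bytes[i]]) return 0;
        }
        n = n->next[bytes[i]];
    }
    n->token_id = token_id;
    return 1;
}

/* Greedy longest match: at each position, take the longest vocabulary entry
 * that is a prefix of the remaining text. Returns the number of tokens written,
 * or -1 if a byte cannot be matched or `out` is too small. */
int trie_encode(const trie_node *root, const char *text, int *out, int max_out) {
    const unsigned char *s = (const unsigned char *)text;
    size_t len = strlen(text), pos = 0;
    int n_out = 0;
    while (pos < len) {
        const trie_node *n = root;
        int best_id = -1;
        size_t best_len = 0;
        /* Walk as deep as the text allows, remembering the last complete entry. */
        for (size_t i = pos; i < len && (n = n->next[s[i]]) != NULL; i++) {
            if (n->token_id >= 0) {
                best_id = n->token_id;
                best_len = i - pos + 1;
            }
        }
        if (best_id < 0 || n_out >= max_out) return -1;
        out[n_out++] = best_id;
        pos += best_len;
    }
    return n_out;
}
```

With a toy vocabulary of `a`=1, `b`=2, `ab`=3, `abc`=4, `c`=5, encoding `abcab` yields the two tokens 4 and 3. A real port would load the actual vocabulary file and would need to handle entries containing NUL bytes, which `strlen` cannot.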
Both top-p and top-k token sampling would be dead simple from there, and would make it possible to cut out Python entirely, given already-preprocessed model files.
Duplicated by "Can we have an example of pure C ++?" (#112); pure C++ inference implies tokenizing and basic sampling.
Hey, I noticed this doesn't seem to contain samplers in C. I was wondering, would it be difficult to implement? Why not just copy the llama samplers? Stupid question, likely! I am not a C++ or ggml pro, sorry.