alasdairforsythe / tokenmonster

Ungreedy subword tokenizer and vocabulary trainer for Python, Go & Javascript
MIT License

C implementation #17

Open abb128 opened 1 year ago

abb128 commented 1 year ago

Great work! I noticed, however, that there's no implementation in C or C++, only in higher-level languages, which may make it difficult to integrate into projects like llama.cpp. Is this something being worked on?

alasdairforsythe commented 1 year ago

It will eventually get a native C implementation, but not in the near future. In the meantime, it should be possible to export the Go implementation and link it from C via a C wrapper. I believe that would work, but I've not looked into it yet. Let me know if you have any suggestions on how best to do it.