cryscan / web-rwkv

Implementation of the RWKV language model in pure WebGPU/Rust.
Other
210 stars 15 forks

Huggingface tokenizer support #23

Open melang982 opened 3 months ago

melang982 commented 3 months ago

This PR adds Huggingface tokenizer support. It is useful for RWKV models that were trained with a custom tokenizer, especially since the RWKV tokenizer training code is not available. It also helps with experiments such as per-character tokenizers, or custom datasets such as music, time series, and rare languages.

I checked it with ai00_server and an RWKV model that I trained from scratch with a BBPE HF tokenizer, and it works 🎉

cryscan commented 3 months ago

Thanks! Actually, the tokenizer is not really a "hard dependency" for applications built upon web-rwkv; they are always free to implement their own tokenizer. This is because web-rwkv's model APIs only see tokens, not text strings. I will review this after #24 is merged.
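
To illustrate the point about the model only seeing tokens, here is a minimal sketch of an application bringing its own tokenizer. Everything in it is hypothetical and not part of web-rwkv: the `Tokenizer` trait, `ByteTokenizer`, `mock_next_token` (a stand-in for a real model call), and the choice of `u16` as the token id type are all assumptions made for illustration.

```rust
/// Hypothetical application-side tokenizer interface (not a web-rwkv type).
trait Tokenizer {
    fn encode(&self, text: &str) -> Vec<u16>;
    fn decode(&self, tokens: &[u16]) -> String;
}

/// Byte-level tokenizer: each UTF-8 byte becomes one token id in 0..=255.
struct ByteTokenizer;

impl Tokenizer for ByteTokenizer {
    fn encode(&self, text: &str) -> Vec<u16> {
        text.bytes().map(u16::from).collect()
    }

    fn decode(&self, tokens: &[u16]) -> String {
        // Ids above 255 are not valid bytes in this scheme and are skipped.
        let bytes: Vec<u8> = tokens
            .iter()
            .filter(|&&t| t <= 255)
            .map(|&t| t as u8)
            .collect();
        String::from_utf8_lossy(&bytes).into_owned()
    }
}

/// Stand-in for a model: it receives token ids and returns a token id.
/// This mock just repeats the last token; a real model would run inference.
fn mock_next_token(tokens: &[u16]) -> u16 {
    tokens.last().copied().unwrap_or(0)
}

/// The generation loop is generic over the tokenizer; the "model" never
/// touches text, so any tokenizer implementation can be swapped in.
fn generate<T: Tokenizer>(tokenizer: &T, prompt: &str, steps: usize) -> String {
    let mut tokens = tokenizer.encode(prompt);
    for _ in 0..steps {
        let next = mock_next_token(&tokens);
        tokens.push(next);
    }
    tokenizer.decode(&tokens)
}

fn main() {
    let out = generate(&ByteTokenizer, "ab", 2);
    println!("{}", out); // prints "abbb"
}
```

Under these assumptions, a Huggingface tokenizer would be one more `Tokenizer` implementation on the application side, with the generation loop unchanged.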