BBuf / RWKV-World-HF-Tokenizer


how to create tokenizer.json #4

Open xinyinan9527 opened 2 months ago

xinyinan9527 commented 2 months ago

I need to use tokenizer.json in my project, how should I create it?

xinyinan9527 commented 2 months ago

Sorry, when using the rwkv5-world model, I can't get the tokenizer.json.

BBuf commented 2 months ago

You don't need tokenizer.json; RWKV5Tokenizer is a custom implementation. tokenizer.json is equivalent to https://github.com/BBuf/RWKV-World-HF-Tokenizer/blob/main/rwkv5_world_tokenizer/vocab.txt .

xinyinan9527 commented 2 months ago

> You don't need tokenizer.json; RWKV5Tokenizer is a custom implementation. tokenizer.json is equivalent to https://github.com/BBuf/RWKV-World-HF-Tokenizer/blob/main/rwkv5_world_tokenizer/vocab.txt .

Yes, that works nicely. But I want to use metapipe + rwkv5, which requires tokenizer.json. For rwkv4 we have "20B_tokenizer.json", and a "tokenizer.json" can be produced from it via PreTrainedTokenizerFast. Was 20B_tokenizer.json created with BPE? Right now I'm trying to convert the "slow tokenizer" into a "fast tokenizer" to obtain a tokenizer.json for rwkv5, but it seems too hard. This file, perhaps one of your test files, https://github.com/huggingface/transformers/blob/0adb55f21c9c7d2b7ac92d17ab4a37655edce67f/out/tokenizer.json, also seems to come from BPE. Do you have another method? Thank you.
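For context, the rwkv4-style route mentioned above can be sketched as follows: train a tokenizer with the Hugging Face `tokenizers` library, serialize it to tokenizer.json, and wrap it in `PreTrainedTokenizerFast`. This is only a minimal illustration; the training corpus, vocab size, and file name here are made up, and the real 20B_tokenizer.json ships with rwkv4 rather than being trained like this.

```python
# Minimal sketch: build a tiny BPE tokenizer, save it as tokenizer.json,
# and load it back through PreTrainedTokenizerFast (the rwkv4-style path).
# Corpus, vocab size, and file name are illustrative only.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

tok = Tokenizer(models.BPE(unk_token="[UNK]"))
tok.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(special_tokens=["[UNK]"], vocab_size=100)
tok.train_from_iterator(["hello world", "hello tokenizer"], trainer)

tok.save("tokenizer.json")  # the serialized fast-tokenizer file

# Wrap the serialized file so transformers pipelines can consume it.
fast_tok = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer.json", unk_token="[UNK]"
)
ids = fast_tok.encode("hello world")
```

This works for BPE-style tokenizers; the difficulty discussed in this thread is that the rwkv5 world tokenizer is not BPE, so there is no such tokenizer.json to start from.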

BBuf commented 1 month ago

> You don't need tokenizer.json; RWKV5Tokenizer is a custom implementation. tokenizer.json is equivalent to https://github.com/BBuf/RWKV-World-HF-Tokenizer/blob/main/rwkv5_world_tokenizer/vocab.txt .

> Yes, that works nicely. But I want to use metapipe + rwkv5, which requires tokenizer.json. For rwkv4 we have "20B_tokenizer.json", and a "tokenizer.json" can be produced from it via PreTrainedTokenizerFast. Was 20B_tokenizer.json created with BPE? Right now I'm trying to convert the "slow tokenizer" into a "fast tokenizer" to obtain a tokenizer.json for rwkv5, but it seems too hard. This file, perhaps one of your test files, https://github.com/huggingface/transformers/blob/0adb55f21c9c7d2b7ac92d17ab4a37655edce67f/out/tokenizer.json, also seems to come from BPE. Do you have another method? Thank you.

It's very difficult. I tried using tokenizer.json before, but the results were wrong because vocab.txt contains tokens with different encoding styles.
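To illustrate the "different encoding styles" problem: the world-tokenizer vocab files store each entry as a Python literal, some as text strings and some as raw-bytes literals. The sample lines below are made up for illustration, but the mixed str/bytes parsing is the point; JSON strings in tokenizer.json cannot faithfully carry arbitrary raw bytes, which is one reason a naive conversion produces wrong results.

```python
# Sketch: parse vocab lines in the style "<id> <repr> <byte length>",
# where <repr> may be a str literal or a bytes literal.
# These sample lines are illustrative, not copied from the real vocab.txt.
import ast

lines = [
    "1 '\\x00' 1",
    "2 b'\\xff' 1",
    "3 'hello' 5",
]

vocab = {}
for line in lines:
    idx_str, rest = line.split(" ", 1)
    repr_str, _, length = rest.rpartition(" ")
    token = ast.literal_eval(repr_str)  # yields str OR bytes
    if isinstance(token, str):
        token = token.encode("utf-8")   # normalize everything to bytes
    assert len(token) == int(length)    # the stated byte length must match
    vocab[int(idx_str)] = token
```

A byte like `b'\xff'` is not valid UTF-8 on its own, so it has no lossless representation as a JSON string, unlike every token in a BPE tokenizer.json.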