bhavnicksm / chonkie

🦛 CHONK your texts with Chonkie ✨ - The no-nonsense RAG chunking library
https://pypi.org/project/chonkie/
MIT License
1.55k stars · 57 forks

Can I load offline tokenizers in it? #23

Open · a136214808 opened 1 week ago

a136214808 commented 1 week ago

I want to load an offline copy of gpt2, but it won't load directly. Did I do something wrong?

Code:

```python
from autotiktokenizer import AutoTikTokenizer

tokenizer = AutoTikTokenizer.from_pretrained("home/xxx/gpt2")
```

Error:

```
HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': 'home/xxx/gpt2'. Use repo_type argument if needed.
```
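For context, this error usually means the string was treated as a Hugging Face Hub repo id rather than a local path; that typically happens when the path does not exist on disk, and here the leading `/` appears to be missing, so it is a relative path. A minimal stdlib check of that reading (using the path exactly as given in the report):

```python
import os

# The path exactly as given in the report; note there is no leading "/",
# so it is resolved relative to the current working directory.
path = "home/xxx/gpt2"

print(os.path.isabs(path))  # False: relative, not "/home/xxx/gpt2"

# When no matching local directory is found, from_pretrained falls back to
# validating the string as a Hub repo id, which allows at most one "/"
# ("name" or "namespace/name") -- two slashes trigger HFValidationError.
print(path.count("/"))  # 2
```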

bhavnicksm commented 1 week ago

Hey @a136214808!

Yes, you should ideally be able to use an offline tokenizer, but AutoTikTokenizer doesn't support this yet. I'll add it to the issues on AutoTikTokenizer.

For now, if you are using gpt2 itself, I would suggest either using the tiktoken tokenizer directly, or passing the string "gpt2" to the chunker you are using and letting it initialise the optimal tokenizer:

```python
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
```

```python
from chonkie import TokenChunker

chunker = TokenChunker(tokenizer="gpt2")
```
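For illustration only (this is a sketch of the general idea, not Chonkie's actual implementation), token chunking amounts to encoding the text to token ids and slicing them into fixed-size windows, optionally overlapping:

```python
from typing import List

def chunk_tokens(token_ids: List[int], chunk_size: int, overlap: int = 0) -> List[List[int]]:
    """Slice a token-id sequence into fixed-size windows with optional overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    return [token_ids[i:i + chunk_size] for i in range(0, len(token_ids), step)]

# Ten token ids, windows of 4 with an overlap of 1:
print(chunk_tokens(list(range(10)), chunk_size=4, overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9], [9]]
```

Each chunk's ids would then be decoded back to text with whatever tokenizer was configured.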
bhavnicksm commented 1 week ago

Mentioning the issue on AutoTikTokenizer here for tracking: issue