a136214808 opened this issue 1 week ago
Hey @a136214808!
Yes, you should ideally be able to use an offline tokenizer, but the AutoTikTokenizer repository doesn't yet support this. I'll add this to the issues on AutoTikTokenizer.
For now, if you are using gpt2 itself, I would suggest either using the tiktoken tokenizer directly, or passing the string "gpt2" to the chunker you are using and letting it initialise the optimal tokenizer:
```python
# Option 1: use the tiktoken encoding for GPT-2 directly
import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")

# Option 2: pass the string "gpt2" and let the chunker initialise the tokenizer
from chonkie import TokenChunker
chunker = TokenChunker(tokenizer="gpt2")
```
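For completeness, here is a minimal usage sketch with the chunker initialised from the "gpt2" string. The sample text, the `chunk_size` value, and the chunk attribute names are illustrative assumptions rather than anything from this thread, so verify them against your chonkie version:

```python
from chonkie import TokenChunker

# chunk_size is illustrative; adjust to your use case
chunker = TokenChunker(tokenizer="gpt2", chunk_size=512)

chunks = chunker.chunk("Replace this with the text you want to split into token chunks.")
for chunk in chunks:
    # text / token_count are the Chunk fields as I recall them
    print(chunk.token_count, chunk.text[:40])
```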
Mentioning the Issue on AutoTikTokenizer here for tracking: issue
I want to load GPT-2 offline, but it can't be loaded directly. Did I do something wrong?
Code:

```python
from autotiktokenizer import AutoTikTokenizer

tokenizer = AutoTikTokenizer.from_pretrained("home/xxx/gpt2")
```

Error:

```
HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': 'home/xxx/gpt2'. Use `repo_type` argument if needed.
```
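Until AutoTikTokenizer supports local paths, one possible workaround is to load the tokenizer from the local directory with transformers and hand the object to the chunker. This is only a sketch: it assumes /home/xxx/gpt2 (note the leading slash for an absolute path) contains the usual GPT-2 tokenizer files and that chonkie's chunkers accept a Hugging Face tokenizer object.

```python
from transformers import AutoTokenizer
from chonkie import TokenChunker

# Assumption: /home/xxx/gpt2 holds a standard GPT-2 tokenizer (vocab.json, merges.txt, etc.).
# An absolute path is read as a local directory instead of being validated as a Hub repo id.
local_tokenizer = AutoTokenizer.from_pretrained("/home/xxx/gpt2")

# Assumption: TokenChunker accepts a Hugging Face tokenizer object, not just a string name.
chunker = TokenChunker(tokenizer=local_tokenizer)
```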