Closed wbbeyourself closed 1 week ago
Hey @wbbeyourself!
I tried to reproduce the error in a colab but it was working fine.
The colab notebook was running Python 3.10, but on Linux rather than Windows, so maybe there's an environment issue? Could you paste your Python environment?
Even better, if you're able to reproduce the issue in a colab notebook, I can fix it ASAP.
Thanks 😊
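For gathering the environment details requested above, a small script along these lines may help. It is a minimal sketch and not part of chonkie: `collect_environment` and the `PACKAGES` list are hypothetical names, and the package list simply mirrors the libraries mentioned in this thread.

```python
import platform
import sys
from importlib import metadata

# Assumption: these are the packages relevant to the report; adjust as needed.
PACKAGES = ["chonkie", "tiktoken", "autotiktokenizer", "tokenizers", "transformers"]


def collect_environment(packages=PACKAGES):
    """Return a dict with the Python version, OS, and installed package versions."""
    report = {
        "python": sys.version.split()[0],
        "os": platform.platform(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in the active environment.
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    for key, value in collect_environment().items():
        print(f"{key}: {value}")
```

Running it in the same interpreter that raises the error avoids reporting a different environment than the one in use.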
Here is my pip list result
pip list
Package                 Version
----------------------- --------------
asttokens               2.4.1
autotiktokenizer        0.2.1
charset-normalizer      3.3.2
chonkie                 0.1.2
conda                   24.7.1
huggingface-hub         0.26.2
ipython                 8.26.0
Jinja2                  3.1.4
nltk                    3.8.2
numpy                   2.0.2
openai                  1.41.0
pip                     24.0
sentence-transformers   3.3.0
sentencepiece           0.2.0
tiktoken                0.8.0
tokenizers              0.20.3
torch                   2.4.0
transformers            4.46.2
Hey @wbbeyourself,
Sorry, but I am unable to reproduce the issue you are seeing. For now, the workaround would be to load a tokenizers tokenizer instead, since I presume there's some issue in tiktoken interfacing with autotiktokenizer.
You can do so like this:
from chonkie import TokenChunker
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained('gpt2')
chunker = TokenChunker(
    tokenizer=tokenizer,
    chunk_size=512,
    chunk_overlap=128
)

text_content = 'Woah! Chonkie, the chunking library is so cool! I love the tiny hippo hehe.'
print(text_content)
print(type(text_content))
print(len(text_content))

chunks = chunker.chunk(text_content)
for chunk in chunks:
    print(f"Chunk: {chunk.text}")
    print(f"Tokens: {chunk.token_count}")
This should work as well. Let me know if this resolved the issue for you!
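If swapping tokenizers by hand is inconvenient, the same idea can be sketched as a fallback helper. This is an illustration only, not chonkie's API: `encode_with_fallback`, `FakePanic`, and the two stand-in encoders are hypothetical names. It catches `BaseException` because PyO3's `PanicException` derives from `BaseException` rather than `Exception`, so a plain `except Exception` would not intercept it.

```python
def encode_with_fallback(text, encoders):
    """Try each (name, encode_fn) pair in order.

    Returns (name, tokens) from the first encoder that succeeds and raises
    RuntimeError if every encoder fails.
    """
    errors = []
    for name, encode in encoders:
        try:
            return name, encode(text)
        except (KeyboardInterrupt, SystemExit):
            # Never swallow a user interrupt or an interpreter exit.
            raise
        except BaseException as exc:  # also covers panic-style exceptions
            errors.append(f"{name}: {type(exc).__name__}: {exc}")
    raise RuntimeError("all encoders failed: " + "; ".join(errors))


# Stand-ins so the sketch runs without any tokenizer library installed.
class FakePanic(BaseException):
    """Hypothetical stand-in for pyo3_runtime.PanicException."""


def broken_encoder(text):
    raise FakePanic("no entry found for key")


def working_encoder(text):
    return text.split()


name, tokens = encode_with_fallback(
    "tiny hippo",
    [("broken", broken_encoder), ("working", working_encoder)],
)
print(name, tokens)
```

In practice the stand-ins would be replaced by real encode callables, for example one wrapping a tiktoken encoding and one wrapping a `tokenizers` tokenizer; that wiring is left out here because it depends on the installed versions.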
The issue has been closed because it could not be reproduced. Please re-open it if you can reproduce it in a general environment.
Thanks! 😊
env: python 3.10
OS: Windows 11
chonkie version: 0.1.2
Woah! Chonkie, the chunking library is so cool! I love the tiny hippo hehe.
<class 'str'>
75
thread '<unnamed>' panicked at src\lib.rs:82:26:
no entry found for key
stack backtrace:
note: Some details are omitted, run with RUST_BACKTRACE=full for a verbose backtrace.
Traceback (most recent call last):
  File "D:\X_Projects\transformers_demo\chonkie_demo.py", line 25, in <module>
    chunks = chunker.chunk(text_content)
  File "D:\ProgramFiles\miniforge3\lib\site-packages\chonkie\chunker\token.py", line 51, in chunk
    text_tokens = self._encode(text)
  File "D:\ProgramFiles\miniforge3\lib\site-packages\chonkie\chunker\base.py", line 99, in _encode
    return self.tokenizer.encode(text)
  File "D:\ProgramFiles\miniforge3\lib\site-packages\tiktoken\core.py", line 122, in encode
    return self._core_bpe.encode(text, allowed_special)
pyo3_runtime.PanicException: no entry found for key
source code is :