bhavnicksm / chonkie

🦛 CHONK your texts with Chonkie ✨ - The no-nonsense RAG chunking library
https://pypi.org/project/chonkie/

[BUG] pyo3_runtime.PanicException: no entry found for key #31

Closed: wbbeyourself closed this issue 1 week ago

wbbeyourself commented 1 week ago

env: Python 3.10
OS: Windows 11
chonkie version: 0.1.2

Script output before the panic:

Woah! Chonkie, the chunking library is so cool! I love the tiny hippo hehe.
<class 'str'>
75

thread '<unnamed>' panicked at src\lib.rs:82:26:
no entry found for key
stack backtrace:
note: Some details are omitted, run with RUST_BACKTRACE=full for a verbose backtrace.
Traceback (most recent call last):
  File "D:\X_Projects\transformers_demo\chonkie_demo.py", line 25, in <module>
    chunks = chunker.chunk(text_content)
  File "D:\ProgramFiles\miniforge3\lib\site-packages\chonkie\chunker\token.py", line 51, in chunk
    text_tokens = self._encode(text)
  File "D:\ProgramFiles\miniforge3\lib\site-packages\chonkie\chunker\base.py", line 99, in _encode
    return self.tokenizer.encode(text)
  File "D:\ProgramFiles\miniforge3\lib\site-packages\tiktoken\core.py", line 122, in encode
    return self._core_bpe.encode(text, allowed_special)
pyo3_runtime.PanicException: no entry found for key

The source code is:

from chonkie import TokenChunker
from autotiktokenizer import AutoTikTokenizer

tokenizer = AutoTikTokenizer.from_pretrained('gpt2')

chunker = TokenChunker(
    tokenizer=tokenizer,
    chunk_size=512,
    chunk_overlap=128
)

text_content = 'Woah! Chonkie, the chunking library is so cool! I love the tiny hippo hehe.'
print(text_content)
print(type(text_content))
print(len(text_content))

chunks = chunker.chunk(text_content)
for chunk in chunks:
    print(f"Chunk: {chunk.text}")
    print(f"Tokens: {chunk.token_count}")
bhavnicksm commented 1 week ago

Hey @wbbeyourself!

I tried to reproduce the error in a Colab notebook, but it worked fine.

(image: screenshot of the same snippet running successfully in Colab)

The Colab notebook was running Python 3.10, but on Linux rather than Windows, so there may be an environment issue. Could you paste your Python environment?

Even better, if you can reproduce the issue in a Colab notebook, I can fix it ASAP.

Thanks 😊

wbbeyourself commented 1 week ago

Here is my pip list result:

pip list
Package                 Version
----------------------- --------------
asttokens               2.4.1
autotiktokenizer        0.2.1
charset-normalizer      3.3.2
chonkie                 0.1.2
conda                   24.7.1
huggingface-hub         0.26.2
ipython                 8.26.0
Jinja2                  3.1.4
nltk                    3.8.2
numpy                   2.0.2
openai                  1.41.0
pip                     24.0
sentence-transformers   3.3.0
sentencepiece           0.2.0
tiktoken                0.8.0
tokenizers              0.20.3
torch                   2.4.0
transformers            4.46.2
bhavnicksm commented 1 week ago

Hey @wbbeyourself,

Sorry, but I am unable to reproduce the issue you are seeing. For now, the workaround would be to load a tokenizers tokenizer instead, since I presume there's some issue in how tiktoken interfaces with autotiktokenizer.

You can do so like this:

from chonkie import TokenChunker
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained('gpt2')

chunker = TokenChunker(
    tokenizer=tokenizer,
    chunk_size=512,
    chunk_overlap=128
)

text_content = 'Woah! Chonkie, the chunking library is so cool! I love the tiny hippo hehe.'
print(text_content)
print(type(text_content))
print(len(text_content))

chunks = chunker.chunk(text_content)
for chunk in chunks:
    print(f"Chunk: {chunk.text}")
    print(f"Tokens: {chunk.token_count}")

This should work as well. Let me know if this resolves the issue for you!
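
As a quick sanity check (not from the thread; it only relies on the standard tokenizers API), the workaround tokenizer can also be exercised on its own before handing it to Chonkie:

from tokenizers import Tokenizer

# Load the Hugging Face gpt2 tokenizer and encode the sample text directly,
# bypassing Chonkie, to confirm the tokenizer itself works in this environment.
tokenizer = Tokenizer.from_pretrained('gpt2')
encoding = tokenizer.encode('Woah! Chonkie, the chunking library is so cool! I love the tiny hippo hehe.')
print(len(encoding.ids))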

bhavnicksm commented 1 week ago

This issue has been closed because it is not reproducible. Please re-open it if you can reproduce it in a general environment.

Thanks! 😊