Cinnamon / kotaemon

An open-source RAG-based tool for chatting with your documents.
https://cinnamon.github.io/kotaemon/
Apache License 2.0

[BUG] - 'gbk' codec can't decode byte 0x8c in position 2: illegal multibyte sequence when using GraphIndex #256

Open flyboyer opened 1 month ago

flyboyer commented 1 month ago

Description

When I upload a PDF file and build a graph index, the following error occurs during indexing:

Indexing [1/1]: small_test.pdf
 => Converting small_test.pdf to text
 => Converted small_test.pdf to text
 => [small_test.pdf] Processed 2 chunks
 => Finished indexing small_test.pdf
[GraphRAG] Creating index... This can take a long time.
Logging enabled at c:\Users\**\Desktop\small\remote\kotaemon\ktem_app_data\user_data\files\graphrag\8ebbc1ff-2bef-49aa-803a-c72ffcbeb476\output\20240909-162212\reports\indexing-engine.log

Error: 'gbk' codec can't decode byte 0x8c in position 2: illegal multibyte sequence

Are there any constraints or limitations on the uploaded PDF document?
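
For context, this kind of error usually means a UTF-8 file was read with the Windows locale encoding (GBK on Chinese-language systems), which is what Python's open() falls back to when no encoding is passed. The sketch below reproduces the failure mode in isolation. It is an assumption about the cause, not a confirmed diagnosis of where kotaemon or GraphRAG reads the file; SAMPLE, read_with, and demo are hypothetical names used only for illustration.

```python
from __future__ import annotations

import tempfile
from pathlib import Path

# Hypothetical sample text; the leading bullet is three bytes in UTF-8,
# and those bytes do not form a valid GBK sequence.
SAMPLE = "\u2022 Reinforcement learning in robotics"


def read_with(path: Path, encoding: str) -> str:
    """Read a text file with an explicit encoding instead of the locale default."""
    with open(path, encoding=encoding) as handle:
        return handle.read()


def demo() -> tuple[bool, str]:
    """Return (gbk_failed, utf8_text) for one UTF-8 file read both ways."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "chunk.txt"
        path.write_text(SAMPLE, encoding="utf-8")
        try:
            # Equivalent to open(path) on a system whose locale encoding is GBK.
            read_with(path, "gbk")
            gbk_failed = False
        except UnicodeDecodeError as exc:
            print(f"gbk read failed: {exc}")
            gbk_failed = True
        # Passing the encoding explicitly avoids the locale default.
        return gbk_failed, read_with(path, "utf-8")


if __name__ == "__main__":
    print(demo())
```

If the cause is the same here, starting Python in UTF-8 mode (setting the PYTHONUTF8=1 environment variable, or running with -X utf8) makes open() default to UTF-8 without code changes. Whether that resolves this particular issue is not confirmed in this thread.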

RealmX1 commented 1 month ago

I'm encountering the same issue with GraphRAG indexing. The UI doesn't provide enough information to debug it, and I can't find any console output or log file for the GraphRAG indexing process.

The same PDF indexes fine with the normal indexing process.

2013 Reinforcement Learning in Robotics - A Survey.pdf
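
On finding a log: the output in the first comment prints a "Logging enabled at" path ending in reports\indexing-engine.log under the graphrag folder. A minimal sketch for locating the newest such log and printing its last lines follows, assuming the directory layout matches that path (index id, then output, then a timestamped run folder). latest_indexing_log and tail are hypothetical helpers, not part of kotaemon or GraphRAG.

```python
from __future__ import annotations

from pathlib import Path


def latest_indexing_log(graphrag_root: Path) -> Path | None:
    """Return the most recently modified indexing-engine.log, or None if absent.

    Assumes the layout <graphrag_root>/<index id>/output/<run>/reports/.
    """
    logs = list(graphrag_root.glob("*/output/*/reports/indexing-engine.log"))
    return max(logs, key=lambda p: p.stat().st_mtime, default=None)


def tail(path: Path, lines: int = 40) -> list[str]:
    """Return the last `lines` lines, replacing bytes that are not valid UTF-8."""
    text = path.read_text(encoding="utf-8", errors="replace")
    return text.splitlines()[-lines:]


if __name__ == "__main__":
    # Relative path taken from the log line in the first comment; adjust as needed.
    root = Path("ktem_app_data") / "user_data" / "files" / "graphrag"
    log = latest_indexing_log(root)
    if log is None:
        print(f"no indexing-engine.log found under {root}")
    else:
        print(log)
        print("\n".join(tail(log)))
```

Reading with errors="replace" keeps the helper itself from failing on the same kind of decoding error it is meant to investigate.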

zjiang4 commented 1 month ago

Have you solved this yet, cin-jimmy?