Cinnamon / kotaemon

An open-source RAG-based tool for chatting with your documents.
https://cinnamon.github.io/kotaemon/
Apache License 2.0

[BUG] - <Encoding in Windows isn't considered well. We are still users> #340

Open zjiang4 opened 1 month ago

zjiang4 commented 1 month ago

Description

None of the encoding strategies work, and Windows users do not feel welcome. I have tried every way I know of forcing UTF-8 encoding, both adding encoding declarations to the .py files and setting UTF-8 as the default in the environment. Please consider that many users are on Windows 10/11. Traditional RAG works well, but GraphRAG does not work correctly: no output can be generated after indexing. Sad.

For all PDFs I upload:

Indexing [1/1]: semRegularized.pdf
=> Converting semRegularized.pdf to text
=> Converted semRegularized.pdf to text
=> [semRegularized.pdf] Processed 44 chunks
=> Finished indexing semRegularized.pdf
Error: 'gbk' codec can't encode character '\xa9' in position 127: illegal multibyte sequence
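For readers unfamiliar with this class of error, here is a minimal sketch that reproduces it outside kotaemon. It is not kotaemon's or GraphRAG's code; the sample string and the helper name `try_encode` are made up for illustration. It only shows that the copyright sign `'\xa9'` has no mapping in Python's `gbk` codec, while UTF-8 or a lenient error handler avoids the exception.

```python
# Illustration only: reproduces the error class from the log above,
# independent of kotaemon. The sample text is hypothetical.
text = "Copyright \xa9 2024"  # '\xa9' is the copyright sign


def try_encode(s: str, codec: str) -> str:
    """Return 'ok' if s encodes under codec, else the exception class name."""
    try:
        s.encode(codec)
        return "ok"
    except UnicodeEncodeError as exc:
        return type(exc).__name__


# The strict gbk codec rejects the character; UTF-8 accepts it.
print("gbk:  ", try_encode(text, "gbk"))
print("utf-8:", try_encode(text, "utf-8"))

# A lenient error handler substitutes '?' instead of raising.
safe = text.encode("gbk", errors="replace").decode("gbk")
print(safe)
```

This suggests the failure happens wherever text is written through a stream or file that uses the system code page (GBK on a Chinese-locale Windows install) rather than UTF-8.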

For all text files I upload:

D:\anaconda3\envs\kotaemon\lib\site-packages\numpy\core\fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is deprecated and will be removed in a future version. Please use 'DataFrame.transpose' instead.
  return bound(*args, **kwds)
Traceback (most recent call last):
  File "D:\anaconda3\envs\kotaemon\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "D:\anaconda3\envs\kotaemon\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\anaconda3\envs\kotaemon\lib\site-packages\graphrag\index\__main__.py", line 104, in <module>
    index_cli(
  File "D:\anaconda3\envs\kotaemon\lib\site-packages\graphrag\index\cli.py", line 178, in index_cli
    progress_reporter.stop()
  File "D:\anaconda3\envs\kotaemon\lib\site-packages\graphrag\index\progress\rich.py", line 119, in stop
    self._live.stop()
  File "D:\anaconda3\envs\kotaemon\lib\site-packages\rich\live.py", line 147, in stop
    with self.console:
  File "D:\anaconda3\envs\kotaemon\lib\site-packages\rich\console.py", line 864, in __exit__
    self._exit_buffer()
  File "D:\anaconda3\envs\kotaemon\lib\site-packages\rich\console.py", line 822, in _exit_buffer
    self._check_buffer()
  File "D:\anaconda3\envs\kotaemon\lib\site-packages\rich\console.py", line 2024, in _check_buffer
    self._write_buffer()
  File "D:\anaconda3\envs\kotaemon\lib\site-packages\rich\console.py", line 2060, in _write_buffer
    legacy_windows_render(buffer, LegacyWindowsTerm(self.file))
  File "D:\anaconda3\envs\kotaemon\lib\site-packages\rich\_windows_renderer.py", line 19, in legacy_windows_render
    term.write_text(text)
  File "D:\anaconda3\envs\kotaemon\lib\site-packages\rich\_win32_console.py", line 403, in write_text
    self.write(text)
UnicodeEncodeError: 'gbk' codec can't encode character '\u280b' in position 0: illegal multibyte sequence

Logs

D:\anaconda3\envs\kotaemon\lib\site-packages\numpy\core\fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is deprecated and will be removed in a future version. Please use 'DataFrame.transpose' instead.
  return bound(*args, **kwds)
Traceback (most recent call last):
  File "D:\anaconda3\envs\kotaemon\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "D:\anaconda3\envs\kotaemon\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\anaconda3\envs\kotaemon\lib\site-packages\graphrag\index\__main__.py", line 104, in <module>
    index_cli(
  File "D:\anaconda3\envs\kotaemon\lib\site-packages\graphrag\index\cli.py", line 178, in index_cli
    progress_reporter.stop()
  File "D:\anaconda3\envs\kotaemon\lib\site-packages\graphrag\index\progress\rich.py", line 119, in stop
    self._live.stop()
  File "D:\anaconda3\envs\kotaemon\lib\site-packages\rich\live.py", line 147, in stop
    with self.console:
  File "D:\anaconda3\envs\kotaemon\lib\site-packages\rich\console.py", line 864, in __exit__
    self._exit_buffer()
  File "D:\anaconda3\envs\kotaemon\lib\site-packages\rich\console.py", line 822, in _exit_buffer
    self._check_buffer()
  File "D:\anaconda3\envs\kotaemon\lib\site-packages\rich\console.py", line 2024, in _check_buffer
    self._write_buffer()
  File "D:\anaconda3\envs\kotaemon\lib\site-packages\rich\console.py", line 2060, in _write_buffer
    legacy_windows_render(buffer, LegacyWindowsTerm(self.file))
  File "D:\anaconda3\envs\kotaemon\lib\site-packages\rich\_windows_renderer.py", line 19, in legacy_windows_render
    term.write_text(text)
  File "D:\anaconda3\envs\kotaemon\lib\site-packages\rich\_win32_console.py", line 403, in write_text
    self.write(text)
UnicodeEncodeError: 'gbk' codec can't encode character '\u280b' in position 0: illegal multibyte sequence
use_quick_index_mode False
reader_mode default
Using reader <kotaemon.loaders.pdf_loader.PDFThumbnailReader object at 0x000001E2A1E84130>
Page numbers: 22
Got 22 page thumbnails
Adding documents to doc store
indexing step took 6.375478029251099
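
The traceback above shows the indexer being run as a module (`graphrag\index\__main__.py`) and rich failing when it writes the spinner character `'\u280b'` to a stream encoded as GBK. One possible workaround, sketched here under assumptions, is to launch the child Python process with the standard `PYTHONUTF8` and `PYTHONIOENCODING` environment variables set. The helper names `utf8_env` and `child_stdout_encoding` are hypothetical, and this is not how kotaemon actually starts GraphRAG; it only shows that the variables change the child's stdout encoding.

```python
import os
import subprocess
import sys


def utf8_env(base=None):
    """Copy an environment mapping and ask child Pythons to use UTF-8 I/O."""
    env = dict(os.environ if base is None else base)
    env["PYTHONUTF8"] = "1"              # enable Python's UTF-8 mode (3.7+)
    env["PYTHONIOENCODING"] = "utf-8"    # force UTF-8 for stdin/stdout/stderr
    return env


def child_stdout_encoding(env):
    """Start a child interpreter and report the encoding of its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip().lower()


# With the variables set, the child reports UTF-8 instead of the code page.
print(child_stdout_encoding(utf8_env()))
```

Setting the same two variables in the shell before starting the app (for example with `set PYTHONUTF8=1` in cmd) should have a similar effect on every Python process started from that shell, though whether it resolves this particular crash is untested here.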

Browsers

Chrome

OS

Windows

Additional information

No response

chenchunhao9125 commented 1 month ago

I have the same problem

Jainbaba commented 1 month ago

#373 is the same bug; check out the comments on that issue. There is an issue with the llama_index SimpleDirectoryReader class.