HKUDS / LightRAG

"LightRAG: Simple and Fast Retrieval-Augmented Generation"
https://arxiv.org/abs/2410.05779
MIT License

LightRAG indexing causing problematic symbols that don't conform to UTF-8 encoding standards #262

Open daamazonbird opened 1 week ago

daamazonbird commented 1 week ago

Perhaps it's just me, but I think the team needs to investigate how LightRAG handles and generates its indices, especially around character encoding. I can't get past this UTF-8 encoding error. I traced it to the character at index 232 (۬), which the log reports with the value 0x6ec, and I could neither remove nor replace that character.

DEBUG:root:Context segment around index 232: [1453, 7879, 27, 91, 29, 7879, 6259, 7879, 27, 91, 29, 7879, 50, 15918, 311, 769, 575, 1453, 318, 262, 1772, 286, 257, 1492, 3706, 705, 32, 15120, 286, 16156, 6, 7256, 284, 607, 2560, 90, 201, 198, 220, 366]
DEBUG:root:Segment around problematic index: [1453, 7879, 27, 91, 29, 7879, 6259, 7879, 27, 91, 29, 7879, 50, 15918, 311, 769, 575, 1453, 318, 262, 1772, 286, 257, 1492, 3706, 705, 32, 15120, 286, 16156, 6, 7256, 284, 607, 2560, 90, 201, 198, 220, 366]
INFO:root:Character at local index 0: ֭ (byte: 0x5ad)
INFO:root:Character at local index 1: ệ (byte: 0x1ec7)
INFO:root:Character at local index 2: ← (byte: 0x1b)
INFO:root:Character at local index 3: [ (byte: 0x5b)
INFO:root:Character at local index 4: ↔ (byte: 0x1d)
INFO:root:Character at local index 5: ệ (byte: 0x1ec7)
INFO:root:Character at local index 6: ᡳ (byte: 0x1873)
INFO:root:Character at local index 7: ệ (byte: 0x1ec7)
INFO:root:Character at local index 8: ← (byte: 0x1b)
INFO:root:Character at local index 9: [ (byte: 0x5b)
INFO:root:Character at local index 10: ↔ (byte: 0x1d)
INFO:root:Character at local index 11: ệ (byte: 0x1ec7)
INFO:root:Character at local index 12: 2 (byte: 0x32)
INFO:root:Character at local index 13: 㸮 (byte: 0x3e2e)
INFO:root:Character at local index 14: ķ (byte: 0x137)
INFO:root:Character at local index 15: ́ (byte: 0x301)
INFO:root:Character at local index 16: ȿ (byte: 0x23f)
INFO:root:Character at local index 17: ֭ (byte: 0x5ad)
INFO:root:Character at local index 18: ľ (byte: 0x13e)
INFO:root:Character at local index 19: Ć (byte: 0x106)
INFO:root:Character at local index 20: ۬ (byte: 0x6ec)
INFO:root:Character at local index 21: Ğ (byte: 0x11e)
INFO:root:Character at local index 22: ā (byte: 0x101)
INFO:root:Character at local index 23: ה (byte: 0x5d4)
INFO:root:Character at local index 24: ๺ (byte: 0xe7a)
INFO:root:Character at local index 25: ˁ (byte: 0x2c1)
INFO:root:Character at local index 26:   (byte: 0x20)
INFO:root:Character at local index 27: 㬐 (byte: 0x3b10)
INFO:root:Character at local index 28: Ğ (byte: 0x11e)
INFO:root:Character at local index 29: 㼜 (byte: 0x3f1c)
INFO:root:Character at local index 30: ♠ (byte: 0x6)
INFO:root:Character at local index 31: ᱘ (byte: 0x1c58)
INFO:root:Character at local index 32: Ĝ (byte: 0x11c)
INFO:root:Character at local index 33: ɟ (byte: 0x25f)
INFO:root:Character at local index 34: ਀ (byte: 0xa00)
INFO:root:Character at local index 35: Z (byte: 0x5a)
INFO:root:Character at local index 36: É (byte: 0xc9)
INFO:root:Character at local index 37: Æ (byte: 0xc6)
INFO:root:Character at local index 38: Ü (byte: 0xdc)
INFO:root:Character at local index 39: Ů (byte: 0x16e)

The log output provides a segment of the context around index 232 and the corresponding characters and their byte values. The problematic symbols aren't part of the original text but are introduced during indexing.
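A quick check on the numbers in the log above, offered as a sketch rather than a diagnosis: every value in the "byte:" column equals the integer at the same position in the logged segment (for example 1772 == 0x6ec at local index 20). Many of these values exceed 0xFF, so they cannot be single bytes; they read as Unicode code points. This suggests the displayed symbols may come from passing the segment's integers (possibly token IDs) through something like `chr()` in the debugging code, though the issue does not show that code, so this is an assumption. The list below is copied from the DEBUG lines, assuming any number split by terminal wrapping is a single value; the helper names are hypothetical.

```python
# Integers copied from the "Context segment around index 232" DEBUG line.
segment = [
    1453, 7879, 27, 91, 29, 7879, 6259, 7879, 27, 91,
    29, 7879, 50, 15918, 311, 769, 575, 1453, 318, 262,
    1772, 286, 257, 1492, 3706, 705, 32, 15120, 286, 16156,
    6, 7256, 284, 607, 2560, 90, 201, 198, 220, 366,
]

# Values copied from the "(byte: 0x...)" column of the INFO lines, in order.
logged_values = [
    0x5AD, 0x1EC7, 0x1B, 0x5B, 0x1D, 0x1EC7, 0x1873, 0x1EC7, 0x1B, 0x5B,
    0x1D, 0x1EC7, 0x32, 0x3E2E, 0x137, 0x301, 0x23F, 0x5AD, 0x13E, 0x106,
    0x6EC, 0x11E, 0x101, 0x5D4, 0xE7A, 0x2C1, 0x20, 0x3B10, 0x11E, 0x3F1C,
    0x6, 0x1C58, 0x11C, 0x25F, 0xA00, 0x5A, 0xC9, 0xC6, 0xDC, 0x16E,
]


def values_match_segment(ids, values):
    """Return True if the logged values are exactly the segment's integers."""
    return list(ids) == list(values)


def utf8_hex(code_point):
    """Return the UTF-8 encoding of a single code point as a hex string."""
    return chr(code_point).encode("utf-8").hex()


if __name__ == "__main__":
    print("values match segment:", values_match_segment(segment, logged_values))
    # The code point flagged in the report, encoded as UTF-8 bytes.
    print("U+06EC as UTF-8:", utf8_hex(0x6EC))
```

If the match holds, the symbols in the log are a by-product of how the integers were printed, and each one is individually encodable as UTF-8, so the actual encoding error may originate elsewhere.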

LarFii commented 3 days ago

Could you provide more details about the context or specific setup where this issue arises? Is there a way to reliably reproduce the issue? This seems unusual, and providing more detailed information could help identify the root cause and propose a solution.
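One detail that could be attached to a reproduction is whether the input document itself is valid UTF-8 before indexing. The following is a minimal standard-library sketch for that check; it does not use any LightRAG API, and the function name and the file path in the usage note are hypothetical.

```python
def first_invalid_utf8(data: bytes):
    """Return (offset, offending_bytes) for the first invalid UTF-8
    sequence in data, or None if the whole buffer decodes cleanly."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start and exc.end delimit the undecodable byte range.
        return exc.start, data[exc.start:exc.end]
    return None


if __name__ == "__main__":
    print(first_invalid_utf8("plain text".encode("utf-8")))
    print(first_invalid_utf8(b"ab\xffcd"))
```

To check a real input, read it in binary mode, for example `first_invalid_utf8(open("book.txt", "rb").read())`, where `book.txt` stands in for whatever file was indexed. A `None` result would support the claim that the symbols are not in the original text.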