Open daamazonbird opened 1 week ago
Could you provide more details about the context or specific setup where this issue arises? Is there a way to reliably reproduce the issue? This seems unusual, and providing more detailed information could help identify the root cause and propose a solution.
Perhaps it's just me, but I think the team needs to investigate how LightRAG builds and handles its indices, especially around character encoding. I can't get around this UTF-8 encoding error. I traced it to the character at index 232, ۬, which is logged with the value 0x6ec (too large for a single byte, so presumably the code point U+06EC), and I couldn't remove or replace that character.
DEBUG:root:Context segment around index 232: [1453, 7879, 27, 91, 29, 7879, 6259, 7879, 27, 91, 29, 7879, 50, 15918, 311, 769, 575, 1453, 318, 262, 1772, 286, 257, 1492, 3706, 705, 32, 15120, 286, 16156, 6, 7256, 284, 607, 2560, 90, 201, 198, 220, 366]
DEBUG:root:Segment around problematic index: [1453, 7879, 27, 91, 29, 7879, 6259, 7879, 27, 91, 29, 7879, 50, 15918, 311, 769, 575, 1453, 318, 262, 1772, 286, 257, 1492, 3706, 705, 32, 15120, 286, 16156, 6, 7256, 284, 607, 2560, 90, 201, 198, 220, 366]
INFO:root:Character at local index 0: ֭ (byte: 0x5ad)
INFO:root:Character at local index 1: ệ (byte: 0x1ec7)
INFO:root:Character at local index 2: ← (byte: 0x1b)
INFO:root:Character at local index 3: [ (byte: 0x5b)
INFO:root:Character at local index 4: ↔ (byte: 0x1d)
INFO:root:Character at local index 5: ệ (byte: 0x1ec7)
INFO:root:Character at local index 6: ᡳ (byte: 0x1873)
INFO:root:Character at local index 7: ệ (byte: 0x1ec7)
INFO:root:Character at local index 8: ← (byte: 0x1b)
INFO:root:Character at local index 9: [ (byte: 0x5b)
INFO:root:Character at local index 10: ↔ (byte: 0x1d)
INFO:root:Character at local index 11: ệ (byte: 0x1ec7)
INFO:root:Character at local index 12: 2 (byte: 0x32)
INFO:root:Character at local index 13: 㸮 (byte: 0x3e2e)
INFO:root:Character at local index 14: ķ (byte: 0x137)
INFO:root:Character at local index 15: ́ (byte: 0x301)
INFO:root:Character at local index 16: ȿ (byte: 0x23f)
INFO:root:Character at local index 17: ֭ (byte: 0x5ad)
INFO:root:Character at local index 18: ľ (byte: 0x13e)
INFO:root:Character at local index 19: Ć (byte: 0x106)
INFO:root:Character at local index 20: ۬ (byte: 0x6ec)
INFO:root:Character at local index 21: Ğ (byte: 0x11e)
INFO:root:Character at local index 22: ā (byte: 0x101)
INFO:root:Character at local index 23: ה (byte: 0x5d4)
INFO:root:Character at local index 24: (byte: 0xe7a)
INFO:root:Character at local index 25: ˁ (byte: 0x2c1)
INFO:root:Character at local index 26: (byte: 0x20)
INFO:root:Character at local index 27: 㬐 (byte: 0x3b10)
INFO:root:Character at local index 28: Ğ (byte: 0x11e)
INFO:root:Character at local index 29: 㼜 (byte: 0x3f1c)
INFO:root:Character at local index 30: ♠ (byte: 0x6)
INFO:root:Character at local index 31: ᱘ (byte: 0x1c58)
INFO:root:Character at local index 32: Ĝ (byte: 0x11c)
INFO:root:Character at local index 33: ɟ (byte: 0x25f)
INFO:root:Character at local index 34: (byte: 0xa00)
INFO:root:Character at local index 35: Z (byte: 0x5a)
INFO:root:Character at local index 36: É (byte: 0xc9)
INFO:root:Character at local index 37: Æ (byte: 0xc6)
INFO:root:Character at local index 38: Ü (byte: 0xdc)
INFO:root:Character at local index 39: Ů (byte: 0x16e)

The log output provides a segment of the context around index 232 and the corresponding characters and their byte values. The problematic symbols aren't part of the original text but are introduced during indexing.
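
One check that might help narrow this down: every hex value in the INFO lines equals the integer at the same position in the DEBUG segment (for example 1772 == 0x6ec at local index 20). The sketch below reproduces that comparison using only values copied from the log. The helper name `describe` is made up for illustration and is not part of LightRAG. If the integers turn out to be token IDs rather than code points (an assumption, not something the log confirms), the odd symbols would come from passing those IDs through `chr()` in the debug logging rather than from the indexed text itself.

```python
# Integers copied from the DEBUG line in the log
# (numbers split by terminal wrapping are rejoined).
segment = [
    1453, 7879, 27, 91, 29, 7879, 6259, 7879, 27, 91,
    29, 7879, 50, 15918, 311, 769, 575, 1453, 318, 262,
    1772, 286, 257, 1492, 3706, 705, 32, 15120, 286, 16156,
    6, 7256, 284, 607, 2560, 90, 201, 198, 220, 366,
]

# Hex values copied from the "(byte: 0x...)" fields of the INFO lines.
logged = [
    0x5ad, 0x1ec7, 0x1b, 0x5b, 0x1d, 0x1ec7, 0x1873, 0x1ec7, 0x1b, 0x5b,
    0x1d, 0x1ec7, 0x32, 0x3e2e, 0x137, 0x301, 0x23f, 0x5ad, 0x13e, 0x106,
    0x6ec, 0x11e, 0x101, 0x5d4, 0xe7a, 0x2c1, 0x20, 0x3b10, 0x11e, 0x3f1c,
    0x6, 0x1c58, 0x11c, 0x25f, 0xa00, 0x5a, 0xc9, 0xc6, 0xdc, 0x16e,
]


def describe(values):
    """Hypothetical helper: treat each int as a Unicode code point."""
    return [(i, chr(v), hex(v)) for i, v in enumerate(values)]


# The logged "byte" values are the segment integers themselves.
print(segment == logged)

# Local index 20 is the entry reported as problematic.
rows = describe(segment)
print(rows[20])

# The code point U+06EC is itself valid UTF-8 when encoded normally.
print(chr(1772).encode("utf-8"))
```

Since `chr(1772)` encodes to valid UTF-8 on its own, the encoding error is probably raised somewhere else in the pipeline, so the full traceback would be more useful than the character dump for pinning down the cause.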