HKUDS / LightRAG

"LightRAG: Simple and Fast Retrieval-Augmented Generation"
https://arxiv.org/abs/2410.05779
MIT License

bug fix issue #95 (#215)

Closed: benx13 closed this 2 weeks ago

benx13 commented 2 weeks ago

LightRAG Bug Fix Report

Issue

A TypeError occurred in hybrid query mode when accessing the content of text units whose data was None. The error was raised in the _find_most_related_text_unit_from_entities function while processing text units for token size truncation.

Root Cause

The issue stemmed from insufficient null checks when processing text units in the knowledge graph. Specifically:

  1. Text unit data could be None when retrieved from text_chunks_db
  2. The data dictionary could be missing the 'content' field
  3. No proper filtering of invalid entries before token size truncation
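
The first two cases can be reproduced in isolation. This is a minimal sketch, not LightRAG code: the lookup contents and the helper name `content_of` are hypothetical, and only the `x["data"]["content"]` access pattern is taken from the truncation step in operate.py.

```python
# Hypothetical lookup shaped like all_text_units_lookup: chunk id -> entry.
# "data" stands in for whatever text_chunks_db returned for that id.
lookup = {
    "chunk-1": {"data": {"content": "valid text"}, "order": 0},
    "chunk-2": {"data": None, "order": 1},           # case 1: chunk missing from storage
    "chunk-3": {"data": {"tokens": 3}, "order": 2},  # case 2: no 'content' field
}


def content_of(unit):
    # Same access pattern as the key function passed to the truncation step.
    return unit["data"]["content"]


# Record which exception type each entry triggers, if any.
errors = {}
for chunk_id, unit in lookup.items():
    try:
        content_of(unit)
    except (TypeError, KeyError) as exc:
        errors[chunk_id] = type(exc).__name__

print(errors)
```

Under these assumptions, the entry whose data is None raises the TypeError described in the report, while the entry without a 'content' field raises a KeyError instead, so a fix has to cover both.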

The key problematic area was in:

591:597:LightRAG/lightrag/operate.py

    if any([v is None for v in all_text_units_lookup.values()]):
        logger.warning("Text chunks are missing, maybe the storage is damaged")
    all_text_units = [
        {"id": k, **v} for k, v in all_text_units_lookup.items() if v is not None
    ]
    all_text_units = sorted(
        all_text_units, key=lambda x: (x["order"], -x["relation_counts"])

Solution

Added comprehensive null checks and data validation throughout the text unit processing pipeline:

  1. Added null check for node data and source_id field:

571:575:LightRAG/lightrag/operate.py

        for k, v in zip(all_one_hop_nodes, all_one_hop_nodes_data)
        if v is not None
    }
    all_text_units_lookup = {}
    for index, (this_text_units, this_edges) in enumerate(zip(text_units, edges)):

  2. Added content validation when getting chunk data:

591:597:LightRAG/lightrag/operate.py

    if any([v is None for v in all_text_units_lookup.values()]):
        logger.warning("Text chunks are missing, maybe the storage is damaged")
    all_text_units = [
        {"id": k, **v} for k, v in all_text_units_lookup.items() if v is not None
    ]
    all_text_units = sorted(
        all_text_units, key=lambda x: (x["order"], -x["relation_counts"])

  3. Added comprehensive filtering for None values:

599:604:LightRAG/lightrag/operate.py

    all_text_units = truncate_list_by_token_size(
        all_text_units,
        key=lambda x: x["data"]["content"],
        max_token_size=query_param.max_token_for_text_unit,
    )
    all_text_units: list[TextChunkSchema] = [t["data"] for t in all_text_units]
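
The excerpts above show the referenced line ranges rather than the diff itself, so the following is only a sketch of what the described filtering could look like. The helper name `filter_valid_text_units` is hypothetical and is not part of LightRAG; it assumes each unit is a dict with a "data" entry, as in the truncation call above.

```python
def filter_valid_text_units(units):
    """Keep only units whose "data" is a dict with a non-None "content"."""
    valid = []
    for unit in units:
        if unit is None:
            continue  # entry missing entirely
        data = unit.get("data")
        if not isinstance(data, dict):
            continue  # chunk lookup returned None (or something unexpected)
        if data.get("content") is None:
            continue  # 'content' field absent or empty
        valid.append(unit)
    return valid


# Hypothetical input covering the cases listed under Root Cause.
units = [
    {"id": "a", "data": {"content": "hello"}},
    {"id": "b", "data": None},
    {"id": "c", "data": {}},
    None,
]
kept = filter_valid_text_units(units)
print([u["id"] for u in kept])
```

Applying a filter like this before the token size truncation would mean the key function `lambda x: x["data"]["content"]` only ever sees entries that have content.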

The changes are backward compatible and require no modifications to the existing API or data structures.

benx13 commented 2 weeks ago

Fixes #95

LarFii commented 2 weeks ago

Thanks for your contribution!