HKUDS / LightRAG

"LightRAG: Simple and Fast Retrieval-Augmented Generation"
https://arxiv.org/abs/2410.05779
MIT License

A more robust approach for result to json. #302

Closed. Luobots closed this issue 2 days ago.

Luobots commented 3 days ago

When I use LightRAG, my model generates the text below for keyword extraction. It contains two "{" characters, so "{" + result.split("{")[1].split("}")[0] + "}" fails, but "{" + result.split("{")[-1].split("}")[0] + "}" works, and the original expectation is still met.

Keyword Extraction

To extract high-level and low-level keywords from the given query, we will use Natural Language Processing (NLP) techniques.

import json
import re

def extract_keywords(query):
    # Convert query to lowercase
    query = query.lower()

    # Tokenize the query
    tokens = re.findall(r'\b\w+\b', query)

    # Identify high-level keywords
    high_level_keywords = []
    low_level_keywords = []
    stop_words = ['the', 'and', 'a', 'an', 'in', 'on', 'at', 'by', 'with']

    for token in tokens:
        if token not in stop_words:
            if len(token.split()) > 1:
                high_level_keywords.append(token)
            else:
                low_level_keywords.append(token)

    # Remove duplicates from high-level and low-level keywords
    high_level_keywords = list(set(high_level_keywords))
    low_level_keywords = list(set(low_level_keywords))

    # Return the keywords in JSON format
    return {
        "high_level_keywords": high_level_keywords,
        "low_level_keywords": low_level_keywords
    }

query = "How did urbanization influence the average household size in Bhubaneswar?"
result = extract_keywords(query)

print(json.dumps(result, indent=4))

Output:

{
    "high_level_keywords": ["Urbanization", "Average household size"],
    "low_level_keywords": ["Influence", "Bhubaneswar"]
}

This script first tokenizes the query into individual words and then identifies high-level and low-level keywords. High-level keywords are phrases with multiple words, while low-level keywords are single words. The stop_words list is used to exclude common words like "the", "and", etc. that do not add much value to the query. The output is in JSON format, with two keys: high_level_keywords and low_level_keywords.
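To reproduce the difference between the two expressions from the start of this comment, here is a minimal sketch. The `response` string is a shortened, hypothetical stand-in for the model output above, and `extract_braced` and `try_parse` are illustrative helpers, not LightRAG functions.

```python
import json

# Hypothetical, shortened model response with two "{" characters:
# one inside generated code, one opening the actual JSON answer.
response = '''def extract_keywords(query):
    return {
        "high_level_keywords": high_level_keywords,
        "low_level_keywords": low_level_keywords
    }

Output:

{
    "high_level_keywords": ["Urbanization", "Average household size"],
    "low_level_keywords": ["Influence", "Bhubaneswar"]
}
'''


def extract_braced(text: str, index: int) -> str:
    """Rebuild a "{...}" candidate from the segment after the chosen "{"."""
    return "{" + text.split("{")[index].split("}")[0] + "}"


def try_parse(candidate: str):
    """Return the parsed object, or None if the candidate is not valid JSON."""
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None


# Index 1 selects the dict literal from the generated code, which is not JSON.
first = try_parse(extract_braced(response, 1))
# Index -1 selects the final braced block, which is the real JSON answer.
last = try_parse(extract_braced(response, -1))

print("split('{')[1]  ->", first)
print("split('{')[-1] ->", last)
```

Taking the last "{" helps whenever the JSON answer comes at the end of the response. It would still fail if the answer contained nested objects or if more braces followed it, since splitting on the first "}" cuts the text short in those cases.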

LarFii commented 2 days ago

Thanks. However, this keyword extraction approach can only extract words that are present in the query. It becomes challenging to extract keywords that represent concepts or ideas needed to answer the query but are not explicitly mentioned in it.

Luobots commented 2 days ago

> Thanks. However, this keyword extraction approach can only extract words that are present in the query. It becomes challenging to extract keywords that represent concepts or ideas needed to answer the query but are not explicitly mentioned in it.

Oh, this "approach" is just text generated by my LLM when I used it to extract keywords (See Here in Your Code) 😂. I just want to emphasize that the "{" * 2 situation (See Here in Your Code) can be solved by my PR.