OSU-NLP-Group / HippoRAG

[NeurIPS'24] HippoRAG is a novel RAG framework inspired by human long-term memory that enables LLMs to continuously integrate knowledge across external documents. RAG + Knowledge Graphs + Personalized PageRank.
https://arxiv.org/abs/2405.14831
MIT License

setting chunk size of chunk_corpus function #56

Open doncat99 opened 1 month ago

doncat99 commented 1 month ago

def chunk_corpus(corpus: list, chunk_size: int = 64) -> list:
    """
    Chunk the corpus into smaller parts. Run the following command to download the required nltk data:
    python -c "import nltk; nltk.download('punkt')"

    @param corpus: the formatted corpus, see README.md
    @param chunk_size: the size of each chunk, i.e., the number of words in each chunk
    @return: chunked corpus, a list
    """

The default chunk_size is 64. Is that the best practice? I tried 150: the entity count was the same as with 64, but 10% more relationships were obtained.
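
For readers following along, here is a minimal sketch of what word-count chunking does at the two sizes mentioned above. It is not the repository's chunk_corpus: the name chunk_words is hypothetical, it takes a plain string rather than the formatted corpus described in README.md, and it splits on whitespace instead of using nltk tokenization.

```python
def chunk_words(text: str, chunk_size: int = 64) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size words.

    Hypothetical helper for illustration; words are whitespace-separated,
    which is an assumption and not how nltk tokenizes.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


# A synthetic 300-word passage, used only to compare chunk counts.
passage = " ".join(f"w{i}" for i in range(300))
chunks_64 = chunk_words(passage, 64)
chunks_150 = chunk_words(passage, 150)
print(len(chunks_64), len(chunks_150))  # prints "5 2"
```

A larger chunk_size yields fewer, longer chunks, so fewer passages are cut at arbitrary word boundaries.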

yhshu commented 4 days ago

The best practice depends on your scenario. I'd suggest using the default value first and checking the performance, then gradually tuning this number.
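
One way to do that tuning is a small sweep over candidate sizes. The sketch below is only an outline under stated assumptions: sweep_chunk_sizes, toy_chunk_fn, and toy_score_fn are hypothetical names, the corpus is a plain list of strings rather than the formatted corpus from README.md, and the toy metric (chunk count) is a placeholder for whatever retrieval or QA score matters in your scenario.

```python
from typing import Callable, Iterable


def sweep_chunk_sizes(
    corpus: list,
    chunk_sizes: Iterable[int],
    chunk_fn: Callable[[list, int], list],
    score_fn: Callable[[list], float],
) -> dict[int, float]:
    """Chunk the corpus at each size and record the score from score_fn."""
    results = {}
    for size in chunk_sizes:
        chunks = chunk_fn(corpus, size)
        results[size] = score_fn(chunks)
    return results


# Stand-ins for illustration only: swap in the real chunker and a real metric.
def toy_chunk_fn(corpus: list, size: int) -> list:
    words = " ".join(corpus).split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def toy_score_fn(chunks: list) -> float:
    # Placeholder metric: the number of chunks produced.
    return float(len(chunks))


toy_corpus = ["alpha beta gamma delta"] * 50  # 200 words in total
scores = sweep_chunk_sizes(toy_corpus, [64, 100, 150], toy_chunk_fn, toy_score_fn)
print(scores)
```

Starting from the default of 64 and adding neighbouring values to the list keeps each run comparable, since only the chunk size changes between them.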