setting chunk size of chunk_corpus function

OSU-NLP-Group / HippoRAG

[NeurIPS'24] HippoRAG is a novel RAG framework inspired by human long-term memory that enables LLMs to continuously integrate knowledge across external documents. RAG + Knowledge Graphs + Personalized PageRank.

MIT License


def chunk_corpus(corpus: list, chunk_size: int = 64) -> list:
    """
    Chunk the corpus into smaller parts. Run the following command to download the required nltk data:
    python -c "import nltk; nltk.download('punkt')"

    @param corpus: the formatted corpus, see README.md
    @param chunk_size: the size of each chunk, i.e., the number of words in each chunk
    @return: chunked corpus, a list
    """

The default chunk_size is 64. Is that the best practice? I tried 150: the entity count was the same as with 64, but 10% more relationships were extracted.
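
To make the comparison concrete, here is a minimal sketch of word-count chunking. It is not the repository's chunk_corpus: chunk_words is a hypothetical helper that splits on whitespace instead of using nltk, and the 300-word passage is made up. It only shows how the number and length of chunks change between 64 and 150.

```python
def chunk_words(text: str, chunk_size: int = 64) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size words.

    Hypothetical helper for illustration: words are whitespace-separated
    tokens, which is an assumption and not the nltk-based tokenization
    mentioned in the chunk_corpus docstring.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


# Made-up 300-word passage, used only to compare the two settings.
passage = " ".join(f"w{i}" for i in range(300))

chunks_64 = chunk_words(passage, chunk_size=64)
chunks_150 = chunk_words(passage, chunk_size=150)

print(len(chunks_64))   # 5 chunks: four of 64 words, one of 44
print(len(chunks_150))  # 2 chunks of 150 words each
```

With fewer, longer chunks, entities mentioned several sentences apart are more likely to fall into the same chunk. That might explain the extra relationships at 150, but it is a guess and has not been verified.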


setting chunk size of chunk_corpus function #56