RAG0002: Split preprocessed documents into text chunks (2)

teny19 commented 6 months ago

Description:

Take the preprocessed documents as input and split them into chunks that can be vectorized and loaded into the vector database. Chunking methods to try are SentenceSplitter (with a defined chunk size and chunk overlap) and SemanticSplitter. Extract the document metadata and map it to the chunks.
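To make the chunk size and chunk overlap parameters concrete, here is a minimal sketch of sentence-based chunking in plain Python. It is a simplified stand-in, not LlamaIndex's SentenceSplitter: the regex sentence split, the whitespace token count, and the names `split_sentences` and `chunk_sentences` are all assumptions made for illustration.

```python
import re


def split_sentences(text):
    """Naive sentence split: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_sentences(text, chunk_size=50, chunk_overlap=10):
    """Greedily pack whole sentences into chunks of at most chunk_size tokens.

    Tokens are whitespace-separated words (an assumption). Each new chunk
    starts with trailing sentences of the previous chunk, up to chunk_overlap
    tokens. A single sentence longer than chunk_size is kept whole.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    chunks, current, current_len = [], [], 0
    for sentence in split_sentences(text):
        n = len(sentence.split())
        if current and current_len + n > chunk_size:
            chunks.append(" ".join(current))
            # Carry trailing sentences over, within the overlap budget.
            overlap, overlap_len = [], 0
            for prev in reversed(current):
                m = len(prev.split())
                if overlap_len + m > chunk_overlap:
                    break
                overlap.insert(0, prev)
                overlap_len += m
            # Drop overlap that would push the new chunk past chunk_size.
            while overlap and overlap_len + n > chunk_size:
                overlap_len -= len(overlap.pop(0).split())
            current, current_len = overlap, overlap_len
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because sentences are never cut in the middle, the actual overlap can be smaller than `chunk_overlap` when the previous chunk ends with a long sentence.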

Expected Output:

Chunks with metadata, sized so that the retrieved chunks fit into the context of the LLM prompt.

LLM Prompt Context Length:

Phi-3-mini: 4k or 128k tokens (depending on the variant)
Llama3-8B: 8k tokens
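As a rough way to turn these context lengths into an upper bound on chunk size, the sketch below divides the tokens left after the prompt template and the answer among the retrieved chunks. The exact window sizes (taking "k" as 1024 tokens), the overhead and answer budgets, and the function name `max_chunk_tokens` are assumptions; the top-k of 5 matches the retrieval setting reported later in this thread.

```python
# Context windows in tokens, assuming "4k" = 4096, "8k" = 8192, "128k" = 131072.
CONTEXT_WINDOWS = {
    "Phi-3-mini-4k": 4096,
    "Phi-3-mini-128k": 131072,
    "Llama3-8B": 8192,
}


def max_chunk_tokens(context_window, top_k, prompt_overhead=500, answer_budget=500):
    """Largest chunk size (in tokens) such that top_k chunks still fit.

    prompt_overhead covers the instructions and the user question;
    answer_budget reserves room for the generated answer. Both defaults
    are illustrative guesses, not measured values.
    """
    available = context_window - prompt_overhead - answer_budget
    if available <= 0:
        raise ValueError("context window too small for the given budgets")
    return available // top_k


if __name__ == "__main__":
    for model, window in CONTEXT_WINDOWS.items():
        print(model, max_chunk_tokens(window, top_k=5))
```

Under these assumptions the 4k model leaves roughly 600 tokens per chunk, so the smallest context window is the one that constrains the chunk size.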

Implementation Plan

Implementation tasks

[x] examine with SemanticSplitter
[x] examine with SentenceSplitter
[x] evaluate SentenceSplitter
[x] script to map sentence chunks to metadata
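The metadata mapping task could look roughly like the sketch below, which copies document-level metadata onto every chunk and adds the chunk's position. The `Chunk` dataclass, the `attach_metadata` function, and the metadata keys are hypothetical; the actual script and metadata schema are not shown in this issue.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """A text chunk together with the metadata it inherits from its document."""
    text: str
    metadata: dict = field(default_factory=dict)


def attach_metadata(chunk_texts, doc_metadata):
    """Pair each chunk text with a copy of the document metadata.

    Each chunk gets its own dict, so editing one chunk's metadata later
    does not affect the others. chunk_index and num_chunks are added so a
    chunk can be traced back to its position in the source document.
    """
    total = len(chunk_texts)
    chunks = []
    for index, text in enumerate(chunk_texts):
        metadata = dict(doc_metadata)
        metadata["chunk_index"] = index
        metadata["num_chunks"] = total
        chunks.append(Chunk(text=text, metadata=metadata))
    return chunks


if __name__ == "__main__":
    # Hypothetical metadata keys, for illustration only.
    doc_metadata = {"title": "Art of Happiness at Work", "author": "Dalai Lama"}
    for chunk in attach_metadata(["first chunk", "second chunk"], doc_metadata):
        print(chunk.metadata["chunk_index"], chunk.text)
```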

teny19 commented 6 months ago

Will use the sentence splitter and the semantic chunker for now.
With the semantic chunker, it is difficult to estimate the chunk size, and a few cases have been identified where one context (e.g. an example story) was split into two chunks.
Will optimize the approach at a later stage as more insights are gathered.
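A toy sketch may help show why chunk size is hard to predict with semantic chunking: a new chunk starts wherever the similarity between adjacent sentences drops below a threshold, so chunk length depends on the content rather than on a size parameter. Bag-of-words cosine similarity is used here as a stand-in for a real embedding model, and the function names and threshold are assumptions; this is not how the semantic chunker used in this issue is implemented.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence):
    """Lowercased word counts, standing in for an embedding vector."""
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def semantic_chunks(sentences, threshold=0.2):
    """Start a new chunk whenever adjacent sentences are less similar than threshold."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(bag_of_words(prev), bag_of_words(cur)) < threshold:
            groups.append([cur])
        else:
            groups[-1].append(cur)
    return [" ".join(group) for group in groups]
```

A story whose middle sentences share little vocabulary with their neighbours would be cut in two by this rule, which is the kind of split reported above.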

tenzin3 commented 6 months ago

Output of the sentence splitter on the book "Art of Happiness at Work" with different chunk sizes and overlaps

Token lengths are counted with the nltk package.
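A minimal sketch of how per-chunk token lengths could be summarised when comparing chunk sizes and overlaps. The regex tokenizer is an approximation used so the example has no dependencies; it is not nltk's tokenizer, and the function names are assumptions.

```python
import re
import statistics


def count_tokens(text):
    """Approximate token count: words plus individual punctuation marks.

    A stand-in for an nltk tokenizer; counts will differ slightly from nltk.
    """
    return len(re.findall(r"\w+|[^\w\s]", text))


def length_summary(chunks):
    """Summary statistics of token lengths over a non-empty list of chunks."""
    lengths = [count_tokens(chunk) for chunk in chunks]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
    }
```

Comparing the `max` value against the per-chunk budget of the target model is a quick check that a given chunk size setting stays within the prompt context.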

tenzin3 commented 6 months ago

Results after retrieving the relevant chunks using LlamaIndex

embedding model = Alibaba-NLP/gte-large-en-v1.5
question generation model = voidful/context-only-question-generator
number of questions per chunk = 2
number of similar chunks returned by retrieval = 5
book = "Art of Happiness at Work" by Dalai Lama
chunking method = sentence splitter from LlamaIndex
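With questions generated from each chunk, one common way to score retrieval is to check whether the chunk a question came from appears among the top-k retrieved chunks. The thread does not state which metric was used, so the hit rate and mean reciprocal rank below are assumptions, as are the function names and the dictionary layout.

```python
def hit_rate_at_k(retrieved, expected, k=5):
    """Fraction of questions whose source chunk is among the top-k results.

    retrieved: dict mapping question -> ranked list of chunk ids
    expected:  dict mapping question -> id of the chunk it was generated from
    """
    if not expected:
        raise ValueError("expected must contain at least one question")
    hits = sum(
        1
        for question, source_id in expected.items()
        if source_id in retrieved.get(question, [])[:k]
    )
    return hits / len(expected)


def mean_reciprocal_rank(retrieved, expected, k=5):
    """Average of 1/rank of the source chunk within the top-k, 0 if absent."""
    if not expected:
        raise ValueError("expected must contain at least one question")
    total = 0.0
    for question, source_id in expected.items():
        ranked = retrieved.get(question, [])[:k]
        if source_id in ranked:
            total += 1.0 / (ranked.index(source_id) + 1)
    return total / len(expected)
```

Hit rate only records whether the source chunk was found, while mean reciprocal rank also rewards finding it near the top of the list.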

OpenPecha / rag_prep_tool