OpenPecha / rag_prep_tool

MIT License

RAG0002: Split preprocessed documents into text chunks (2) #2

Closed teny19 closed 4 weeks ago

teny19 commented 1 month ago

Description:

Take the preprocessed documents as input and split them into chunks that can be vectorized and loaded into the vector database. The chunking methods to try out are SentenceSplitter (configured with a chunk size and a chunk overlap) and SemanticSplitter. Extract the metadata from each document and map it to the chunks created from that document.

Expected Output:

Chunks with their metadata attached, each small enough to fit into the context of the LLM prompt.
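As a rough illustration of the sentence-based chunking described above, here is a minimal pure-Python sketch. It is not the LlamaIndex SentenceSplitter implementation: the `Chunk` class, `split_sentences`, and `chunk_document` are hypothetical names, sizes are counted in whitespace-separated words rather than model tokens, and the regex sentence splitter is a simplifying assumption.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def split_sentences(text: str) -> list[str]:
    # Naive assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_document(text: str, metadata: dict, chunk_size: int = 50,
                   chunk_overlap: int = 10) -> list[Chunk]:
    """Greedily pack whole sentences into chunks of at most chunk_size words.

    Trailing sentences totalling at most chunk_overlap words are repeated at
    the start of the next chunk. A single sentence longer than chunk_size
    becomes its own oversized chunk.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    groups: list[list[str]] = []
    current: list[str] = []
    current_len = 0
    for sentence in split_sentences(text):
        n = len(sentence.split())
        if current and current_len + n > chunk_size:
            groups.append(current)
            # Collect trailing sentences that fit into the overlap budget.
            kept: list[str] = []
            kept_len = 0
            for prev in reversed(current):
                w = len(prev.split())
                if kept_len + w > chunk_overlap:
                    break
                kept.insert(0, prev)
                kept_len += w
            # Drop overlap sentences if they would push the new chunk over size.
            while kept and kept_len + n > chunk_size:
                kept_len -= len(kept.pop(0).split())
            current, current_len = kept, kept_len
        current.append(sentence)
        current_len += n
    if current:
        groups.append(current)

    # Map the document-level metadata onto every chunk, plus its position.
    return [
        Chunk(text=" ".join(group), metadata={**metadata, "chunk_index": i})
        for i, group in enumerate(groups)
    ]
```

Each chunk carries a copy of the document metadata plus its own index, so a retrieved chunk can be traced back to its source document.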

LLM Prompt Context Length:

Implementation Plan

Image

Implementation tasks

teny19 commented 1 month ago
tenzin3 commented 1 month ago

Output of the sentence splitter on the book "Art of Happiness at Work" with different chunk sizes and overlaps.

The following token lengths were computed by tokenizing with the nltk package.

Image
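A comparison like this can be reproduced with a small helper that summarizes token lengths per chunk. The sketch below is only an assumption about how such numbers might be gathered: `token_length_stats` is a hypothetical name, and the default tokenizer is plain whitespace splitting so the snippet runs without extra downloads. Passing `nltk.word_tokenize` as `tokenize` should give counts closer to those reported here.

```python
from statistics import mean
from typing import Callable


def token_length_stats(
    chunks: list[str],
    tokenize: Callable[[str], list[str]] = str.split,
) -> dict:
    """Summarize the number of tokens per chunk for one chunking configuration."""
    if not chunks:
        raise ValueError("chunks must not be empty")
    lengths = [len(tokenize(chunk)) for chunk in chunks]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
    }
```

Running it once per (chunk size, chunk overlap) pair gives one row of a comparison table.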

tenzin3 commented 1 month ago

Results after retrieving the relevant chunks using LlamaIndex

embedding model = Alibaba-NLP/gte-large-en-v1.5
question generation model = voidful/context-only-question-generator
number of questions per chunk = 2
number of most similar chunks returned by retrieval = 5
book = "Art of Happiness at Work by Dalai Lama"
chunking method = sentence splitter from LlamaIndex

Image
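One common way to score a setup like this (questions generated from each chunk, then the top 5 chunks retrieved per question) is hit rate and mean reciprocal rank. The thread does not say which metrics were used, so the sketch below is an assumption, and `evaluate_retrieval` is a hypothetical helper rather than a LlamaIndex API.

```python
def evaluate_retrieval(
    results: list[tuple[str, list[str]]], top_k: int = 5
) -> dict:
    """Score retrieval runs.

    Each item in results is (expected_chunk_id, ranked_retrieved_chunk_ids),
    where the expected chunk is the one the question was generated from.
    """
    if not results:
        raise ValueError("results must not be empty")
    hits = 0
    reciprocal_rank_sum = 0.0
    for expected, retrieved in results:
        top = retrieved[:top_k]
        if expected in top:
            hits += 1
            # Rank is 1-based: a match in first position contributes 1.0.
            reciprocal_rank_sum += 1.0 / (top.index(expected) + 1)
    n = len(results)
    return {"hit_rate": hits / n, "mrr": reciprocal_rank_sum / n}
```

Hit rate shows how often the source chunk appears among the retrieved chunks at all, while MRR also rewards ranking it near the top.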