Status: Closed (teny19 closed this issue 5 months ago)
The following are the token lengths, as tokenized with the NLTK package.
embedding model = Alibaba-NLP/gte-large-en-v1.5
question generation model = voidful/context-only-question-generator
no. of questions per chunk = 2
no. of chunks retrieved by similarity = 5
book = "The Art of Happiness at Work" by the Dalai Lama
chunking method = SentenceSplitter from LlamaIndex
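The retrieval setting above (top 5 chunks by similarity) can be sketched as cosine similarity between a query embedding and the chunk embeddings. This is a plain-Python illustration only; the actual embeddings would come from the gte-large-en-v1.5 model and live in the vector database, and the function names here are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec, chunk_vecs, k=5):
    """Return the indices of the k chunks most similar to the query."""
    scores = [(cosine(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scores.sort(reverse=True)
    return [i for _, i in scores[:k]]
```

In the pipeline, the k=5 setting above corresponds to "no. of chunks retrieved by similarity = 5"; a real vector database would perform this ranking internally.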
Description:
Take the preprocessed documents as input and create chunks to vectorize and populate into the vector database. Chunking methods to try are SentenceSplitter (with a defined chunk size and chunk overlap) and SemanticSplitter. Extract the metadata and map it to the chunks.
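The chunking and metadata-mapping step above can be approximated in plain Python. This is a simplified stand-in for LlamaIndex's SentenceSplitter, not its actual implementation: it packs whole sentences into chunks up to a character budget and carries a trailing overlap into the next chunk. The sentence regex, the character-based (rather than token-based) sizing, and the `attach_metadata` helper are all assumptions made for illustration.

```python
import re

def sentence_split(text, chunk_size=512, chunk_overlap=64):
    """Greedy sentence-based chunking: pack whole sentences into
    chunks of at most chunk_size characters, carrying the last
    chunk_overlap characters as context into the next chunk."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + 1 + len(sent) > chunk_size:
            chunks.append(current)
            # seed the next chunk with the tail of the previous one
            current = current[-chunk_overlap:] + " " + sent
        else:
            current = (current + " " + sent).strip()
    if current:
        chunks.append(current)
    return chunks

def attach_metadata(chunks, metadata):
    """Map shared document metadata onto every chunk (hypothetical helper)."""
    return [{"text": c, "metadata": dict(metadata)} for c in chunks]
```

Each resulting record (chunk text plus metadata such as book title or chapter) would then be embedded and inserted into the vector database; the real SentenceSplitter additionally sizes chunks by tokens and propagates document metadata onto nodes automatically.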
Expected Output:
Chunks with metadata that are of a reasonable size, so that they fit within the context of the LLM prompt.
LLM Prompt Context Length:
Implementation Plan
Implementation tasks