MultiVectorRetriever does not obey self.chunk_size

The process works like this:

Separate inputs into pages to make parent docs
For each page, create a summary/qa list
Create a list of child_docs from these summaries
The vector database stores these shortened docs and maps each one back to its parent document
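
To make the flow concrete, here is a minimal sketch of the steps above. It is not the actual Augmenta or LangChain code: `build_index`, `retrieve_parents`, the `doc_id` key, and the `summarize` callable are hypothetical names, and plain dicts stand in for the vector database and the document store.

```python
import uuid


def build_index(pages, summarize):
    """Build child docs (summaries) that point back to full-page parent docs.

    `pages` is a list of page texts; `summarize` is any callable that
    shortens a page (hypothetical stand-in for the summary/QA step).
    """
    docstore = {}    # parent_id -> full page text (the parent doc)
    child_docs = []  # shortened docs that would be embedded and searched
    for page_text in pages:
        parent_id = str(uuid.uuid4())
        docstore[parent_id] = page_text
        child_docs.append(
            {"page_content": summarize(page_text), "doc_id": parent_id}
        )
    return child_docs, docstore


def retrieve_parents(matching_children, docstore):
    """Resolve matched child docs to their parents, whatever their size."""
    return [docstore[child["doc_id"]] for child in matching_children]
```

Note that nothing in this flow looks at the parent's length: a short summary can resolve to a parent page of any size.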

The parent docs are split to follow page boundaries, as for a PDF, rather than self.chunk_size. Without better checks on the size of doc.page_content, a long page can exceed the context limit or cause unexpected behavior. Since the parent document excerpts are what gets passed to the LLM at the very end, we need to calculate their character count and respect the token limits.
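
One possible check, as a rough sketch only: split any parent doc whose page_content is longer than chunk_size before it is stored. The `Doc` class and `enforce_chunk_size` function are hypothetical stand-ins, not existing code in the repo, and the limit is counted in characters, so a token-based limit would need a tokenizer instead of `len`.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a document with page_content and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def enforce_chunk_size(parent_docs, chunk_size):
    """Return parent docs where no page_content exceeds chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    result = []
    for doc in parent_docs:
        text = doc.page_content
        if len(text) <= chunk_size:
            # Already within the limit, keep the page as a single parent doc.
            result.append(doc)
            continue
        # Slice the oversized page into consecutive pieces of at most
        # chunk_size characters, recording each piece's position.
        for part, start in enumerate(range(0, len(text), chunk_size)):
            piece = text[start:start + chunk_size]
            result.append(Doc(piece, {**doc.metadata, "part": part}))
    return result
```

A fixed-width slice can cut through the middle of a sentence. A splitter that respects separators would be a better fit in practice, but the size check itself stays the same.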

TanGentleman / Augmenta

MultiVectorRetriever does not obey self.chunk_size #10