Chunking strategy and answer relevance vs. Docs() object size ?

Future-House / paper-qa

High accuracy RAG for answering questions from scientific documents with citations

Apache License 2.0

6.33k stars 599 forks source link

Smaller chunk sizes can indeed improve the relevance of answers by providing more focused context, but they can also lead to larger Docs() objects, especially with a large number of documents. The optimal chunking strategy in the "high quality config" suggests a chunk size of 7000 with an overlap of 250, which balances context size and continuity between chunks ^[1].

To manage the size of Docs() objects while maintaining relevance, consider parallelizing the processing of documents. The gather_with_concurrency function in paper-qa allows for concurrent processing, which can help manage memory usage and improve performance ^[2]^[3]. Adjusting the max_concurrent_requests setting can also help optimize resource usage based on your system's capabilities.

To continue talking to Dosu, mention @dosu.

Future-House / paper-qa

Chunking strategy and answer relevance vs. Docs() object size ? #605