Open Snikch63200 opened 2 weeks ago
Smaller chunk sizes can indeed improve the relevance of answers by providing more focused context, but they also multiply the number of chunks to embed and store, which can lead to larger `Docs()` objects, especially with a large number of documents. The chunking strategy in the "high quality config" suggests a chunk size of 7000 with an overlap of 250, which balances context size and continuity between chunks [1].
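To see how the two parameters interact, here is a minimal, generic sketch of fixed-size chunking with overlap (an illustration only, not paper-qa's actual splitter): each chunk repeats the tail of the previous one, and a smaller chunk size produces proportionally more chunks to embed.

```python
def chunk_text(text: str, chunk_size: int = 7000, overlap: int = 250) -> list[str]:
    """Split `text` into chunks of at most `chunk_size` characters.

    Each chunk starts `chunk_size - overlap` characters after the previous
    one, so consecutive chunks share `overlap` characters of context.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

text = "x" * 20_000
large = chunk_text(text, chunk_size=7000, overlap=250)
small = chunk_text(text, chunk_size=1000, overlap=100)
# Smaller chunks -> many more entries to embed and keep in the index.
print(len(large), len(small))
```

This is why dropping the chunk size on 1500+ documents inflates the index: the number of stored chunks (and their embeddings) grows roughly in proportion to `1 / (chunk_size - overlap)`.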
To manage the size of `Docs()` objects while maintaining relevance, consider parallelizing document processing. The `gather_with_concurrency` function in paper-qa allows concurrent processing, which can help manage memory usage and improve performance [2][3]. Adjusting the `max_concurrent_requests` setting can also help optimize resource usage based on your system's capabilities.
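The underlying idea is the standard semaphore pattern for bounding concurrency in asyncio. Below is a minimal sketch of that pattern; paper-qa's own `gather_with_concurrency` helper may differ in signature and details, and `process_doc` here is just a hypothetical stand-in for per-document work:

```python
import asyncio

async def gather_with_concurrency(n: int, coros):
    """Await all `coros` concurrently, but run at most `n` at a time.

    A semaphore gates entry: each coroutine acquires a slot before
    running and releases it when done, capping peak resource usage.
    """
    semaphore = asyncio.Semaphore(n)

    async def sem_coro(coro):
        async with semaphore:
            return await coro

    # gather() preserves the input order of results.
    return await asyncio.gather(*(sem_coro(c) for c in coros))

async def process_doc(i: int) -> int:
    # Hypothetical stand-in for parsing/embedding one document.
    await asyncio.sleep(0)
    return i * 2

results = asyncio.run(
    gather_with_concurrency(4, [process_doc(i) for i in range(10)])
)
```

Capping concurrency this way bounds how many documents are being parsed and embedded at once, which is usually the practical lever for memory pressure on a 1500-document corpus.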
Hello,
I'm looking for an optimal chunking strategy to get relevant answers for my queries.
I tried these parameters provided in the "high quality config":
It gives poor results...
This config seems to give better results:
The problem is that I work on a large set of documents (more than 1500), sometimes with huge documents (more than 300 pages). A small chunk size with the recommended 10-20% overlap produces a huge Docs() object...
Are smaller chunk sizes better than bigger ones in terms of relevance?
Is there a way to avoid creating huge objects?
Best regards.