Future-House / paper-qa

High accuracy RAG for answering questions from scientific documents with citations
Apache License 2.0

Chunking strategy and answer relevance vs. Docs() object size? #605

Open Snikch63200 opened 2 weeks ago

Snikch63200 commented 2 weeks ago

Hello,

I'm looking for an optimal chunking strategy to get relevant answers for my queries.

I tried the parameters provided in the "high quality config":

It gives poor results...

This config seems to give better results:

The problem is that I work on a large set of documents (more than 1,500), some of them huge (more than 300 pages). A small chunk size with the recommended 10-20% overlap produces a huge Docs() object...
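The growth described above can be sketched with some back-of-envelope arithmetic (the characters-per-page figure is an assumption, not a paper-qa constant): with overlap, each chunk only advances by `chunk_size - overlap` characters, so small chunks with a fixed overlap multiply the number of stored chunks.

```python
# Hypothetical chunk-count estimate (not paper-qa's actual code).
import math

def estimated_chunks(total_chars: int, chunk_size: int, overlap: int) -> int:
    """Approximate chunk count for one pass over total_chars characters."""
    stride = chunk_size - overlap  # effective advance per chunk
    return max(1, math.ceil((total_chars - overlap) / stride))

# A ~300-page document at roughly 3,000 characters per page (assumed figure):
chars = 300 * 3000
small = estimated_chunks(chars, chunk_size=1500, overlap=250)  # small chunks
large = estimated_chunks(chars, chunk_size=7000, overlap=250)  # large chunks
print(small, large)  # small chunks yield several times more stored text
```

At the same 250-character overlap, shrinking the chunk size from 7000 to 1500 multiplies the chunk count (and hence the Docs() object size) by roughly 5x per document, before multiplying again by the 1,500 documents in the corpus.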

Are smaller chunk sizes better than larger ones in terms of relevance?

Is there a way to avoid creating huge objects?

Best regards.

dosubot[bot] commented 2 weeks ago

Smaller chunk sizes can indeed improve the relevance of answers by providing more focused context, but they can also lead to larger Docs() objects, especially with a large number of documents. The optimal chunking strategy in the "high quality config" suggests a chunk size of 7000 with an overlap of 250, which balances context size and continuity between chunks [1].
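A minimal sketch of how those values might be set, assuming the `Settings`/`ParsingSettings` objects exposed by recent paper-qa versions (check your installed version's settings module for the exact field names):

```python
# Hedged config fragment -- field names assumed from paper-qa's Settings API.
from paperqa import Settings
from paperqa.settings import ParsingSettings

settings = Settings(
    parsing=ParsingSettings(
        chunk_size=7000,  # larger chunks: more context per chunk, fewer chunks stored
        overlap=250,      # modest overlap preserves continuity across chunk boundaries
    ),
)
```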

To manage the size of Docs() objects while maintaining relevance, consider parallelizing the processing of documents. The gather_with_concurrency function in paper-qa allows for concurrent processing, which can help manage memory usage and improve performance [2][3]. Adjusting the max_concurrent_requests setting can also help optimize resource usage based on your system's capabilities.
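The idea behind concurrency-limited gathering can be sketched in pure asyncio with a semaphore; this is a simplified stand-in, not paper-qa's actual `gather_with_concurrency` implementation:

```python
# Sketch of semaphore-bounded gathering (illustrative, not paper-qa's code).
import asyncio

async def gather_with_concurrency(n: int, *coros):
    """Run coroutines concurrently, but at most n at a time."""
    semaphore = asyncio.Semaphore(n)

    async def limited(coro):
        async with semaphore:  # waits while n tasks are already in flight
            return await coro

    return await asyncio.gather(*(limited(c) for c in coros))

async def main():
    async def process_document(i: int) -> int:
        await asyncio.sleep(0.01)  # stand-in for parsing/embedding one document
        return i * i

    results = await gather_with_concurrency(
        4, *(process_document(i) for i in range(10))
    )
    print(results)

asyncio.run(main())
```

Bounding concurrency this way caps how many documents are parsed and embedded at once, which is what keeps peak memory usage manageable even over a 1,500-document corpus.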

To continue talking to Dosu, mention @dosu.