AnswerDotAI / RAGatouille

Easily use and train state of the art late-interaction retrieval methods (ColBERT) in any RAG pipeline. Designed for modularity and ease-of-use, backed by research.
Apache License 2.0
3.08k stars 210 forks

Idea: Make CorpusProcessor (and splitter_fn / preprocessing_fn) to have access to metadata #217

Open stronk7 opened 5 months ago

stronk7 commented 5 months ago

Please correct me if I'm wrong, I'm a newbie here!

I've seen that, by default, we use the llamaindex SentenceSplitter here, and that we can switch to any of the other splitters available there. One interesting thing about almost all of those splitters/readers/node parsers is that, apart from producing the chunks using different techniques, they also come with some interesting metadata, and things like:

But it seems (again, I could be wrong) that we are simply discarding that information, which could sometimes be really nice to have (to improve indexing, retrieving, filtering...).

So maybe it wouldn't be crazy to make the CorpusProcessor aware of the document_metadatas, and then allow both the splitter_fn and the preprocessing_fn to receive and modify that information if desired.
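A minimal sketch of what that could look like. Everything here is hypothetical: `MetadataAwareCorpusProcessor`, the chunk dictionary shape and `paragraph_splitter` are made-up names for illustration, not RAGatouille's actual CorpusProcessor or its real signatures.

```python
from typing import Any, Callable, Dict, List, Optional

# Hypothetical chunk shape: {"content": str, "metadata": dict}
Chunk = Dict[str, Any]
SplitterFn = Callable[[str, Dict[str, Any]], List[Chunk]]
PreprocessingFn = Callable[[List[Chunk]], List[Chunk]]


class MetadataAwareCorpusProcessor:
    """Illustrative only: threads metadata through splitting and preprocessing."""

    def __init__(
        self,
        splitter_fn: SplitterFn,
        preprocessing_fn: Optional[PreprocessingFn] = None,
    ) -> None:
        self.splitter_fn = splitter_fn
        self.preprocessing_fn = preprocessing_fn

    def process(
        self,
        documents: List[str],
        document_ids: List[str],
        document_metadatas: Optional[List[Dict[str, Any]]] = None,
    ) -> List[Dict[str, Any]]:
        metadatas = document_metadatas or [{} for _ in documents]
        processed = []
        for doc, doc_id, meta in zip(documents, document_ids, metadatas):
            # The splitter receives a copy of the document metadata and may extend it.
            chunks = self.splitter_fn(doc, dict(meta))
            # The preprocessing step sees content and metadata together.
            if self.preprocessing_fn is not None:
                chunks = self.preprocessing_fn(chunks)
            for chunk in chunks:
                processed.append(
                    {
                        "document_id": doc_id,
                        "content": chunk["content"],
                        "metadata": chunk["metadata"],
                    }
                )
        return processed


def paragraph_splitter(text: str, doc_metadata: Dict[str, Any]) -> List[Chunk]:
    """Toy splitter: one chunk per paragraph, recording its position."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [
        {
            "content": p,
            "metadata": {**doc_metadata, "chunk_index": i, "chunk_count": len(paragraphs)},
        }
        for i, p in enumerate(paragraphs)
    ]


processor = MetadataAwareCorpusProcessor(splitter_fn=paragraph_splitter)
result = processor.process(
    documents=["First paragraph.\n\nSecond paragraph."],
    document_ids=["doc-1"],
    document_metadatas=[{"source": "notes.md"}],
)
```

The point of the design is only that both callables get the metadata as an argument and return it alongside the content, so nothing the splitter computed is thrown away.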

Real case example:

  1. I have a good collection of markdown files.
  2. I have been doing some experiments with them (mainly measuring indexing times, using different ColBERT models... but that's another story, heh, some day I will share some results, they behave really differently from my early/basic tests...)
  3. When testing retrieval and results, I've observed that, depending on the chunking, some obvious passages that are continuations of other highly ranked ones aren't detected at all. For example (very simplified), I have this markdown document:
    
    ...
    ## RAGAtouille is great
    ### Why is it so great.
    Ragatouille is great by a number of reason, and here there are a few:
    1. It makes the Colbert (v2) easily playable.
    2. It's simple and don't require much expertise to be able to use it.
    3. Allows to compare different models with ease.
    4. It really works.
    5. Very active and vibrant community.
    ...
    ...
    
    And, by coincidence, the split happens around point 3. So we have passage 1 with the headers, the intro text and points 1-2-3, and then passage 2 with points 3 (overlap)-4-5 and the next content in the chunk.

Later, when you ask the index "Why is RAGAtouille so cool?", the very first passage returned is passage 1, but passage 2 is never returned (or falls far down the ranking): the index doesn't know how to connect passage 1 and passage 2.

But if, instead, we had the metadata calculated by the splitters (prev/next, headers...), then in the preprocessing function we could decide what to do with it: add some metadata (say, the headers) to the document content to be indexed, try to join passages and split them using some other technique, whatever!
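For instance, a preprocessing function that prepends the headers to each passage could look roughly like this. It is a sketch under assumptions: the `headers`, `prev` and `next` metadata keys and the chunk dictionary shape are hypothetical, chosen to mirror the example above, and are not what any particular splitter actually emits.

```python
from typing import Any, Dict, List


def prepend_headers(chunks: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return new chunks whose content starts with the header trail, if any."""
    enriched = []
    for chunk in chunks:
        headers = chunk["metadata"].get("headers", [])
        prefix = " > ".join(headers)
        content = f"{prefix}\n{chunk['content']}" if prefix else chunk["content"]
        # Build a new dict so the input chunks are left untouched.
        enriched.append({"content": content, "metadata": chunk["metadata"]})
    return enriched


headers = ["RAGAtouille is great", "Why is it so great."]
chunks = [
    {
        "content": "Ragatouille is great by a number of reason... 1. ... 2. ... 3. ...",
        "metadata": {"headers": headers, "prev": None, "next": 1},
    },
    {
        # Without the headers, this passage never mentions what it is about.
        "content": "3. Allows to compare different models with ease. 4. It really works.",
        "metadata": {"headers": headers, "prev": 0, "next": None},
    },
]

enriched = prepend_headers(chunks)
```

With the header trail in front, passage 2 carries the "why is it great" context itself, which is the connection that was missing in the example.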

But to be able to do that, we really need the CorpusProcessor and the 2 customisable functions to be "metadata-aware". Right now they aren't, and the only alternative is to do the chunking and pre-processing outside of (before) the RAGatouille code, and then just pass all the processed information to it.
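That workaround could be sketched as below: a small header-based markdown splitter in plain Python that builds the passages, ids and metadata up front. The splitter, the `build_index_inputs` helper, the id format and the metadata keys are all hypothetical, and it is much cruder than a real node parser.

```python
import re
from typing import Any, Dict, List, Tuple

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown_by_header(text: str) -> List[Tuple[str, List[str]]]:
    """Return (section_text, header_trail) pairs, one per non-empty section."""
    sections: List[Tuple[str, List[str]]] = []
    path: List[Tuple[int, str]] = []  # (level, title) of the enclosing headers
    body: List[str] = []

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            sections.append((content, [title for _, title in path]))
        body.clear()

    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()  # close the section that belongs to the previous header trail
            level = len(match.group(1))
            while path and path[-1][0] >= level:
                path.pop()  # leave sibling or deeper headers behind
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return sections


def build_index_inputs(
    doc_id: str, text: str
) -> Tuple[List[str], List[str], List[Dict[str, Any]]]:
    """Build parallel lists of passages, passage ids and per-passage metadata."""
    sections = split_markdown_by_header(text)
    collection, ids, metadatas = [], [], []
    for i, (content, headers) in enumerate(sections):
        collection.append(content)
        ids.append(f"{doc_id}#{i}")
        metadatas.append(
            {
                "source": doc_id,
                "headers": headers,
                "prev": f"{doc_id}#{i - 1}" if i > 0 else None,
                "next": f"{doc_id}#{i + 1}" if i + 1 < len(sections) else None,
            }
        )
    return collection, ids, metadatas


SAMPLE = """
## RAGAtouille is great
### Why is it so great.
Ragatouille is great by a number of reason, and here there are a few:
1. It makes the Colbert (v2) easily playable.
## Next topic
Other text.
"""

collection, ids, metadatas = build_index_inputs("guide.md", SAMPLE)
```

The three lists are meant to be handed to the indexing call with splitting turned off. I am assuming here that your RAGatouille version's `index()` accepts `collection`, `document_ids`, `document_metadatas` and `split_documents=False`, so please check that against the version you have installed.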

Heh, not sure if I'm explaining my idea clearly enough, I hope so. Basically, if we are going to benefit from reusing the llamaindex splitters (I'm looking forward to using the MarkdownNodeParser, which is 99% perfect for my contents), let's allow RAGatouille to get more out of them. Or, if anybody wants to create their very own splitters/post-processors (different from the llamaindex ones), let's give those access to the metadata too.

Many thanks for this tool, it has allowed me to try a lot of things with ColBERT, an approach I have wanted to experiment with for a long time. I love those "ultra-dense" (BoW) vectors and all the ideas behind them for better retrieval.

Ciao :-)