AnswerDotAI / RAGatouille

Easily use and train state of the art late-interaction retrieval methods (ColBERT) in any RAG pipeline. Designed for modularity and ease-of-use, backed by research.
Apache License 2.0
3.01k stars 203 forks

How to use highly capable Decoder only models (LLMs) with RAGatouille -- is it even advisable? #231

Closed brando90 closed 2 months ago

brando90 commented 3 months ago

I was curious -- given the highly capable open models like the llama3 family, and the need for specialized models for mathematics (e.g., DeepSeekMath, or formal mathematics like Lean4, Coq, and Isabelle, which likely need fine-tuning to produce good embeddings in the first place) -- how can I use decoder models?

And actually, is it even advisable?

Reference:

bclavie commented 2 months ago

Hey! This currently hasn't been done, but there's technically nothing stopping you from using a repurposed decoder as the backbone for a ColBERT model. The important thing would be to train it so that it can produce ColBERT-style representations, but it should work just as well for ColBERT as (or better than!) it does for dense embeddings.
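For context on what "ColBERT-style representations" means here: instead of collapsing a text into one dense vector, the model keeps one embedding per token, and relevance is scored by late interaction (MaxSim) -- each query token is matched against its most similar document token and the maxima are summed. A minimal numpy sketch of that scoring step, with made-up random "token embeddings" standing in for real model output:

```python
import numpy as np

def maxsim_score(query_emb, doc_emb):
    """ColBERT-style late-interaction (MaxSim) score.

    query_emb: (num_query_tokens, dim) L2-normalized token embeddings
    doc_emb:   (num_doc_tokens, dim)   L2-normalized token embeddings
    Each query token is matched to its most similar document token;
    the per-token maxima are summed into a single relevance score.
    """
    sim = query_emb @ doc_emb.T   # (q_tokens, d_tokens) cosine similarities
    return sim.max(axis=1).sum()  # best doc token per query token, summed

def normalize(x):
    # L2-normalize each row so dot products are cosine similarities
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Toy example: random vectors in place of real encoder output
rng = np.random.default_rng(0)
q = normalize(rng.normal(size=(4, 8)))    # 4 query tokens, dim 8
d = normalize(rng.normal(size=(12, 8)))   # 12 document tokens, dim 8
score = maxsim_score(q, d)
```

Any encoder that emits per-token embeddings -- including a repurposed decoder with its causal mask removed or adapted -- can feed this scoring function; the training question is making those per-token embeddings useful for it.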

brando90 commented 2 months ago

Cool! @bclavie, do you have any advice on how to build good embedding methods for mathematics? What is the best way to go about training a ColBERT model, in your opinion?

Related: https://github.com/stanfordnlp/dspy/discussions/1428