explodinggradients / ragas

Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines
https://docs.ragas.io
Apache License 2.0

I have one doubt: is the critic LLM we use in evaluation different from the LLM we used in the RAG architecture? #1092

Closed Saurabh8255 closed 3 weeks ago

Saurabh8255 commented 1 month ago

[ ] I checked the documentation and related resources and couldn't find an answer to my question.

Your Question what is unclear to you? What would you like to know?

Code Examples This community speaks code. Share your code snippets to help us understand your question better.

Additional context Anything else you want to share with us?

dosubot[bot] commented 1 month ago

It seems like you have a question about the Ragas library. Could you please provide more details about what you're unclear about or what specific information you're looking for? If you have any code snippets or additional context, that would also be helpful in understanding your question better.

To continue talking to Dosu, mention @dosu.

Gwenn-LR commented 1 month ago

@Saurabh8255 I was wondering about a similar issue: the usefulness of the critic in the TestsetGenerator.from_langchain method (I infer you were talking about this method, since there is no critic parameter in the evaluate method). After searching the repository, I can't find any reference to the critic parameter apart from the definition of the TestsetGenerator.critic_llm attribute.

I'll let @jjmachan or @shahules786 give a clear answer and close this issue, but I think it might be a relic of a previous version.
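
For reference, here is roughly where the critic shows up on the generator side (a minimal sketch assuming the ragas 0.1.x API discussed in this thread and the LangChain wrappers; the model choices are only placeholders):

```python
# Sketch, assuming the ragas 0.1.x API and langchain-openai wrappers.
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.testset.generator import TestsetGenerator

generator_llm = ChatOpenAI(model="gpt-3.5-turbo")  # drafts questions/answers
critic_llm = ChatOpenAI(model="gpt-4")             # judges/filters the drafts
embeddings = OpenAIEmbeddings()

# critic_llm only appears here, on the testset-generation side;
# evaluate() has no such parameter.
generator = TestsetGenerator.from_langchain(generator_llm, critic_llm, embeddings)
```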

Saurabh8255 commented 1 month ago

After some research I found that it is better to use a different LLM as the critic, but make sure it is a larger model than the one used in the RAG architecture, for better evaluation.

Gwenn-LR commented 1 month ago

Yeah, you are right both theoretically and, as it turns out, empirically! I couldn't find where critic_llm was used at first, but I finally managed to track it down: it is used in TestsetGenerator.init_evolution:l226-232 to define filters, as described in the guide Using Ragas Critic Model instead of GPT-4.

jjmachan commented 4 weeks ago

I see you folks have figured out the answer 🙂 but I'll add my thoughts here too. The reason there are both generator and critic models is that, while developing, we found that the critic parts of the pipeline (which decide whether a generated question or the chosen contexts are useful for further processing) need a better model for more steerability, so we have more control over specifying what the correct parameters are.

We did a small experiment to build a custom model for this, which @Gwenn-LR has shared, but we are revamping the testset generation tools with #1016.
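
To illustrate the idea (this is not the actual ragas filter implementation, just a hypothetical sketch of the kind of gate the critic provides), the critic is essentially an LLM-as-judge check over the generator's output:

```python
# Illustrative sketch only -- not the actual ragas internals.
# `critic_llm` is assumed to be any LangChain chat model exposing .invoke().

def question_is_usable(critic_llm, question: str, context: str) -> bool:
    """Ask the critic whether a generated question is clear and answerable
    from the given context; only usable questions continue in the pipeline."""
    prompt = (
        "Answer strictly 'yes' or 'no'. Is the following question unambiguous "
        f"and answerable from the context?\n\nQuestion: {question}\n\nContext: {context}"
    )
    verdict = critic_llm.invoke(prompt).content.strip().lower()
    return verdict.startswith("yes")
```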

Gwenn-LR commented 4 weeks ago

Thanks for your answer @jjmachan, and also for all the amazing work of your team!

Since we are discussing the differences between the LLMs used in your package, I'll allow myself to ask another question on this matter. Do not hesitate to ask me to move it to another issue; I'll do so at your convenience. Otherwise, here are my thoughts (accessible through the spoiler dropdown):
After analyzing your repository, I've come to the conclusion that there are potentially up to 6 models that can be used end-to-end:

- 1 embedding function as the attribute of the synthetic generator `TestsetGenerator.embeddings`, and 1 for the data vectorization in the RAG pipeline to evaluate (here named `rag_embedding`).
- 2 LLMs, as discussed previously, as attributes of the generator `TestsetGenerator` (`generator_llm` and `critic_llm`), 1 used during evaluation with any metric inheriting from `MetricWithLLM` (`llm`), and finally one from the RAG pipeline itself (here named `rag_llm`).

I haven't explored these parameters and the consequences of their correlation, and I think sometimes a single model might be used for different attributes, while for others it might be better to decouple them. I've finally arrived at an ordering based on LLM size and robustness (a practical basis of evaluation for a common user), and I was wondering if you could send me your thoughts about it:

- `embeddings` =? `rag_embedding`
- `critic_llm` = `llm` >= `generator_llm` > `rag_llm`

I think the embeddings, even though they serve different objectives, should define a single domain. However, having an embedding for the generation of your synthetic dataset different from the one used in the RAG pipeline might also introduce more diverse generated data, which could be beneficial to the overall evaluation. So I can't reach any clear conclusion with my current knowledge.

Since the objective of `critic_llm` is the same as that of `llm`, namely the evaluation of different generations (the triplet `question` <-> `context` <-> `ground_truth` for `critic_llm`, and the respective metric for `llm`), I think the best available model should be used there. The `generator_llm` is used to generate the synthetic dataset, so the better it is, the more accurate and diverse your synthetic dataset should be. Moreover, if you (or more specifically, your GPU) can't handle loading more than 2 models, the first models that should be separated are `generator_llm` and `rag_llm`, since the evaluation of the latter's generation depends on the quality of the former's generation. Furthermore, there is a good chance that your RAG is meant to be used inside a service and should be as light and fast as possible, while the synthetic dataset only has to be generated upstream.

To conclude:

1. Focus on your `rag_llm` definition.
2. Use a "better" LLM for `generator_llm` if you need to generate a synthetic dataset.
3. If you don't, and/or if you want to use a third model for every evaluation (of the synthetic generation and of the metrics), you should use a `critic_llm` and an `llm` which are among the best available.
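
To make the mapping concrete, here is a rough sketch of where each of those models would plug in (assuming the 0.1.x ragas API and LangChain wrappers; the RAG pipeline itself is schematic and the model names are placeholders, not recommendations):

```python
# Sketch of the up-to-six models discussed above (ragas 0.1.x API assumed).
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness
from ragas.testset.generator import TestsetGenerator

# Synthetic testset generation: generator_llm, critic_llm, embeddings
generator_llm = ChatOpenAI(model="gpt-3.5-turbo")
critic_llm = ChatOpenAI(model="gpt-4")
embeddings = OpenAIEmbeddings()
generator = TestsetGenerator.from_langchain(generator_llm, critic_llm, embeddings)

# The RAG pipeline under test has its own models: rag_llm, rag_embedding
rag_llm = ChatOpenAI(model="gpt-3.5-turbo")
rag_embedding = OpenAIEmbeddings()

# Metric evaluation can use yet another judge model: llm
eval_llm = ChatOpenAI(model="gpt-4")
# `dataset` would hold the RAG pipeline's answers over the generated testset:
# result = evaluate(dataset, metrics=[faithfulness, answer_relevancy], llm=eval_llm)
```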

I'm eager to read your opinion about my analysis and to see the evolution of this package. Do not hesitate to ask for any help if needed; I would be happy to lend you a helping hand!

Have a nice day !

jjmachan commented 3 weeks ago

@Gwenn-LR thank you for the writeup, but it was a bit long so I didn't quite understand what you actually had in mind, so correct me if I'm wrong.

but for this

embeddings =? rag_embedding
critic_llm = llm >= generator_llm > rag_llm

the correct order is embeddings = rag_embedding (right now, since we don't see much difference in performance and we don't have a big retriever component). It might even be said that embeddings < rag_embedding, because our actual use of the retriever is not that heavy.

The 2nd one is roughly correct, but only because of the current setup. generator_llm = rag_llm, though, and critic_llm can be a smaller fine-tuned model.

Does that address all of your thoughts?

Gwenn-LR commented 3 weeks ago

Thanks for your answer, and yes, it clearly addresses my point!

github-actions[bot] commented 3 weeks ago

It seems the issue was answered, closing this now.