binghangli378 opened this issue 4 months ago

Recently, I noticed RagChecker, which is similar to Ragas. It provides a new perspective on evaluating RAG pipelines, focusing separately on the retrieval and generation components.

This new perspective would provide a more detailed evaluation of the model's performance, allowing a deeper understanding of how different types of retrieved chunks impact the quality of the generated answer.

I am willing to design some new evaluation methods based on it. Please let me know if you are open to this idea, and I can provide further assistance or code examples.
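To make this concrete, here is a minimal sketch of what component-wise metrics could look like. The function names and claim labels are illustrative assumptions, not Ragas or RagChecker APIs:

```python
# Hypothetical sketch of component-wise RAG evaluation; not a real Ragas or RagChecker API.

def retriever_metrics(retrieved: list[str], relevant: set[str]) -> dict[str, float]:
    """Chunk-level precision/recall for the retrieval step alone."""
    hits = sum(1 for chunk_id in retrieved if chunk_id in relevant)
    return {
        "precision": hits / len(retrieved) if retrieved else 0.0,
        "recall": hits / len(relevant) if relevant else 0.0,
    }


def generator_metrics(claims: list[dict]) -> dict[str, float]:
    """Claim-level metrics for the generation step. Each claim dict carries
    'grounded' (supported by some retrieved chunk) and 'correct' (agrees with
    the ground truth); in practice these labels would come from an LLM judge."""
    n = len(claims)
    if n == 0:
        return {"faithfulness": 0.0, "self_knowledge": 0.0, "correctness": 0.0}
    grounded = sum(1 for c in claims if c["grounded"])
    correct = sum(1 for c in claims if c["correct"])
    return {
        "faithfulness": grounded / n,
        "self_knowledge": (n - grounded) / n,  # roughly 1 - faithfulness
        "correctness": correct / n,
    }
```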
@binghangli378 that is a very interesting idea; reopening this to track it further. Would you still like to help out on this?
@shahules786 something we can consider for #1010?
@binghangli378 Yes, this is very interesting. From what I observed they have a few more metrics that are not available in Ragas (note I just added #1174). I think the two metrics that would be beneficial are: 1) self-knowledge: this would be something like a 1 - faithfulness score, used to measure how much of the generated response comes from the LLM's own knowledge rather than the retrieved context. 2) noise sensitivity: this is more interesting; I think what they are trying to achieve is

noise sensitivity = (number of incorrect claims in the generated answer that came from irrelevant chunks) / (total number of claims in the answer)
This could be used to understand how badly noise in the context affects the quality of the generated answer. I also found this paper showing that noise in the retrieved context affects answer quality.
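As a rough sketch of that formula (the `Claim` record and its labels are hypothetical; in practice the correctness and chunk-attribution labels would come from an LLM judge):

```python
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    correct: bool           # does the claim agree with the ground-truth answer?
    from_irrelevant: bool   # is it attributable only to irrelevant chunks?

def noise_sensitivity(claims: list[Claim]) -> float:
    """Incorrect claims that came from irrelevant chunks / total claims in the answer."""
    if not claims:
        return 0.0
    noisy = sum(1 for c in claims if not c.correct and c.from_irrelevant)
    return noisy / len(claims)

answer_claims = [
    Claim("Paris is the capital of France", correct=True, from_irrelevant=False),
    Claim("The Eiffel Tower opened in 1890", correct=False, from_irrelevant=True),
]
print(noise_sensitivity(answer_claims))  # 0.5
```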
Tagging you guys in case you're interested in contributing. I have added them to the metrics roadmap. @sky-2002 @vaishakhRaveendran
I can take up noise sensitivity.
In fact, we had discussed something similar to this: what I was referring to as attributing each claim in the answer to some context.
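For illustration, attribution could be as simple as matching each claim to its most similar chunk. The token-overlap score below is just a crude stand-in for embedding similarity or an NLI check, and the helper names are hypothetical:

```python
def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase token sets; a crude similarity stand-in."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def attribute_claims(claims: list[str], chunks: list[str]) -> dict[str, int]:
    """Map each claim to the index of its best-matching chunk (assumes chunks is non-empty)."""
    return {
        claim: max(range(len(chunks)), key=lambda i: token_overlap(claim, chunks[i]))
        for claim in claims
    }
```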
@sky-2002 Sure, can you please comment on this issue so that I can assign it to you?