Closed: jalling97 closed this issue 2 months ago
Most popular guides for setting up RAG evaluation datasets lean toward synthetic dataset generation rather than sharing out specific datasets that contain questions, answers, and context. Going forward, it will likely make the most sense to generate sample datasets from our contextual documents. Some potential options include:
For Needle in a Haystack (NIAH) evaluations, we can leverage some existing datasets, potentially including:
These proved to be pretty good, but I think it would be more advantageous to generate some simple examples instead.
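Since simple generated examples look preferable here, the generation step could look roughly like the sketch below. This is an illustrative assumption, not an existing dataset or tool: it plants a known "needle" sentence at a random depth in filler text and records the question/answer pair plus the relative depth, which is the variable NIAH evaluations typically sweep.

```python
import random

def make_niah_sample(needle, filler_sentences, haystack_len=20, seed=0):
    """Build one Needle-in-a-Haystack sample: insert a known 'needle'
    sentence at a random depth inside filler text, and record the
    question the model must answer by locating that needle."""
    rng = random.Random(seed)
    haystack = [rng.choice(filler_sentences) for _ in range(haystack_len)]
    depth = rng.randrange(len(haystack) + 1)
    haystack.insert(depth, needle)
    return {
        "context": " ".join(haystack),
        "question": "What is the secret passphrase mentioned in the text?",
        "answer": needle,
        "needle_depth": depth / haystack_len,  # relative position, 0.0 = start
    }

filler = [
    "The weather report predicted light rain.",
    "Quarterly numbers were filed on time.",
    "The conference was rescheduled to June.",
]
sample = make_niah_sample("The secret passphrase is 'blue heron'.", filler)
```

Sweeping `haystack_len` and `seed` gives needles at varying context lengths and depths, which is usually the axis NIAH results are plotted against.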
Final determination:
Of the open-source datasets available, none proved more valuable than creating a few datasets from scratch, so no open-source datasets will be used specifically for RAG evaluations.
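For the from-scratch datasets, a minimal template-based sketch of what generation from contextual documents might look like is below. Everything here is an assumption for illustration: the function name, the question templates, and the placeholder answer step (a real pipeline would use an LLM to draft the question and answer from each chunk).

```python
import random

def generate_qa_pairs(documents, num_pairs=3, seed=0):
    """Sketch of synthetic QA generation: sample document chunks and pair
    a templated question with the chunk as context. The answer is a
    placeholder; in practice an LLM would write question and answer."""
    rng = random.Random(seed)
    templates = [
        "What does this passage say about {topic}?",
        "Summarize the key point regarding {topic}.",
    ]
    pairs = []
    for _ in range(num_pairs):
        doc = rng.choice(documents)
        pairs.append({
            "question": rng.choice(templates).format(topic=doc["topic"]),
            "answer": doc["text"],   # placeholder; an LLM would draft this
            "context": doc["text"],
        })
    return pairs

docs = [
    {"topic": "retrieval", "text": "Retrieval selects relevant chunks."},
    {"topic": "generation", "text": "Generation conditions on retrieved text."},
]
dataset = generate_qa_pairs(docs)
```

Each record carries the question/answer/context triple described above, so the output can be evaluated the same way an open-source RAG dataset would be.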
Description
We need to choose a small number (1-3, depending on size) of open-source RAG evaluation datasets. Having at least one open-source dataset lets us start running basic evaluations and iterating on improvements.
Relevant Links
Many datasets can be found on HuggingFace