Review the licenses for all datasets

davidsbatista commented 4 months ago

Regarding the two datasets for which we already have working evaluation code:

The ARAGOG dataset:

It is a subset of this dataset: https://huggingface.co/datasets/jamescalam/ai-arxiv
See Section 3.1 (https://arxiv.org/pdf/2404.01037): "This study utilizes a tailored dataset derived from the AI ArXiv collection, accessible via Hugging Face (James Calam, 2023). The dataset consists of 423 selected research papers centered around the themes of AI and LLMs, sourced from arXiv. This selection offers a comprehensive foundation for constructing a database to test the RAG techniques and creating a set of evaluation data to assess their effectiveness."
We are using the question/answer pairs, which were human-validated. We should cite the paper, and that should be enough.

The SQuAD dataset:

mrm1001 commented 4 months ago

For the mini ESG dataset:

The main files are all public:

deepset-ai / haystack-evaluation