explodinggradients / ragas

Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines
https://docs.ragas.io
Apache License 2.0
5.2k stars 471 forks

Testset Generation: Random node selection improvements #641

Closed gleason-m closed 1 day ago

gleason-m commented 3 months ago

Describe the Feature

The TestsetGenerator currently selects random nodes from the docstore with replacement. This means the same node may have multiple test questions generated for it while other nodes have none. I'd like this revisited to see whether it makes sense to select nodes from the docstore without replacement.
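To illustrate the difference, here is a minimal sketch that uses NumPy's `Generator.choice` on a toy array of node IDs. The IDs, the seed, and the variable names are assumptions made for illustration; none of this is ragas code. It draws as many samples as there are nodes, once with replacement and once without, then counts how many distinct nodes each draw covers.

```python
import numpy as np

# Toy stand-ins for docstore nodes; these IDs are illustrative only.
node_ids = np.arange(10)
rng = np.random.default_rng(seed=0)  # arbitrary seed, chosen only for repeatability

# With replacement: duplicates are possible, so some nodes may never be picked.
with_replacement = rng.choice(node_ids, size=len(node_ids), replace=True)

# Without replacement: each node is picked exactly once.
without_replacement = rng.choice(node_ids, size=len(node_ids), replace=False)

covered_with = len(set(with_replacement.tolist()))
covered_without = len(set(without_replacement.tolist()))

print(f"with replacement: {covered_with} of {len(node_ids)} nodes covered")
print(f"without replacement: {covered_without} of {len(node_ids)} nodes covered")
```

Sampling without replacement always covers every node when the sample size equals the number of nodes. Sampling with replacement only does so when no duplicate happens to be drawn.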

Why is the feature important for you?

For my use case, I want to generate a question for each document I provide. This isn't possible without overriding the docstore's get_random_nodes implementation to select without replacement, e.g.:

from typing import List

import numpy as np
from ragas.testset.docstore import InMemoryDocumentStore, Node
from ragas.testset.utils import rng


class NoReplacementInMemoryDocumentStore(InMemoryDocumentStore):
    def get_random_nodes(self, k=1) -> List[Node]:
        # Each full pass over the docstore contributes every node once.
        node_copies = k // len(self.nodes)
        remainder = k % len(self.nodes)

        selected_nodes = self.nodes * node_copies
        if remainder == 0:
            return selected_nodes

        # Fill the remainder by sampling distinct nodes (no replacement).
        random_nodes = rng.choice(
            np.array(self.nodes),
            size=remainder,
            replace=False
        ).tolist()

        # extend, not append: append would nest the list as a single element.
        selected_nodes.extend(random_nodes)
        return selected_nodes

shahules786 commented 3 months ago

Hey @gleason-m, thanks for the feedback. I am revisiting the node selection mechanism now, and I'll keep this in mind.