explodinggradients / ragas

Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines
https://docs.ragas.io
Apache License 2.0
6.63k stars, 649 forks

Random seed fixed. #1140

Closed Gwenn-LR closed 1 month ago

Gwenn-LR commented 1 month ago

Describe the bug
While looking for the answer to issue #1138, I noticed that my generation problem came from an empty generated row. It happened in every test, and when I tried to generate a second row, it was always the same row. Looking through the repository, I finally found that the seed of numpy.random is fixed.

Ragas version: 0.1.11
Python version: 3.10.12

Code to Reproduce
Here is a script that reuses your implementation and shows how the random choice behaves when an experiment is relaunched:

import numpy as np

if __name__ == "__main__":
    test = [0, 1, 2, 3]

    for exp_num in range(3):
        print(f"Experiment n°{exp_num}")
        rng = np.random.default_rng(seed=42)
        for choice_num in range(10):
            print(f"\tChoice n°{choice_num}: {rng.choice(test)}")

Error trace

Experiment  Choices
1           0, 3, 2, 1, 1, 3, 0, 2, 0, 0
2           0, 3, 2, 1, 1, 3, 0, 2, 0, 0
3           0, 3, 2, 1, 1, 3, 0, 2, 0, 0

Expected behavior
If the user does not provide a seed, the generation should be random rather than repeated from run to run.
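
A minimal sketch of this expected behavior, assuming hypothetical helpers `make_rng` and `draw_choices` (neither is part of Ragas): NumPy's `default_rng` already pulls fresh entropy from the OS when `seed` is `None`, so an optional seed only needs to be passed through.

```python
from typing import Optional

import numpy as np


def make_rng(seed: Optional[int] = None) -> np.random.Generator:
    """Hypothetical helper: seeded when a seed is given, OS entropy otherwise."""
    return np.random.default_rng(seed)


def draw_choices(rng: np.random.Generator, n: int = 10) -> list:
    """Draw n values from the same test list used in the reproduction script."""
    test = [0, 1, 2, 3]
    return [int(rng.choice(test)) for _ in range(n)]


if __name__ == "__main__":
    # Same seed: every experiment repeats the same sequence.
    print(draw_choices(make_rng(42)))
    print(draw_choices(make_rng(42)))
    # No seed: sequences will generally differ between experiments and runs.
    print(draw_choices(make_rng()))
    print(draw_choices(make_rng()))
```

With this shape, reproducibility becomes opt-in: callers who want repeatable output pass an integer, and everyone else gets a different sequence per run.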

Additional context
I'll submit a PR that adds this seed and a numpy.random.Generator to RunConfig, and uses that generator wherever this rng was used.
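
One way such an integration could look, as a sketch only: `SeededRunConfig` and `pick_keyphrase` are hypothetical names used for illustration and are not the Ragas API or the content of the PR. The config stores the optional seed and builds a single `numpy.random.Generator` that callers share.

```python
from dataclasses import dataclass, field
from typing import List, Optional

import numpy as np


@dataclass
class SeededRunConfig:
    """Hypothetical stand-in for RunConfig; only seed-related fields are shown."""

    seed: Optional[int] = None
    # Created in __post_init__ so everything sharing this config shares one generator.
    rng: np.random.Generator = field(init=False, repr=False, compare=False)

    def __post_init__(self) -> None:
        # seed=None makes NumPy draw fresh entropy from the OS.
        self.rng = np.random.default_rng(self.seed)


def pick_keyphrase(keyphrases: List[str], run_config: SeededRunConfig) -> str:
    """Pick one keyphrase with the config's generator instead of a hard-coded seed."""
    return str(run_config.rng.choice(keyphrases))


if __name__ == "__main__":
    keyphrases = ["retrieval", "generation", "evaluation", "seed"]
    print(pick_keyphrase(keyphrases, SeededRunConfig(seed=42)))  # repeatable
    print(pick_keyphrase(keyphrases, SeededRunConfig()))  # varies between runs
```

Keeping the generator on the config avoids touching NumPy's global state, so other code that uses `np.random` is unaffected.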

dosubot[bot] commented 1 month ago

To address the empty row caused by the fixed numpy.random seed, you can modify the code so that the seed is optional. If no seed is provided, the generation will be random. Here's how you can integrate this with the RunConfig class:

  1. Update RunConfig to include an optional seed parameter:
import typing as t
from dataclasses import dataclass

@dataclass
class RunConfig:
    """
    Configuration for timeouts, retries, and random seed.
    """
    timeout: int = 60
    max_retries: int = 10
    max_wait: int = 60
    max_workers: int = 16
    thread_timeout: float = 80.0
    exception_types: t.Union[
        t.Type[BaseException],
        t.Tuple[t.Type[BaseException], ...],
    ] = (Exception,)
    log_tenacity: bool = False
    seed: t.Optional[int] = None  # Add seed parameter
  2. Modify the SimpleEvolution class to use the seed from RunConfig:
class SimpleEvolution(Evolution):
    seed_question_prompt: Prompt = field(default_factory=lambda: seed_question_prompt)

    def __init__(self, run_config: RunConfig):
        self.run_config = run_config
        # np.random.seed sets NumPy's global legacy state; with seed=None it is left untouched.
        if self.run_config.seed is not None:
            np.random.seed(self.run_config.seed)

    async def _aevolve(
        self, current_tries: int, current_nodes: CurrentNodes
    ) -> EvolutionOutput:
        assert self.docstore is not None, "docstore cannot be None"
        assert self.node_filter is not None, "node filter cannot be None"
        assert self.generator_llm is not None, "generator_llm cannot be None"
        assert self.question_filter is not None, "question_filter cannot be None"

        merged_node = self.merge_nodes(current_nodes)
        passed = await self.node_filter.filter(merged_node)
        if not passed["score"]:
            current_nodes = self._get_new_random_node()
            return await self.aretry_evolve(
                current_tries, current_nodes, update_count=False
            )

        logger.debug("keyphrases in merged node: %s", merged_node.keyphrases)
        results = await self.generator_llm.generate(
            prompt=self.seed_question_prompt.format(
                context=merged_node.page_content,
                keyphrase=np.random.choice(np.array(merged_node.keyphrases), size=1)[0],
            )
        )
        seed_question = results.generations[0][0].text
        logger.info("seed question generated: %s", seed_question)
        is_valid_question, feedback = await self.question_filter.filter(seed_question)

        if not is_valid_question:
            seed_question, current_nodes = await self.fix_invalid_question(
                seed_question, current_nodes, feedback
            )
            logger.info("rewritten question: %s", seed_question)
            is_valid_question, _ = await self.question_filter.filter(seed_question)
            if not is_valid_question:
                current_nodes = self._get_new_random_node()
                return await self.aretry_evolve(current_tries, current_nodes)

        return seed_question, current_nodes, "simple"

    def __hash__(self):
        return hash(self.__class__.__name__)

    def adapt(self, language: str, cache_dir: t.Optional[str] = None) -> None:
        super().adapt(language, cache_dir)
        self.seed_question_prompt = self.seed_question_prompt.adapt(
            language, self.generator_llm, cache_dir
        )

    def save(self, cache_dir: t.Optional[str] = None) -> None:
        super().save(cache_dir)
        self.seed_question_prompt.save(cache_dir)
  3. Example usage:
run_config = RunConfig(seed=None)  # No seed for random generation
simple_evolution = SimpleEvolution(run_config=run_config)

By making these changes, you ensure that the random seed is optional. If a seed is provided, the results will be reproducible; otherwise, the generation will be random [1].
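
A caveat worth illustrating: `np.random.seed` sets NumPy's global legacy state, which every `np.random.*` caller in the process shares, while a `Generator` from `default_rng` keeps its own state. The sketch below uses hypothetical function names and only demonstrates that separation; it is not Ragas code.

```python
import numpy as np


def global_draws(seed: int, n: int = 5) -> list:
    """Seed the global legacy state, then draw from it."""
    np.random.seed(seed)
    return [int(np.random.choice(4)) for _ in range(n)]


def global_draws_with_local_rng(seed: int, n: int = 5) -> list:
    """Same global draws, interleaved with draws from a separate Generator."""
    np.random.seed(seed)
    local_rng = np.random.default_rng(123)
    out = []
    for _ in range(n):
        local_rng.choice(4)  # advances only local_rng's own state
        out.append(int(np.random.choice(4)))
    return out


if __name__ == "__main__":
    # The two lists match: the local Generator never touches the global state.
    print(global_draws(0) == global_draws_with_local_rng(0))
```

Because the two sources of randomness do not interact, a per-config `Generator` can be seeded or left unseeded without changing what other `np.random` users in the same process observe.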
