explodinggradients / ragas

Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines
https://docs.ragas.io
Apache License 2.0

Any form of caching / reusing data while generating a synthetic dataset? #1044

Open NiklasClausius opened 3 months ago

NiklasClausius commented 3 months ago

[x] I checked the documentation and related resources and couldn't find an answer to my question.

Hi, I'm currently getting into evaluating my RAG system, but I ran into a problem while following the Get Started guide for generating a synthetic dataset.

Your Question Is there a way to save intermediate steps while generating a synthetic dataset? I have a fairly large dataset to load (I pickled the docs for later use, which saves quite some time), and I'm using a local embedding model on a machine with only a CPU, so building the embeddings during testset generation takes a long time. An exception was then thrown in the generation phase because the API rate limit was exceeded, which I assume lost all the data. Is there a way to persist the state after building the embeddings? It seems that by this point there have already been a lot of calls to the generator LLM as well. From the Get Started guide and the Core Concepts docs I can't tell whether a persistent state is achievable.

It would be great if I could just restart the generation step on the next try. Does Ragas have functionality for that? I hope I was able to explain my problem.

I'm really thankful for any help :)

dosubot[bot] commented 3 months ago

Hey @NiklasClausius! 👋 I'm here to help you with any bugs or questions you have while waiting for a human maintainer.

Yes, there is a way to save progress across the generation of a synthetic dataset in Ragas, so that you don't lose work after building embeddings and can restart the generation step if an exception occurs. The generator's save method writes the state of the document store and the evolutions to a specified cache directory; that state can then be reloaded to resume after a failure.

Here is the relevant code snippet:

def save(
    self, evolutions: t.List[Evolution], cache_dir: t.Optional[str] = None
) -> None:
    """
    Save the docstore prompts to a path.
    """
    assert isinstance(
        self.docstore, InMemoryDocumentStore
    ), "Must be an instance of in-memory docstore"
    assert self.docstore.extractor is not None, "Extractor is not set"

    self.docstore.extractor.save(cache_dir)
    for evolution in evolutions:
        assert evolution.node_filter is not None, "NodeFilter is not set"
        assert evolution.question_filter is not None, "QuestionFilter is not set"
        if isinstance(evolution, ComplexEvolution):
            assert (
                evolution.evolution_filter is not None
            ), "EvolutionFilter is not set"
        evolution.save(cache_dir=cache_dir)

This method ensures that the state of the document store and each evolution is saved, which can then be reloaded to resume the generation process if needed [1].
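As a more general pattern (independent of Ragas internals), expensive intermediate results such as embeddings can be checkpointed to disk so an interrupted run resumes where it left off instead of recomputing everything. Below is a minimal, hedged sketch using a pickle file as the cache; `embed_fn` is a hypothetical stand-in for whatever embedding call your pipeline makes, not a Ragas API:

```python
import os
import pickle

def cached_embeddings(docs, cache_path, embed_fn):
    """Compute embeddings for docs, reusing any already saved to cache_path.

    The cache is flushed to disk after every new embedding, so a crash or
    rate-limit exception loses at most the single in-flight document.
    """
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            cache = pickle.load(f)

    for doc in docs:
        if doc not in cache:
            cache[doc] = embed_fn(doc)
            # Persist immediately so a later restart can skip this doc
            with open(cache_path, "wb") as f:
                pickle.dump(cache, f)

    return [cache[doc] for doc in docs]
```

On a second invocation with the same `cache_path`, only documents missing from the cache hit `embed_fn`, which is the behavior you'd want before an API-limited generation phase.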
