Add cost tracking to dataset generator and allow dataset size control

alexander-zuev commented 6 days ago

Describe the Feature

There are 2 problems with the current dataset generator approach:

1. It doesn't support cost tracking.
2. It chunks and embeds the full document EVEN if testset_size == 1. As a user, I want to be able to generate smaller datasets efficiently, but the current approach doesn't take testset_size into account when chunking and embedding the documents.

Why is the feature important for you?

1. Cost tracking is important for monitoring evaluation costs end-to-end: a) the cost of dataset generation (not supported) and b) the cost of evaluation runs (already supported).
2. A more efficient dataset generation approach is important because generation can currently be extremely inefficient: it ignores testset_size. There should be a smart way to decide whether to chunk and embed the full document, based on how many test questions need to be generated.
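
To make the second point concrete, here is a rough sketch of the kind of size-aware selection I have in mind. This is not ragas code: `select_chunks_for_testset` and the `chunks_per_question` ratio are hypothetical names and values I made up for illustration, and a real heuristic would probably need to be smarter than uniform sampling.

```python
import random


def select_chunks_for_testset(chunks, testset_size, chunks_per_question=3, seed=0):
    """Pick only as many chunks as the requested test set plausibly needs.

    Hypothetical helper: `chunks_per_question` is an assumed ratio,
    not a value taken from ragas.
    """
    if testset_size < 1:
        raise ValueError("testset_size must be >= 1")
    # Never ask for more chunks than the document actually has.
    budget = min(len(chunks), testset_size * chunks_per_question)
    rng = random.Random(seed)
    # Sample indices, then sort so the chosen chunks keep document order.
    picked = sorted(rng.sample(range(len(chunks)), budget))
    return [chunks[i] for i in picked]


chunks = [f"chunk-{i}" for i in range(1000)]
subset = select_chunks_for_testset(chunks, testset_size=1)
print(len(subset))  # 3 chunks to embed instead of 1000
```

With something like this, only the selected subset would need to be embedded, so the embedding cost would scale with testset_size rather than with document length.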


shahules786 commented 6 days ago

Hey @Twist333d, very valid concerns. 1) We plan to add support for cost estimation for test generation soon. 2) When efficiently generating (and regenerating) test data points from a single document set, we also have to make sure that newly generated points were not already generated before. So we intend to make this process efficient by doing a one-time preprocessing of the documents and then letting you persist the intermediate form (KG). In that case, one could repeatedly sample data points from the same corpus without redoing the preprocessing step. What do you think?
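
As a rough illustration of the idea (a sketch only, not the actual ragas implementation): preprocess once, persist the intermediate form, then sample from it repeatedly while remembering what was already used. The JSON layout and the names `save_intermediate` and `sample_unused` are hypothetical, and plain dicts stand in for KG nodes.

```python
import json
import os
import random
import tempfile
from pathlib import Path


def save_intermediate(path, nodes):
    """Persist preprocessed nodes once (hypothetical on-disk format)."""
    Path(path).write_text(json.dumps({"nodes": nodes, "used_ids": []}))


def sample_unused(path, n, seed=None):
    """Sample n nodes that no earlier call has returned, and record them."""
    state = json.loads(Path(path).read_text())
    used = set(state["used_ids"])
    available = [node for node in state["nodes"] if node["id"] not in used]
    if n > len(available):
        raise ValueError(f"only {len(available)} unused nodes left")
    picked = random.Random(seed).sample(available, n)
    # Remember what was handed out so later calls cannot repeat it.
    state["used_ids"].extend(node["id"] for node in picked)
    Path(path).write_text(json.dumps(state))
    return picked


with tempfile.TemporaryDirectory() as tmp:
    kg_path = os.path.join(tmp, "intermediate.json")
    # One-time preprocessing result; toy nodes stand in for real KG nodes.
    save_intermediate(kg_path, [{"id": i, "text": f"node {i}"} for i in range(10)])
    first = sample_unused(kg_path, 3, seed=1)
    second = sample_unused(kg_path, 3, seed=1)
    overlap = {n["id"] for n in first} & {n["id"] for n in second}
    print(len(overlap))  # 0: the second batch never repeats the first
```

The expensive step runs once in `save_intermediate`; every later call to `sample_unused` only reads and updates the saved file.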

We will be continuously improving the new test gen, and would love to chat and understand more from you. https://cal.com/shahul-ragas/30min

alexander-zuev commented 6 days ago

@shahules786 thanks for the prompt response! I think what you describe does indeed address the two core concerns I had:

1. Being able to monitor costs for both dataset generation and evaluation
2. Ensuring efficient dataset generation even if testset_size is small
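
For the first point, what I'm picturing is roughly the following: one tracker that accumulates token usage per stage, so generation and evaluation costs can be read separately or together. This is only a sketch; `CostTracker`, the stage names, and the per-1k-token prices are hypothetical and not part of ragas.

```python
from dataclasses import dataclass, field


@dataclass
class CostTracker:
    """Accumulates token usage per stage (hypothetical helper, not ragas API)."""

    input_price_per_1k: float   # assumed price per 1,000 input tokens
    output_price_per_1k: float  # assumed price per 1,000 output tokens
    usage: dict = field(default_factory=dict)

    def record(self, stage, input_tokens, output_tokens):
        totals = self.usage.setdefault(stage, {"input": 0, "output": 0})
        totals["input"] += input_tokens
        totals["output"] += output_tokens

    def cost(self, stage=None):
        """Cost of one stage, or of all recorded stages when stage is None."""
        stages = [stage] if stage is not None else list(self.usage)
        total = 0.0
        for name in stages:
            t = self.usage.get(name, {"input": 0, "output": 0})
            total += t["input"] / 1000 * self.input_price_per_1k
            total += t["output"] / 1000 * self.output_price_per_1k
        return total


# Made-up prices and token counts, purely for illustration.
tracker = CostTracker(input_price_per_1k=1.0, output_price_per_1k=2.0)
tracker.record("generation", input_tokens=1000, output_tokens=500)
tracker.record("evaluation", input_tokens=2000, output_tokens=0)
print(tracker.cost("generation"))  # 2.0
print(tracker.cost())              # 4.0
```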

On the proposal above, how would a user manage the intermediate form? Would I need to save it / manage it locally?

explodinggradients / ragas

Add cost tracking to dataset generator and allow dataset size control #1506