MetaCopilot / dseval

https://metacopilot.github.io/dseval/
MIT License

DSEval: The annotation process employed for benchmark creation #8

Open Fardeen786-eng opened 3 months ago

Fardeen786-eng commented 3 months ago

Hi, I appreciate the effort to develop a benchmark that evaluates ML agent systems at every state. I am most curious about the annotation process used to create these benchmarks; I believe a tutorial on developing new benchmarks was one of the TODOs. I went through the paper (https://arxiv.org/pdf/2402.17168). Correct me if I am wrong: the 31 Kaggle datasets and their accompanying notebooks were used to produce problem sketches, which were then converted into individual problems (query, validator, etc.) that together form one problem set in the benchmark. Could I get more insight into this process, specifically how LLMs were used to generate the problems and how they were refined through human annotation? A rough sketch of how I currently picture a problem is below.
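
For concreteness, here is how I imagine an individual problem being structured. This is only my own illustration, not the actual dseval schema; all field and class names below are assumptions on my part.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

# Hypothetical illustration only -- not the actual dseval format.
# A "problem" pairs a natural-language query with a validator that
# checks the agent's answer; several problems form one problem set.

@dataclass
class Problem:
    query: str                                  # question posed to the agent
    validator: Callable[[Any], bool]            # checks the agent's result
    data_files: list[str] = field(default_factory=list)  # e.g. Kaggle CSVs

@dataclass
class ProblemSet:
    name: str
    problems: list[Problem]

# Example problem sketched from a Kaggle-style notebook step.
# The validator here is only a loose sanity check for illustration.
example = Problem(
    query="How many rows in train.csv have a missing 'Age' value?",
    validator=lambda answer: isinstance(answer, int) and answer >= 0,
    data_files=["train.csv"],
)
problem_set = ProblemSet(name="titanic-demo", problems=[example])
```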

ultmaster commented 3 months ago

Hi. There is a prompt for generating the problems in Appendix E.4; you can use it as a starting point to generate your own.

Frankly, we did not build a toolchain around this semi-automated process; it is quite ad hoc. The problems are generated via ad-hoc prompt scripts, and annotators are instructed to review and revise the resulting problems in VS Code. A rough sketch of that loop is below. If you think any details are missing from the appendix, please kindly let me know.
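
To illustrate the rough shape of the process, here is a minimal sketch, not the script we actually used. The prompt template, model name, and output layout are placeholders; in practice you would substitute the problem-generation prompt from Appendix E.4 and whatever LLM client you prefer.

```python
from pathlib import Path
from openai import OpenAI  # any LLM client would do; OpenAI shown as an example

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Placeholder text: in practice this would be the problem-generation prompt
# from Appendix E.4, filled in with a sketch of the source notebook.
PROMPT_TEMPLATE = """You are writing benchmark problems for a data science agent.
Given the following notebook sketch, propose problems as (query, validator) pairs.

Notebook sketch:
{sketch}
"""

def generate_draft_problems(sketch: str, model: str = "gpt-4o") -> str:
    """Ask the LLM for draft problems; the raw text is later reviewed by annotators."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(sketch=sketch)}],
    )
    return response.choices[0].message.content

def write_for_review(dataset_name: str, sketch: str, out_dir: str = "drafts") -> Path:
    """Dump the draft to a file so an annotator can revise it in an editor (e.g. VS Code)."""
    Path(out_dir).mkdir(exist_ok=True)
    path = Path(out_dir) / f"{dataset_name}.md"
    path.write_text(generate_draft_problems(sketch))
    return path
```

The human step is simply editing the generated files by hand before they are turned into final problem sets; there is no automated refinement loop.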