Fardeen786-eng opened this issue 4 months ago
Hi, I appreciate the effort to develop a benchmark that evaluates ML agent systems in every possible state, but I am most curious about the annotation process used to create these benchmarks. I believe a tutorial on developing new benchmarks was one of the TODOs. I went through the paper (https://arxiv.org/pdf/2402.17168); correct me if I am wrong, but it seems the 31 Kaggle datasets and the available notebooks are used to come up with problem sketches, which are then converted into individual problems (query, validator, etc.) that together form one problem set in the benchmark. Could I get more insight into this process, i.e., how LLMs are used to draft the problems and how they are refined through human annotation?

Hi. There is a prompt for generating the problems in Appendix E.4; you can use it as a starting point to generate problems.

Frankly, we did not build a toolchain to automate this semi-automated process, so our workflow is quite ad hoc: the problems are generated with ad-hoc prompt scripts, and annotators are instructed to review and revise the generated problems in VS Code. If you think any details are missing from the appendix, please let me know.
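To make the workflow a bit more concrete, below is a minimal sketch (not our actual script) of what one such ad-hoc generation step could look like. It assumes an Appendix E.4-style prompt template saved locally as `prompt_e4.txt` and an OpenAI-compatible client; the file names, model choice, and the query/validator JSON layout are all illustrative placeholders, not the real format used by the benchmark.

```python
# Hypothetical sketch of an ad-hoc problem-generation script (not the authors' actual code).
# It fills an Appendix E.4-style prompt with a problem sketch, asks an LLM to draft a
# problem (query + validator), and dumps the draft to disk for annotators to revise in VS Code.
import json
from pathlib import Path

from openai import OpenAI  # assumes an OpenAI-compatible endpoint is configured

client = OpenAI()

# Placeholder for the Appendix E.4 prompt; expected to contain a "{sketch}" slot.
PROMPT_TEMPLATE = Path("prompt_e4.txt").read_text()


def draft_problem(sketch: str, out_dir: Path) -> Path:
    """Generate one draft problem from a sketch and write it out for human review."""
    response = client.chat.completions.create(
        model="gpt-4o",  # model choice is illustrative
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(sketch=sketch)}],
    )
    draft = response.choices[0].message.content

    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "problem_draft.json"
    # The sketch/draft split below is illustrative; the real problem format
    # (query, validator, etc.) is whatever the benchmark's loader expects.
    out_path.write_text(json.dumps({"sketch": sketch, "draft": draft}, indent=2))
    return out_path


if __name__ == "__main__":
    path = draft_problem(
        sketch="Predict house prices from a Kaggle housing dataset; validator checks RMSE.",
        out_dir=Path("drafts/housing"),
    )
    print(f"Draft written to {path}; open it in VS Code and revise before adding it to the benchmark.")
```

In practice the annotators take over from here: they open the generated draft, correct the query, and rewrite the validator by hand before the problem is included in a problem set.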