Is your feature request related to a problem? Please describe.
The full SWE-bench is probably too difficult for the effect of scaffolding changes to show up. We could gather ~20 easy tasks from it and use them to measure the progress of our scaffolding and prompting work.
Describe the solution you'd like
For now, we can use the number of lines in the golden patch as the criterion for difficulty.
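A rough sketch of how the selection could work, in case it helps the discussion. It assumes each task is a dict with a `patch` field holding the golden patch as a unified diff and an `instance_id` field; the helper names `count_changed_lines` and `select_easy_tasks` are made up for illustration and are not part of any existing code.

```python
import heapq


def count_changed_lines(patch: str) -> int:
    """Count added and removed lines in a unified diff.

    Approximation: any line starting with '+++' or '---' is treated as a
    file header, so a changed line whose own content begins with '--' or
    '++' would be missed.
    """
    changed = 0
    for line in patch.splitlines():
        if line.startswith(("+++", "---")):
            continue  # file headers, not content changes
        if line.startswith(("+", "-")):
            changed += 1
    return changed


def select_easy_tasks(tasks, k=20):
    """Return the k tasks with the smallest golden patches.

    Assumes each task is a dict with a 'patch' key (hypothetical schema).
    """
    return heapq.nsmallest(k, tasks, key=lambda t: count_changed_lines(t["patch"]))


if __name__ == "__main__":
    example = [
        {"instance_id": "demo-1", "patch": "--- a/f.py\n+++ b/f.py\n@@ -1 +1 @@\n-x = 1\n+x = 2\n"},
        {"instance_id": "demo-2", "patch": "--- a/g.py\n+++ b/g.py\n@@ -1 +1,3 @@\n-y = 1\n+y = 2\n+z = 3\n+w = 4\n"},
    ]
    for task in select_easy_tasks(example, k=1):
        print(task["instance_id"], count_changed_lines(task["patch"]))
```

Patch size is only a proxy for difficulty, so the resulting list may still need a quick manual pass before we settle on the ~20 tasks.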