defog-ai / sql-eval

Evaluate the accuracy of LLM generated outputs
Apache License 2.0

Update benchmark questions (1/2) #178

Closed (wongjingping closed this 2 weeks ago)

wongjingping commented 2 weeks ago

We keep only advising (for its date columns), atis (for its unix timestamp columns), and yelp (for its year / month columns). We delete academic's and scholar's date_functions questions, which we will replace with questions from the 4 new schemas in a subsequent PR. This is because academic and scholar are semantically similar to advising, and their questions repeat the year/month-syntax questions in yelp.

Other single-question changes:

How many reviews were written for businesses located in California in the last 10 months?

Updated this date_functions question to use the actual date ranges in the data
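To illustrate why the date range matters for a "last N months" question: a window measured from today's date can miss every row if the data is older. Below is a minimal sketch, not the query from this PR, that anchors the window to an explicit reference month. The `business` and `review` tables, their columns, integer `month` values, and the helper names are all hypothetical stand-ins for a year/month-style schema like yelp's.

```python
import sqlite3


def count_recent_reviews(conn, state, ref_year, ref_month, months_back):
    """Count reviews for businesses in `state` within the last `months_back`
    months, ending at (and including) the reference month.

    Assumes hypothetical tables business(business_id, state) and
    review(business_id, year, month) with integer year/month columns.
    """
    # Collapse (year, month) into one month index so the window is a plain range.
    ref_index = ref_year * 12 + ref_month
    query = """
        SELECT COUNT(*)
        FROM review AS r
        JOIN business AS b ON b.business_id = r.business_id
        WHERE b.state = ?
          AND (r.year * 12 + r.month) > ? - ?
          AND (r.year * 12 + r.month) <= ?
    """
    params = (state, ref_index, months_back, ref_index)
    return conn.execute(query, params).fetchone()[0]


def build_demo_db():
    """Create a small in-memory database with made-up rows for the sketch."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE business (business_id TEXT PRIMARY KEY, state TEXT);
        CREATE TABLE review (
            rid INTEGER PRIMARY KEY,
            business_id TEXT,
            year INTEGER,
            month INTEGER
        );
        INSERT INTO business VALUES ('b1', 'CA'), ('b2', 'NY');
        INSERT INTO review (business_id, year, month) VALUES
            ('b1', 2021, 3),
            ('b1', 2021, 12),
            ('b1', 2020, 1),
            ('b2', 2021, 11);
        """
    )
    return conn
```

Passing the reference month as a parameter keeps the result deterministic regardless of when the benchmark runs; a query using the database's current date would need data that stays recent.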

Return the course id's that are offered in either semesters 1 or 2 and ends before 1pm and had an instructor on thursday

Modified 1 question in advising to filter on the time and day-of-week columns, since no other questions were testing for those columns in the advising schema.
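A rough sketch of what a time plus day-of-week filter can look like, assuming a simplified, hypothetical `course_offering` table with a zero-padded `HH:MM:SS` text `end_time` and one `'Y'`/`'N'` flag column per weekday. The real advising schema and the PR's actual query may differ, and the helper names here are made up for illustration.

```python
import sqlite3

# Weekday flag columns assumed to exist on the hypothetical table.
DAY_COLUMNS = {"monday", "tuesday", "wednesday", "thursday", "friday"}


def courses_ending_before(conn, semesters, cutoff, day_column="thursday"):
    """Return sorted course ids offered in `semesters` that end before
    `cutoff` (an 'HH:MM:SS' string) and are flagged 'Y' on `day_column`."""
    # The column name is interpolated into the SQL, so only allow known names.
    if day_column not in DAY_COLUMNS:
        raise ValueError(f"unknown day column: {day_column}")
    placeholders = ",".join("?" for _ in semesters)
    query = f"""
        SELECT DISTINCT course_id
        FROM course_offering
        WHERE semester IN ({placeholders})
          AND end_time < ?
          AND {day_column} = 'Y'
        ORDER BY course_id
    """
    rows = conn.execute(query, (*semesters, cutoff)).fetchall()
    return [row[0] for row in rows]


def build_offering_db():
    """Create a small in-memory table with made-up rows for the sketch."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE course_offering (
            course_id INTEGER,
            semester INTEGER,
            end_time TEXT,
            thursday TEXT
        );
        INSERT INTO course_offering VALUES
            (1, 1, '12:30:00', 'Y'),
            (2, 2, '14:00:00', 'Y'),
            (3, 1, '11:00:00', 'N'),
            (4, 3, '10:00:00', 'Y'),
            (5, 2, '09:00:00', 'Y');
        """
    )
    return conn
```

Comparing times as text only works here because the assumed values are zero-padded 24-hour strings; a schema that stores times differently would need a proper time comparison.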

Will make all of the changes before translating them over to the other dialects in 1 go.

rishsriv commented 2 weeks ago

Thank you! This looks good to me. Okay if we merge this along with the other upcoming PR with new questions added in? That way, we'll keep the current 25 questions for the date functions benchmark until the merge is done

wendy-aw commented 2 weeks ago

Thanks for the changes! I'll wait for you to complete the changes on this main set of questions before clarifying some other questions. Meanwhile I'll make changes to the translate script and dialects.py cos some errors were slipping thru (Thanks for spotting all those rishabh!)

wongjingping commented 2 weeks ago

Added 5 questions for broker and car_dealership each according to the following question types:

broker:

car_dealership:

Let me know if we'd prefer to add other types of date queries here!

Will add 10 more for the other 2 schemas later~

rishsriv commented 2 weeks ago

Thank you! Really appreciate the work on making the date functions evals more representative of actual usage :D