AlibabaResearch / DAMO-ConvAI

DAMO-ConvAI: The official repository which contains the codebase for Alibaba DAMO Conversational AI.
MIT License

BIRD dataset fixes for database descriptions & leaderboard update request (gpt4 and others) #52

Closed by Magolor 12 months ago

Magolor commented 1 year ago

Currently, I'm only working with BIRD's dev set. I found the following problems in the dev set's database_description files:

  1. Encoding Issue with .csv files: These files are said to be UTF-8 with BOM, but many contain bytes that are not valid UTF-8. For example, in the formula_1 database's qualifying table, the first row's value_description contains an invisible character:

    ... Sprint qualifying is essentially a short-form Grand Prix � a race that ...

    This causes the pandas csv reader to fail. Because the .csv files start with a BOM, other encodings such as latin1 or iso-8859-1 can't be used directly. The current workaround is to manually delete these invisible characters from multiple files.

  2. Schema Mismatch between .csv and .sqlite files: Some .csv files in database_description don't match the .sqlite table schema. For example, in the european_football_2 database, the table contains columns named home_player_<number>, which are absent from the .csv files. These files only contain home_player_X<number> and home_player_Y<number>, which are also not well described.

  3. Incorrect .csv File Names: Some .csv files have incorrect names. For example, in the card_games database, ruling.csv and Set_transactions.csv should be rulings.csv and Set_translations.csv, respectively.
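A check along these lines can surface the schema mismatch from item 2 automatically. This is a minimal sketch, not part of the BIRD tooling: the table name demo_table, the in-memory database, and the helper names are hypothetical stand-ins, and the described column list would in practice come from the original_column_name field of the matching description .csv.

```python
import sqlite3


def table_columns(conn, table):
    """Return the column names of a table as reported by SQLite."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    rows = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    return [row[1] for row in rows]


def compare_columns(db_columns, described_columns):
    """Report columns present on only one side of the comparison."""
    db, described = set(db_columns), set(described_columns)
    return {
        "missing_description": sorted(db - described),
        "not_in_database": sorted(described - db),
    }


# Hypothetical stand-in for a real .sqlite file and its description .csv.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE demo_table ("
    "id INTEGER, home_player_1 INTEGER, "
    "home_player_X1 INTEGER, home_player_Y1 INTEGER)"
)
described = ["id", "home_player_X1", "home_player_Y1"]

report = compare_columns(table_columns(conn, "demo_table"), described)
print(report)
```

The same set comparison, applied to table names versus .csv file names, would also flag misnamed files like those in item 3.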

Additionally, it seems that the leaderboard has not been updated since March. Now that GPT-4 is open, could the leaderboard be updated with GPT-4 standings so that we don't each have to run the scripts separately?

huybery commented 1 year ago

Thank you for your valuable suggestion! We just updated the GPT-4 results in the leaderboard.

accpatrick commented 1 year ago

@Magolor Thanks for your interest in our work. For the first issue, you can simply remove the BOM and parse the CSV as latin1. For the second issue, we just provide meaningful descriptions for columns; home_player_<number> is intuitive to understand. For the third issue, thanks for pointing it out! Please download the dev data again; we have updated the descriptions so that original_column_name matches the column names in the databases. Thanks
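The BOM-then-latin1 suggestion could look roughly like the sketch below. It uses only the standard library, the function name read_description_csv is hypothetical, and the sample bytes (including the stray 0x96 byte) are made up for illustration rather than copied from the dataset.

```python
import csv
import io

UTF8_BOM = b"\xef\xbb\xbf"


def read_description_csv(raw):
    """Parse description-CSV bytes: drop a leading UTF-8 BOM, decode as latin1."""
    if raw.startswith(UTF8_BOM):
        raw = raw[len(UTF8_BOM):]
    # latin1 maps every byte value to a code point, so decoding cannot fail,
    # although bytes written in another encoding may decode to odd characters.
    text = raw.decode("latin1")
    # newline="" leaves line endings untouched so the csv module can handle them.
    return list(csv.DictReader(io.StringIO(text, newline="")))


# Made-up sample: BOM, a header row, and one row with a non-UTF-8 byte.
sample = (
    UTF8_BOM
    + b"original_column_name,value_description\r\n"
    + b"qualifyId,short-form Grand Prix \x96 a race\r\n"
)

rows = read_description_csv(sample)
print(rows[0]["original_column_name"])
```

For a file on disk, the bytes could be read with open(path, "rb").read() and passed to the same function; the decoded text can also be handed to pandas through io.StringIO if a DataFrame is preferred.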

huybery commented 1 year ago

@Magolor GPT-4 results have been updated in https://bird-bench.github.io/