Issues about evaluate WikiTQ

Leolty / tablellm

Apache License 2.0

Issues about evaluate WikiTQ #1

Open Hanzhang-lang opened 3 months ago

Hanzhang-lang commented 3 months ago

I have a question about the evaluation code in this repository. In the TQA dataset, if the gold answer is multi-hop (that is, it has more than one answer), can the answers be separated by commas so that a prediction is evaluated against multiple answers? In DATER, though, I found that a single answer is used for all evaluation. https://github.com/Leolty/tablellm/blob/aef85050f522900fd70920c2b7427a383e3066ab/utils/eval.py#L234C5-L238C1

Leolty commented 3 months ago

Yes, I believe that's correct. The eval code is mostly taken from the official evaluation code of TQA. IIRC, the goal is to convert the comma-separated string into a list and then compare the lists in a way that keeps the order from affecting the result.
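
For illustration, the idea described above (split the comma-separated string into a list, then compare without regard to order) might look roughly like the sketch below. This is not the code from `utils/eval.py` or from the official TQA evaluator; the function names `split_answers` and `answers_match` are hypothetical, and the normalization (strip whitespace, lowercase) is an assumption made only for this example.

```python
from collections import Counter
from typing import List


def split_answers(answer: str) -> List[str]:
    """Split a comma-separated answer string into a list of items.

    Assumption: items are normalized by stripping whitespace and
    lowercasing; empty items (e.g. from a trailing comma) are dropped.
    """
    return [part.strip().lower() for part in answer.split(",") if part.strip()]


def answers_match(prediction: str, gold: str) -> bool:
    """Return True when both strings contain the same items in any order.

    Counter compares the two lists as multisets, so order is ignored
    while repeated items still have to appear the same number of times.
    """
    return Counter(split_answers(prediction)) == Counter(split_answers(gold))


if __name__ == "__main__":
    # Same items, different order and casing
    print(answers_match("Paris, London", "london, paris"))
    # Prediction is missing one of the gold items
    print(answers_match("Paris", "london, paris"))
```

One caveat with this kind of splitting: a single answer that itself contains a comma (for example a number written as `1,000`) would be broken into two items, so a real evaluator likely needs extra handling for that case.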