Open Hanzhang-lang opened 3 months ago
A question about the evaluation code in this repository. In the TQA dataset, if the gold answer is multi-hop (more than one answer), can it be split on commas so that the prediction is compared against multiple answers? In DATER, though, I found that a single answer is used for all evaluation. https://github.com/Leolty/tablellm/blob/aef85050f522900fd70920c2b7427a383e3066ab/utils/eval.py#L234C5-L238C1
Yes, I believe that's correct. The eval code is mostly taken from the official TQA evaluation code. IIRC, the goal is to convert the comma-separated string into a list and then compare the lists so that the order doesn't affect the result.
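For anyone reading along, here is a minimal sketch of the order-insensitive comparison discussed in this thread. It is not the code at the linked lines: the function names `split_answers` and `answers_match` are hypothetical, and the whitespace stripping and lowercasing are assumptions made for illustration.

```python
from typing import List


def split_answers(text: str) -> List[str]:
    """Split a comma-separated answer string into a sorted list of items.

    Assumption: items are stripped and lowercased before comparison.
    Empty items (e.g. from a trailing comma) are dropped.
    """
    items = (item.strip().lower() for item in text.split(","))
    return sorted(item for item in items if item)


def answers_match(prediction: str, gold: str) -> bool:
    """Return True if both strings contain the same items, in any order."""
    return split_answers(prediction) == split_answers(gold)


if __name__ == "__main__":
    print(answers_match("b, a", "a,b"))
    print(answers_match("a", "a, b"))
```

Sorting the lists keeps duplicates significant while ignoring order. One caveat with this approach is that an answer which itself contains a comma (for example a number written as "1,000") would be split into two items.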