We use the F1 score as the primary evaluation metric in our study, computed by comparing
the model's predicted answers to the gold-standard answers. In addition to F1, we report
the Exact Match metric. However, unlike previous studies that measure Exact Match on the
logical form, we assess it as the exact match between the predicted and gold answer sets.
Lastly, we evaluate the Executability of the action sequences generated by the model. If the
model's action sequence produces any set of answers when executed, it scores 1.0 for Executability;
if it fails to produce an answer, it scores 0.0.
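To make the three metrics concrete, the sketch below shows one plausible implementation over answer sets. The function names, the handling of empty sets, and the use of None to represent a failed execution are our illustrative assumptions, not details specified above.

```python
from typing import Optional, Set


def f1_score(predicted: Set[str], gold: Set[str]) -> float:
    """Set-level F1 between the predicted and gold answer sets."""
    if not predicted and not gold:
        return 1.0  # both empty: treated as a perfect match (an assumption)
    if not predicted or not gold:
        return 0.0
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def exact_match(predicted: Set[str], gold: Set[str]) -> float:
    """1.0 iff the predicted answer set equals the gold set exactly."""
    return 1.0 if predicted == gold else 0.0


def executability(execution_result: Optional[Set[str]]) -> float:
    """1.0 if executing the action sequence produced any answers.

    A failed execution is represented here as None; scoring an empty
    result set as 0.0 is our interpretive choice.
    """
    return 1.0 if execution_result else 0.0
```

Per-question scores from these functions would then typically be averaged over the evaluation set to obtain the reported metrics.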