LHRLAB / ChatKBQA

[ACL 2024] Official resources of "ChatKBQA: A Generate-then-Retrieve Framework for Knowledge Base Question Answering with Fine-tuned Large Language Models".
https://aclanthology.org/2024.findings-acl.122
MIT License

Clarification on Metrics in ChatKBQA Results Reproduction #5

Open FUTUREEEEEE opened 10 months ago

FUTUREEEEEE commented 10 months ago

Hello,

I am attempting to reproduce the results of ChatKBQA on the WebQSP dataset, and I have some confusion regarding the metrics used. Specifically, I am trying to determine which of the metrics printed by the repository correspond to the F1 / Hits@1 / Acc values (79.8 / 83.2 / 73.8) reported in the paper.

In the repository, the following metrics are printed for the WebQSP dataset:

total: 1639
ex_cnt: 1026
ex_rate: 0.6259914582062233
real_ex_rate: 0.6424546023794615
contains_ex_cnt: 1227
contains_ex_rate: 0.74862721171446
real_contains_ex_rate: 0.7683155917345021

I would appreciate it if you could help me understand which of these metrics corresponds to the "F1 Hits@1 Acc" reported in the paper. This clarification will greatly assist me in accurately reproducing the results.

Thank you for your assistance.

Best regards,

LHRLAB commented 10 months ago

This is just an intermediate result; the final KBQA result still needs to be evaluated after the retrieval step.
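
For reference, these intermediate numbers are just ratios over the 1639 test questions. A minimal sanity-check sketch, assuming ex_cnt counts questions whose top-1 generated logical form exactly matches the gold one and contains_ex_cnt counts questions where any beam candidate matches (that reading is an assumption, not something stated in this thread):

```python
# Sanity check of the intermediate ratios posted above.
# NOTE (assumption): ex_cnt is read here as top-1 exact logical-form
# matches, and contains_ex_cnt as "any beam candidate matches".
total = 1639
ex_cnt = 1026
contains_ex_cnt = 1227

print(ex_cnt / total)           # ~0.6260 -> ex_rate
print(contains_ex_cnt / total)  # ~0.7486 -> contains_ex_rate
```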

FUTUREEEEEE commented 9 months ago

Thanks for the clarification. After running the evaluation I got the following results, which seem to match the paper:

Number of questions: 1639
Average precision over questions: 0.785
Average recall over questions: 0.814
Average f1 over questions (accuracy): 0.783
F1 of average recall and average precision: 0.799
True accuracy (ratio of questions answered exactly correctly): 0.740
Hits@1 over questions: 0.829
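
For anyone else reproducing this, here is a minimal sketch of how such set-based metrics can be computed. This is an illustration over hypothetical `predictions`/`golds` inputs (lists of answer lists, predictions ordered by rank), not the repo's actual evaluation script:

```python
# Minimal sketch of set-based KBQA metrics; not the repo's eval script.
def prf1(pred, gold):
    pred_s, gold_s = set(pred), set(gold)
    tp = len(pred_s & gold_s)
    p = tp / len(pred_s) if pred_s else 0.0
    r = tp / len(gold_s) if gold_s else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def evaluate(predictions, golds):
    n = len(golds)
    ps, rs, f1s, hits, exact = [], [], [], [], []
    for pred, gold in zip(predictions, golds):
        p, r, f1 = prf1(pred, gold)
        ps.append(p); rs.append(r); f1s.append(f1)
        # Hits@1: the top-ranked prediction appears in the gold set.
        hits.append(1.0 if pred and pred[0] in set(gold) else 0.0)
        # True accuracy: the whole predicted set matches exactly.
        exact.append(1.0 if set(pred) == set(gold) else 0.0)
    return {
        "avg_precision": sum(ps) / n,
        "avg_recall": sum(rs) / n,
        "avg_f1": sum(f1s) / n,      # "Average f1 over questions"
        "hits@1": sum(hits) / n,
        "true_accuracy": sum(exact) / n,
    }
```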

Regarding the evaluation process, I found that the format of some predicted answers does not match the labels. For example, for question "WebQTest-362", the predicted answer is ['1968-01-01 00:00:00'], while the labels are "'m.016j83', 'm.01gqg3', 'm.037pbp', 'm.03kz35', 'm.03r8xj', 'm.03xvj', 'm.03ymyvf', 'm.045...."

This will lead to a decrease in the final accuracy, right?


LHRLAB commented 9 months ago

Yes, it should. The query's output appears to be time-valued, but in reality the gold answer is a collection of entities of the label type.
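
One way to surface such cases during evaluation is to flag questions where one side consists of literals while the other consists of Freebase entity MIDs. A hypothetical helper (not part of this repo):

```python
import re

# Hypothetical helper: flag answer-type mismatches between
# predictions and gold labels. Freebase MIDs look like "m.016j83"
# (or "g." prefixed for some graph IDs).
MID_RE = re.compile(r"^(m|g)\.[0-9a-z_]+$")

def is_entity(ans):
    return bool(MID_RE.match(ans))

def type_mismatch(pred, gold):
    """True when one side is all literals (dates, numbers, strings)
    and the other is all entity MIDs, as in the WebQTest-362 case."""
    if not pred or not gold:
        return False
    return all(map(is_entity, pred)) != all(map(is_entity, gold))

# The example from above: a date literal predicted where entities
# are expected.
print(type_mismatch(['1968-01-01 00:00:00'],
                    ['m.016j83', 'm.01gqg3']))  # True
```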

WangYQ999 commented 3 months ago

Hello, I have a question about the difference between "Train LLMs for Logical Form Generation" and "Beam-setting LLMs for Logical Form Generation" when fine-tuning the large model. Can I fine-tune directly with beam search?

LHRLAB commented 3 months ago

Training LLMs for Logical Form Generation is the SFT phase, i.e., the training phase. Beam-setting LLMs for Logical Form Generation uses the SFT-trained model to generate logical forms with beam search; it is an inference step, not a training phase.
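
In other words, the beam setting only changes inference. A minimal sketch with Hugging Face transformers, where the checkpoint path, prompt, and beam size are placeholders rather than the repo's exact configuration:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path: point this at your SFT-trained checkpoint.
ckpt = "path/to/sft_checkpoint"
tok = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt)

prompt = "Generate a logical form for: what is the name of justin bieber brother"
inputs = tok(prompt, return_tensors="pt")

# Beam search at inference time: no weights are updated here.
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    num_beams=8,             # beam size (placeholder value)
    num_return_sequences=8,  # keep all beam candidates
)
candidates = [tok.decode(o, skip_special_tokens=True) for o in outputs]
```

Keeping all returned candidates is what lets the later retrieval step mentioned above pick the best executable logical form.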