Open FUTUREEEEEE opened 12 months ago
This is just an intermediate result; we still need to evaluate the KBQA result with Retrieval.
Thanks for the clarification. After evaluation I got the following results, which seem to match the paper:
Number of questions: 1639
Average precision over questions: 0.785
Average recall over questions: 0.814
Average f1 over questions (accuracy): 0.783
F1 of average recall and average precision: 0.799
True accuracy (ratio of questions answered exactly correctly): 0.740
Hits@1 over questions: 0.829
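For reference, a minimal sketch of how per-question metrics like the ones above are typically computed for WebQSP-style evaluation. This is illustrative, not the repo's actual evaluation code; the function names and the exact tie-breaking for empty answer sets are assumptions.

```python
def prf1(pred, gold):
    """Per-question precision/recall/F1 over answer sets (sketch)."""
    pred_set, gold_set = set(pred), set(gold)
    if not pred_set and not gold_set:
        return 1.0, 1.0, 1.0  # assumed convention for empty/empty
    tp = len(pred_set & gold_set)
    p = tp / len(pred_set) if pred_set else 0.0
    r = tp / len(gold_set) if gold_set else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def evaluate(predictions, golds):
    ps, rs, f1s, exact, hits1 = [], [], [], 0, 0
    for pred, gold in zip(predictions, golds):
        p, r, f1 = prf1(pred, gold)
        ps.append(p); rs.append(r); f1s.append(f1)
        exact += set(pred) == set(gold)          # exactly-correct question
        hits1 += bool(pred) and pred[0] in gold  # top-1 answer is a gold answer
    n = len(golds)
    avg_p, avg_r = sum(ps) / n, sum(rs) / n
    return {
        "avg_precision": avg_p,
        "avg_recall": avg_r,
        "avg_f1": sum(f1s) / n,  # "Average f1 over questions (accuracy)"
        "f1_of_averages": 2 * avg_p * avg_r / (avg_p + avg_r) if avg_p + avg_r else 0.0,
        "true_accuracy": exact / n,
        "hits_at_1": hits1 / n,
    }
```

Note the distinction in the log above: "Average f1 over questions" averages each question's F1, while "F1 of average recall and average precision" computes F1 from the already-averaged P and R, so the two numbers generally differ.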
Regarding the evaluation process, I found that the format of some predicted answers does not match the labels. For example, for question "WebQTest-362", the predicted answer is ['1968-01-01 00:00:00'], while the labels are "'m.016j83', 'm.01gqg3', 'm.037pbp', 'm.03kz35', 'm.03r8xj', 'm.03xvj', 'm.03ymyvf', 'm.045...."
This will lead to a decrease in the final accuracy, right?
Yes, it should. The result of the generated query happens to be time-related, but the gold label is actually a collection of entities of the labeled type.
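One way to spot such type mismatches automatically is a rough heuristic that distinguishes Freebase machine IDs (the `m.` prefix seen in the labels above) from literal values like dates. This is a sketch I am assuming for diagnosis, not part of the repo:

```python
import re

# Freebase machine IDs look like "m.016j83"; anything else we treat as a literal.
MID = re.compile(r"^m\.[0-9a-z_]+$")

def answer_type(ans: str) -> str:
    """Classify an answer string as an entity MID or a literal (heuristic)."""
    return "entity" if MID.match(ans) else "literal"

def type_mismatch(pred, gold):
    """True if predicted and gold answers have no answer type in common."""
    return not ({answer_type(a) for a in pred} & {answer_type(a) for a in gold})
```

Flagging questions where `type_mismatch` is true makes it easy to count how much of the accuracy drop comes from this kind of format disagreement rather than from genuinely wrong answers.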
Hello, I have a question about the difference between "Train LLMs for Logical Form Generation" and "Beam-setting LLMs for Logical Form Generation" when fine-tuning the large model. Can I fine-tune directly with beam search?
"Train LLMs for Logical Form Generation" is the SFT phase, i.e. the training phase. "Beam-setting LLMs for Logical Form Generation" uses the SFT-trained model to perform beam search generation at inference time; it is not a training phase.
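Conceptually, the beam-setting phase just keeps the top-k highest-scoring candidate logical forms per question instead of a single greedy one. A toy, model-free sketch of beam search (the scorer here is a stand-in for the SFT model's token log-probabilities, not the repo's implementation):

```python
def beam_search(score_next, start, width, steps):
    """Generic beam search: at each step, expand every partial sequence and
    keep only the `width` highest log-probability candidates.

    `score_next(seq)` returns a list of (token, log_prob) continuations;
    in the real pipeline this role is played by the fine-tuned LLM.
    """
    beams = [(start, 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for seq, lp in beams:
            for tok, tok_lp in score_next(seq):
                candidates.append((seq + [tok], lp + tok_lp))
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:width]
    return beams  # top-`width` candidate sequences, best first
```

This also answers why fine-tuning "directly in beam search" does not quite make sense: SFT optimizes next-token likelihood against gold logical forms, while beam search is only a decoding strategy applied afterward to the trained model.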
Hello,
I am attempting to reproduce the results of ChatKBQA on the WebQSP dataset, and I have some confusion regarding the metrics used. Specifically, I am trying to determine which of the metrics provided in the repository correspond to the "F1 = 79.8, Hits@1 = 83.2, Acc = 73.8" reported in the paper's results.
In the repository, the following metrics are provided for the WebQSP dataset:
total: 1639
ex_cnt: 1026
ex_rate: 0.6259914582062233
real_ex_rate: 0.6424546023794615
contains_ex_cnt: 1227
contains_ex_rate: 0.74862721171446
real_contains_ex_rate: 0.7683155917345021
I would appreciate it if you could help me understand which of these metrics corresponds to the "F1 Hits@1 Acc" reported in the paper. This clarification will greatly assist me in accurately reproducing the results.
Thank you for your assistance.
Best regards,