GAIR-NLP / factool

FacTool: Factuality Detection in Generative AI
https://ethanc111.github.io/factool_website/
Apache License 2.0

How are the metrics: acc, precision, recall of claim level calculated? #35

Open LeiyanGithub opened 10 months ago

LeiyanGithub commented 10 months ago

How are the claim-level metrics (accuracy, precision, recall, etc.) calculated? The claims extracted by the model will inevitably differ from run to run, so computing these metrics without manual evaluation seems challenging. Are the metrics here computed against the ground truth annotated in the dataset? My understanding is that the claims are given in advance and the correctness of each claim is known from its label in the dataset; evidence is then collected for each given claim, the claim is verified, and the predicted label is compared with the ground truth to perform the factual evaluation. I don't know if this understanding is correct. Isn't this part of the code missing from the repo?

[screenshot attached]
EthanC111 commented 10 months ago

Hi @LeiyanGithub, thank you for your interest in our paper and for reaching out!

You've understood correctly. We used the dataset that we constructed for our experiments. The claims in this dataset are extracted from ChatGPT (gpt-3.5-turbo) responses. The dataset contains both claim-level and response-level annotations for each sample. To obtain the score for each metric (accuracy, recall, precision, and F1 score), we compare the labels predicted by Factool with the dataset's annotations.
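For concreteness, here is a minimal sketch of how such a comparison could be scored, assuming the predicted and annotated claim-level labels are available as parallel boolean lists (the variable names and label format are illustrative, not Factool's actual output schema):

```python
# Minimal sketch: scoring claim-level predictions against dataset annotations.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# gold[i] is the annotated factuality of claim i (True = factual),
# pred[i] is the label Factool assigned to the same claim.
gold = [True, False, True, True]
pred = [True, False, False, True]

print("accuracy :", accuracy_score(gold, pred))
print("precision:", precision_score(gold, pred))
print("recall   :", recall_score(gold, pred))
print("f1       :", f1_score(gold, pred))
```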

Let me know if you have more questions!

LeiyanGithub commented 10 months ago

Thank you so much for your patience and detailed explanation. I am not sure whether the metric evaluation code is present in the repo. Would it be possible to publish this part of the code?

LeiyanGithub commented 10 months ago

Why don't the results reproduced on the KB-QA dataset match? After running the 233 claims in the dataset, all metrics at both the claim level and the response level are lower than those reported in the paper.

[screenshot attached]
EthanC111 commented 10 months ago

Hi @LeiyanGithub, thanks for asking! I will upload our results and the metric evaluation code in the next few days. However, I want to apologize in advance: I am currently finalizing another project, so I might not be able to upload it immediately.

In general, one of the main reasons the results might not match is that the APIs are constantly changing (both gpt-3.5-turbo and serper). It's also possible that the result returned by gpt-3.5-turbo is a null value, which could be due to issues like OpenAI's unstable API calls or rate limits. For this part, you might need to run it again to ensure it's not a null value. It can be a bit challenging to determine why the results are different based on a screenshot alone, but thank you for sharing it! If you could provide more details, I could offer more support.
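As a rough illustration of the re-running step, something like the retry wrapper below can help guard against null responses; `query_model` here is just a placeholder for whichever function wraps the gpt-3.5-turbo or serper call, not a function from the Factool codebase:

```python
import time

def run_with_retries(query_model, *args, max_retries=3, backoff_s=5, **kwargs):
    """Re-run a flaky API call until it returns a non-null result.

    `query_model` is a placeholder for whatever function wraps the
    gpt-3.5-turbo or serper request; it is not part of Factool itself.
    """
    for attempt in range(1, max_retries + 1):
        result = query_model(*args, **kwargs)
        if result is not None:
            return result
        # Null result: likely a rate limit or transient API failure; wait and retry.
        time.sleep(backoff_s * attempt)
    raise RuntimeError(f"Still getting a null result after {max_retries} retries")
```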

I usually recommend using GPT-4 as the foundation model for KB-QA. As you can see in the paper, due to the limited reasoning capability of GPT-3.5, Factool powered by GPT-3.5 is not the best option. Factool powered by GPT-4 offers a significantly better user experience and is generally the default choice for Factool.
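If it helps, the snippet below shows roughly how Factool can be run with GPT-4, following the usage pattern in the repo README (please check the current README for the exact interface, as the details may have changed):

```python
from factool import Factool

# Instantiate Factool with GPT-4 as the foundation model
# (usage pattern taken from the repo README; verify against the current version).
factool_instance = Factool("gpt-4")

inputs = [
    {
        "prompt": "Who is the president of the United States?",
        "response": "Joe Biden is the president of the United States.",
        "category": "kbqa",
    },
]
results = factool_instance.run(inputs)
print(results)
```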

Let me know if you have more questions!