IDEA-FinAI / ToG

This is the official GitHub repo of Think-on-Graph. If you are interested in our work or would like to join our research team in Shenzhen, please feel free to contact us by email (xuchengjin@idea.edu.cn).

Low output score - reasoning output written to final output #29

Open · devishree23 opened this issue 1 week ago

devishree23 commented 1 week ago

I am trying to reproduce the results from the paper. I am using the Llama3 70B GPTQ model on the WebQSP dataset with the Freebase KG. However, I am getting much lower results than the ones reported in the paper: an exact match score of just 0.189.
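
For reference, this is roughly how we compute exact match in our own error analysis (our own helper, not the repo's evaluation script): a prediction counts as a hit if any gold answer string appears in it after light normalization.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop articles and punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """Hit if any normalized gold answer is contained in the normalized prediction."""
    pred = normalize(prediction)
    return any(normalize(gold) in pred for gold in gold_answers)


# A prediction of just "yes" can never match the gold entity, which is
# consistent with the low score we observed.
print(exact_match("yes", ["Jamaican English"]))                               # False
print(exact_match("The answer is Jamaican English.", ["Jamaican English"]))   # True
```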

One reason for the gap could be the LLM we used, but based on the error analysis we performed, it also looks like the raw reasoning output of the LLM is being written to the final answer output. Is this by design, or is it a bug? Most of the reasoning output is just "yes" or "no" and does not contain the answer to the question, yet in the reasoning chains we can see the required answer being derived from the KG. A sketch of the kind of guard we had expected is below.
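
To illustrate what we mean, the sketch below shows the check we would expect before the final answer is written out. The function and field names are hypothetical and not taken from the ToG code; this is just how we understand the intended behavior.

```python
def select_final_answer(reasoning_output: str, reasoning_chain: list[str]) -> str:
    """Hypothetical guard: avoid writing a bare yes/no verdict as the final answer.

    `reasoning_output` is what the LLM returned in the final reasoning step;
    `reasoning_chain` is the list of KG triples/paths explored earlier.
    """
    verdict = reasoning_output.strip().lower().rstrip(".")
    if verdict in {"yes", "no"}:
        # The verdict only says whether the chain is sufficient; the actual
        # answer entity has to come from the reasoning chain instead.
        return extract_answer_from_chain(reasoning_chain)
    return reasoning_output


def extract_answer_from_chain(reasoning_chain: list[str]) -> str:
    """Placeholder extraction: in our error analysis the answer entity appears
    in the tail of the last triple, e.g. "(Jamaica, language_spoken, Jamaican English)"."""
    if not reasoning_chain:
        return ""
    last_triple = reasoning_chain[-1]
    return last_triple.rsplit(",", 1)[-1].strip(" ()")
```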

Please let us know your thoughts. Any help would be appreciated. Thank you!