Hello,
I'm trying to reproduce the results of the QuestEval paper, but I'm only getting correlation scores of about 30 for Consistency and 20 for Fluency.
Could you please share some instructions about the correct setup?
The settings I'm using are described below:
Dataset: the SummEval dataset, which contains 1,600 human-annotated examples. Since each example is rated by three expert annotators, I average the three expert scores before computing the correlation.
Correlation Function: Pearson correlation.
Consistency and Weighter: I use neither of these components; both "do_consistency" and "do_weighter" are set to false.
Final scoring function: the original code uses the arithmetic mean, score = np.average([hyp_score, compared_score]), but I changed it to the harmonic mean, 2 * hyp_score * compared_score / (hyp_score + compared_score), as described in the paper.
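For reference, this is a minimal sketch of how I compute the correlation in step 1 and 2 above. The metric scores and expert ratings here are made-up toy values, not real SummEval data; I use numpy's corrcoef for Pearson correlation:

```python
import numpy as np

# Hypothetical toy values standing in for per-example QuestEval scores.
metric_scores = np.array([0.41, 0.55, 0.32, 0.70])

# SummEval provides three expert ratings per example (one row each).
expert_ratings = np.array([
    [3, 4, 3],
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
])

# Average the three expert scores per example, then take the
# Pearson correlation between metric scores and averaged ratings.
human_scores = expert_ratings.mean(axis=1)
r = np.corrcoef(metric_scores, human_scores)[0, 1]
```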
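And this is the scoring-function change from step 4, written out as a small helper (the function name and zero-denominator guard are my own additions, not from the QuestEval code):

```python
import numpy as np

def combined_score(hyp_score, compared_score):
    """Harmonic mean of the two directional scores.

    The original code used the arithmetic mean:
        np.average([hyp_score, compared_score])
    """
    denom = hyp_score + compared_score
    if denom == 0:
        # Guard against division by zero when both scores are 0.
        return 0.0
    return 2 * hyp_score * compared_score / denom
```

Unlike the arithmetic mean, the harmonic mean is pulled toward the lower of the two scores, e.g. combined_score(0.8, 0.2) gives 0.32 rather than 0.5.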
Thank you!