PrimerAI / blanc

Human-free quality estimation of document summaries
MIT License

Reproducibility of Shannon score paper results #48

Closed. felipemaiapolo closed this issue 1 year ago.

felipemaiapolo commented 1 year ago

Dear authors,

Thank you very much for making your implementation available. I am trying to reproduce the results for the Shannon score and the Information Difference score on the SummEval benchmark. However, the results I am getting are much worse than the paper's (all correlations below 0.3). I am running the 1.6k source texts and their summaries through the class below.

Am I doing anything wrong? Thank you!

from blanc.blanc.shannon import Shannon
from tqdm import tqdm

class ShannonScorer:
    def __init__(self):
        # Instantiate the underlying model once, not on every call
        self.scorer = Shannon()

    def score(self, srcs, hyps, verbose=False):
        assert len(srcs) == len(hyps)

        def score_aux(doc, summ):
            # Log-likelihoods of the document: unconditioned (base), conditioned
            # on the summary (help), and conditioned on the document itself (full)
            ll_base, ll_help, ll_full, _, _, _ = self.scorer.go(doc, summ)
            shannon = (ll_help - ll_base) / (ll_full - ll_base)  # Shannon score
            infodif = ll_help - ll_base                          # Information Difference
            return [shannon, infodif]

        scores = []
        for i in tqdm(range(len(hyps)), disable=not verbose):
            scores.append(score_aux(srcs[i], hyps[i]))

        return scores
felipemaiapolo commented 1 year ago

I realized the paper reports system-level correlations, while I was computing something else, which explains the discrepancy.
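For anyone hitting the same issue, here is a minimal sketch of the distinction. The arrays below are purely illustrative (`metric_scores` and `human_scores` are made-up names and values, not SummEval data): summary-level correlation pairs up individual summary scores, while system-level correlation first averages each system's scores over all documents and then correlates one number per system.

```python
import numpy as np

# Hypothetical data: rows = summarization systems, columns = documents.
# metric_scores[i, j] is the metric's score for system i's summary of document j;
# human_scores[i, j] is the human rating of that same summary.
metric_scores = np.array([[0.20, 0.30, 0.25],
                          [0.40, 0.50, 0.45],
                          [0.10, 0.15, 0.20]])
human_scores = np.array([[3.0, 3.5, 3.2],
                         [4.0, 4.5, 4.2],
                         [2.5, 2.8, 3.0]])

# Summary-level: correlate all per-summary scores directly (one point per summary).
summary_level = np.corrcoef(metric_scores.ravel(), human_scores.ravel())[0, 1]

# System-level: average over documents first, then correlate the per-system
# means (one point per system). This is what the paper reports.
system_level = np.corrcoef(metric_scores.mean(axis=1),
                           human_scores.mean(axis=1))[0, 1]
```

Because averaging over documents smooths out per-summary noise, system-level correlations are typically much higher than summary-level ones, so comparing the two directly makes a metric look artificially weak.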