PrimerAI / blanc

Human-free quality estimation of document summaries
MIT License

Help understanding Shannon().go(...) #46

Open UntotaufUrlaub opened 1 year ago

UntotaufUrlaub commented 1 year ago

Hi, I just read the paper "Play the Shannon Game With Language Models: A Human-Free Approach to Summary Evaluation". Thanks for making the metric accessible!

Could you please help me understand the Shannon().go(...) function? What are measure_t and measure_summ used for? What are the components of the return value? Do they include ShannonScore, InformationDifference, and BlancShannon?

I tried to inspect the code, but I didn't fully understand it yet. My best guess at the moment is that the return value consists of the components used to calculate ShannonScore and InformationDifference. If that's correct: how do I use them to get the final scores?

Kind regards.

OlegVasilyev4096 commented 1 year ago

Thanks for your interest in these measures. The default output of Shannon().go(...) gives everything necessary via the first four elements of the return: ll_base, ll_help, ll_full, S.

In terms of the paper, I(D) is given by -ll_base, I(D|S) is given by -ll_help, and I(D|D) is given by -ll_full. Accordingly:

- Shannon Score: (ll_help - ll_base) / (ll_full - ll_base)
- Information Difference: ll_help - ll_base
- BLANC-Shannon: (S[0][1] - S[1][0]) / (S[0][0] + S[0][1] + S[1][0] + S[1][1])

These correspond to the equations in Section 2.2 of the paper (BLANC-Shannon is similar to the original BLANC, as explained in Section 5.3). The options measure_t and measure_summ were there only for auxiliary explorations.
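For convenience, that arithmetic can be wrapped in a small helper. This is only an illustrative sketch based on the description in this thread; the name shannon_scores and the dict keys are not part of the blanc package.

```python
# Illustrative helper (not part of the blanc package): combines the first
# four elements returned by Shannon().go(...) into the three final scores,
# following the formulas described above.

def shannon_scores(ll_base, ll_help, ll_full, S):
    total = S[0][0] + S[0][1] + S[1][0] + S[1][1]
    return {
        # relative information gain from the summary, normalized
        "ShannonScore": (ll_help - ll_base) / (ll_full - ll_base),
        # absolute log-likelihood gain from conditioning on the summary
        "InfoDiff": ll_help - ll_base,
        # off-diagonal asymmetry of the 2x2 success-count matrix S
        "BlancShannon": (S[0][1] - S[1][0]) / total,
    }
```

With this, something like ll_base, ll_help, ll_full, S, _, _ = Shannon().go(doc, summ) followed by shannon_scores(ll_base, ll_help, ll_full, S) would yield all three final scores at once.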

UntotaufUrlaub commented 1 year ago

Thanks for helping!

I am trying to follow this procedure, but my return values look suspicious. Could you please confirm my setup and result? Code:

from blanc import Shannon

estimator = Shannon()
ll_base, ll_help, ll_full, S, _, _ = estimator.go(
    'Jack drove his minivan to the bazaar to purchase milk and honey for his large family',
    'Jack bought milk and honey from the bazaar')

print(f"ll_base: {ll_base}, ll_help: {ll_help}, ll_full: {ll_full}, S: {S}")
print({
    "ShannonScore": (ll_help - ll_base) / (ll_full - ll_base),
    "InfoDiff": ll_help - ll_base,
    "BlancShannon": (S[0][1] - S[1][0]) / (S[0][0] + S[0][1] + S[1][0] + S[1][1]),
})

Result:

ll_base: -204.98789831962313, ll_help: -204.98783949688476, ll_full: -204.98785211787543, S: [[18, 0], [0, 0]]
{'ShannonScore': 1.2731712825592862, 'InfoDiff': 5.8822738367325655e-05, 'BlancShannon': 0.0}

I did not expect ll_base, ll_help, and ll_full to be nearly identical. Shouldn't there be more of a difference? Also, BlancShannon was 0 for all the inputs I tried (about 20).

OlegVasilyev4096 commented 1 year ago

Hi, thanks for following up on this. By the way, I noticed one inconvenience in the current Shannon: it always requires CUDA. I will change that today or tomorrow. But regardless of that, I copied your code snippet above exactly and got the following:

ll_base: -73.96030467789531, ll_help: -52.87288321110423, ll_full: -11.923419410366943, S: [[8, 4], [1, 5]]
{'ShannonScore': 0.3399174761249459, 'InfoDiff': 21.087421466791078, 'BlancShannon': 0.16666666666666666}

Could you, just in case, specify the system you ran your code on? It is very puzzling.

UntotaufUrlaub commented 1 year ago

Thanks! At the moment my guess is that the issue is caused by the Python version. Running on 3.11 or 3.9 gives me the output above; on 3.6 I get the results that you reported. Could you check this?

OlegVasilyev4096 commented 1 year ago

Thanks! Indeed, it gives these two different results in different environments for GPT2 (the default model). It seems that in your result S: [[18, 0], [0, 0]] all successes become non-successes, 18 altogether, while the normal result is a mix of successes and non-successes: S: [[8, 4], [1, 5]] (8+4+1+5=18). It is as if the model cannot predict any tokens correctly. I am checking this. (The problem is probably specific to gpt2, because for gpt1 and for xlm I get reasonable results that coincide across environments.)
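The degenerate matrix also explains the zero score directly: BLANC-Shannon depends only on the off-diagonal cells of S, and both are 0 in the broken run. A quick sanity check of the formula from earlier in this thread:

```python
# Sanity check: BLANC-Shannon for the two S matrices seen in this thread.
def blanc_shannon(S):
    # (S[0][1] - S[1][0]) over the total token count
    return (S[0][1] - S[1][0]) / (S[0][0] + S[0][1] + S[1][0] + S[1][1])

print(blanc_shannon([[18, 0], [0, 0]]))  # broken run: 0.0
print(blanc_shannon([[8, 4], [1, 5]]))   # normal run: 0.16666666666666666
```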

OlegVasilyev4096 commented 1 year ago

The difference was due to the gpt2 function prepare_inputs_for_generation(), which differed between versions. A 'good' copy of it is now included directly in the code, separate from gpt2. I updated the package and now get the same results everywhere. See if this works for you. Thanks again for catching this.

UntotaufUrlaub commented 1 year ago

Hi, I am trying to verify that I get the same results everywhere. I just figured out that in the Python 3.6 case I had blanc v0.2.7 installed, since newer versions rely on a scipy version that, as far as I can see, is not available for Python 3.6. Is this the same for you? If not, please share how to install the newer version under Python 3.6.

I used a script to evaluate one example (extracted from AggreFact). Code:

from blanc import Shannon

import nltk
nltk.download('punkt')

estimator = Shannon()

def shannon(summ, doc):
    ll_base, ll_help, ll_full, S, _, _ = estimator.go(doc, summ)
    return str({
        "ShannonScore": (ll_help - ll_base) / (ll_full - ll_base),
        "InfoDiff": ll_help - ll_base,
        "BlancShannon": (S[0][1] - S[1][0]) / (S[0][0] + S[0][1] + S[1][0] + S[1][1]),
        "ll_base": ll_base,
        "ll_help": ll_help,
        "ll_full": ll_full,
        "S": S,
    })

summ_example = """elena curtin of portland was seven-months pregnant when she was charged with second-degree assault after she struck her boyfriend 's ex in the head and arm with a crowbar in november 2014 . 
    she was set to go to trial in multnomah county circuit court this week , but prosecutors dropped the charge on monday because curtin was ` completely justified in her outrage ` 
    curtin 's defense attorney said the ex-girlfriend was jealous and bitter that the boyfriend was in a relationship with his client ."""

doc_example = """an oregon woman who came home and beat her boyfriend 's former girlfriend with a crowbar after finding her getting high on heroin in her bathroom will no longer face prosecution .
    elena curtin of portland was seven-months pregnant when she was charged with second-degree assault after she struck her boyfriend 's ex in the head and arm with a crowbar in november 2014 .
    she was set to go to trial in multnomah county circuit court this week , but prosecutors dropped the charge on monday because curtin , 23 , was ` completely justified in her outrage ' .
    elena curtin of portland , oregon , was seven-months pregnant when she was charged with second-degree assault after she struck her boyfriend 's ex in the head and arm with a crowbar in november 2014
    curtin gave birth in january .
    when curtin came home , she found her boyfriend 's ex-girlfriend shooting heroin while sitting on her toilet , oregonian/oregonlive reported .
    when she asked her to leave and the woman refused , curtin beat her with a crowbar .
    oregon law allows for use of physical force against an intruder who wo n't leave a resident 's home .
    curtin 's defense attorney , casey kovacic , said the ex-girlfriend was ` jealous and bitter ' that the boyfriend was in a relationship with his client .
    the boyfriend , who is the father of curtin 's child , struggled with heroin in the past .
    the ex and the boyfriend had gotten high at the apartment before and they have a five-year-old child .
    they have since reconciled and curtin is now living with her parents .
    the boyfriend has not been a part of his new child 's life .
    kovacic wrote in an email : ` in the two years leading up to this incident , [ the ex ] made it her personal mission to make elena 's life miserable .
    ` she routinely harassed and threatened to hurt elena , stole from her , and cruelly plotted to drag [ the boyfriend ] back into addiction .
    ` that 's one silver lining - she 's [ curtin ] been able to examine how bad ( her relationship ) was and move on with her life . '
    if she had gone to trial and been convicted , curtin would have faced received a mandatory prison sentence of almost six years .
    curtin was charged after coming home and finding her boyfriend 's ex-girlfriend shooting heroin in her bathroom"""

print(shannon(summ_example, doc_example))

I ran

docker container run --rm --gpus all -v .../test_shannon.py:/test_shannon.py python:3.6 bash -c "pip install scipy blanc; python test_shannon.py"

which installed blanc 0.2.7 and got the following result:

{'ShannonScore': 0.4550786707000613, 'InfoDiff': 927.4927780728983, 'BlancShannon': 0.2693798449612403, 'll_base': -2300.84119757175, 'll_help': -1373.3484194988516, 'll_full': -262.74792318007593, 'S': [[249, 148], [9, 110]]}

Then I ran

docker container run --rm --gpus all -v .../test_shannon.py:/test_shannon.py python:3.11 bash -c "pip install scipy blanc; python test_shannon.py"

which installed blanc 0.3.1 and got the following result:

{'ShannonScore': 0.46075807542587466, 'InfoDiff': 941.4941014983935, 'BlancShannon': 0.27734375, 'll_base': -2322.0597345244287, 'll_help': -1380.5656330260351, 'll_full': -278.7008620224572, 'S': [[249, 150], [8, 105]]}

Both results seem to be in a reasonable range, but I am surprised that they are not identical. Is this to be expected because of the changes from version 0.2.7 to 0.3.0?

At the moment I have no idea how to compare v0.3.0 and v0.2.7 directly, as I don't know how to run v0.3.0 without the bug. If you have any idea, let me know and I will try to confirm your result.
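One possible way to compare the two package versions directly would be to pin each release explicitly in otherwise identical environments, so that only the blanc version differs between the runs. This is only a sketch: whether blanc==0.2.7 and its dependencies still resolve under the same recent Python interpreter is an assumption to verify, and test_shannon.py stands for the script above.

```shell
# Sketch: two virtual environments under one Python interpreter, differing
# only in the pinned blanc release. Any output difference is then due to
# the package version alone, not the Python version.
python -m venv env_027 && env_027/bin/pip install scipy "blanc==0.2.7"
python -m venv env_031 && env_031/bin/pip install scipy "blanc==0.3.1"
env_027/bin/python test_shannon.py
env_031/bin/python test_shannon.py
```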