IBM / unitxt

🦄 Unitxt: a python library for getting data fired up and set for training and evaluation
https://www.unitxt.ai
Apache License 2.0

Different rouge scores in 1.11.1 and main #1078

Open yoavkatz opened 1 month ago

yoavkatz commented 1 month ago

Following the Rouge updates, there are changes in rouge scores. In most cases the diff is less than 1 point, but it can be 1-2 points in extreme cases. It is not clear why the new implementation should cause any diff.

Code:

from unitxt import get_logger
from unitxt.api import evaluate, load_dataset
from unitxt.blocks import TaskCard
from unitxt.collections_operators import Wrap
from unitxt.inference import (
    HFPipelineBasedInferenceEngine,
)
from unitxt.loaders import LoadFromDictionary
from unitxt.text_utils import print_dict

logger = get_logger()

dataset = load_dataset(card="cards.xsum", template_card_index=0, loader_limit=10)
test_dataset = dataset["test"]

# Infer using flan t5 base using HF API
model_name = "google/flan-t5-base"
inference_model = HFPipelineBasedInferenceEngine(
    model_name=model_name, max_new_tokens=32
)

predictions = inference_model.infer(test_dataset)
evaluated_dataset = evaluate(predictions=predictions, data=test_dataset)

# Print results
for instance in evaluated_dataset:
    print_dict(
        instance,
        keys_to_print=[
            "source",
            "prediction",
            "processed_prediction",
            "references",
            "score",
        ],
    )

1.11.1:

global:
    rouge1 (float64): 0.287309678405052
    rouge2 (float64): 0.08183079296761228
    rougeL (float64): 0.20875488491798488
    rougeLsum (float64): 0.20666857055062154
    score (float64): 0.20875488491798488
    score_name (str): rougeL
instance:
    rouge1 (float64): 0.19047619047619052
    rouge2 (float64): 0.05
    rougeL (float64): 0.19047619047619052
    rougeLsum (float64): 0.19047619047619052
    score (float64): 0.19047619047619052
    score_name (str): rougeL

main:

global:
    rougeL (float64): 0.20879996542849094
    score (float64): 0.20879996542849094
    score_name (str): rougeL
    rouge1 (float64): 0.28802664396739114
    rouge2 (float64): 0.08172129913073843
    rougeLsum (float64): 0.20879996542849094
    rougeL_ci_low (float64): 0.15700219128325088
    rougeL_ci_high (float64): 0.2718109259051072
    score_ci_low (float64): 0.15700219128325088
    score_ci_high (float64): 0.2718109259051072
    rouge1_ci_low (float64): 0.23669188547490233
    rouge1_ci_high (float64): 0.34410005760392737
    rouge2_ci_low (float64): 0.04442823342518798
    rouge2_ci_high (float64): 0.13301823219319187
    rougeLsum_ci_low (float64): 0.15700219128325088
    rougeLsum_ci_high (float64): 0.2718109259051072
instance:
    rouge1 (float): 0.19047619047619052
    rouge2 (float): 0.05
    rougeL (float): 0.19047619047619052
    rougeLsum (float): 0.19047619047619052
    score (float): 0.19047619047619052
    score_name (str): rougeL

yoavkatz commented 1 month ago

@dafnapension - If possible, please give this priority, because we want to make a new release this week.

dafnapension commented 1 month ago

Hi @yoavkatz, will gladly do. A difference is expected, since the older version (global HF) with use_aggregator=True returned a bootstrapped score, whereas now we return a simple average of the instance scores. But I will verify this and try to understand why the diff is so big.

yoavkatz commented 1 month ago

Hi Dafna. Thanks. Instead of calling the HF inference engine, you can just copy the "target" of instance i to the prediction of instance i+1.

This will simulate a model and should solve your problem.

yoavkatz commented 1 month ago

See example here

https://github.com/IBM/unitxt/blob/c2fc7ab4caeac1e48d523a34cc34a0cdcc597d16/examples/evaluate_llm_as_judge.py#L43
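
For reference, a rough sketch of that trick (reusing test_dataset and evaluate from the reproduction script above; the "target" field is assumed to be present in the processed instances, as in the linked example):

# Use the gold target of instance i as the prediction for instance i+1,
# so the metric can be exercised without running an inference engine.
targets = [instance["target"] for instance in test_dataset]
predictions = targets[-1:] + targets[:-1]  # shift by one (the first prediction wraps around to the last target)
evaluated_dataset = evaluate(predictions=predictions, data=test_dataset)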

dafnapension commented 1 month ago

Thanks, @yoavkatz, I did manage to run something, and at least came out with some references and predictions:

predictions = ['Prisoners in Wales are facing a "desperate need" for one-bedroom flats, a charity has said.', 'A man has been charged with armed robbery after a man was arrested in Edinburgh.', 'Four teenagers have been charged with hate crimes after a white man was beaten and beaten in a Chicago court.', 'West Bromwich Albion have appointed former Arsenal goalkeeper Mark Hughes as their new director of football.', 'A fasting diet that mimics famine and famine has been shown to reverse the symptoms of type 1 and type 2 diabetes.', 'The merger between two major European manufacturers of spectacle frames and lenses is a big deal.', 'Wendy Houvenaghel has said she felt "vindicated" by British Cycling\'s failures in the World Class Programme.', 'The success of comedy clubs in the US is largely due to the fact that people are willing to laugh.', 'BT\'s shares were up 3% on Thursday after the company\'s chief executive, a former Ofcom executive, said the company was "not', 'Brendan Rodgers says he is looking forward to his first Old Firm derby with Celtic on Saturday.']

references = [['There is a "chronic" need for more housing for prison leavers in Wales, according to a charity.'], ['A man has appeared in court after firearms, ammunition and cash were seized by police in Edinburgh.'], ['Four people accused of kidnapping and torturing a mentally disabled man in a "racially motivated" attack streamed on Facebook have been denied bail.'], ['West Brom have appointed Nicky Hammond as technical director, ending his 20-year association with Reading.'], ['The pancreas can be triggered to regenerate itself through a type of fasting diet, say US researchers.'], ['Since their impending merger was announced in January, there has been remarkably little comment about the huge proposed deal to combine Essilor and Luxottica.'], ['A "medal at any cost" approach created a "culture of fear" at British Cycling, says former rider Wendy Houvenaghel.'], ['Have you heard the one about the computer programmer who bought a failing comedy club in Texas and turned it into a million dollar a year business?'], ["The reaction from BT's investors told us much about media regulator Ofcom's ruling on the fate of Openreach, the BT subsidiary that provides much of the UK's broadband infrastructure."], ["Manager Brendan Rodgers is sure Celtic can exploit the wide open spaces of Hampden when they meet Rangers in Sunday's League Cup semi-final."]]

I then copied rouge from 1.11.1, called it OldRouge for this examination:

from typing import Dict, List

from unitxt.metrics import HuggingfaceMetric


class OldRouge(HuggingfaceMetric):
    hf_metric_name = "rouge"
    main_score = "rougeL"
    scale = 1.0

    prediction_type = "str"
    single_reference_per_prediction = False  # multiple references allowed

    use_aggregator: bool = True
    rouge_types: List[str] = ["rouge1", "rouge2", "rougeL", "rougeLsum"]

    sent_split_newline: bool = True

    _requirements_list: List[str] = ["nltk", "rouge_score"]

    def prepare(self):
        super().prepare()

        self.hf_compute_args.update(
            {"use_aggregator": self.use_aggregator, "rouge_types": self.rouge_types}
        )

        import nltk

        nltk.download("punkt")
        self.sent_tokenize = nltk.sent_tokenize

    def compute(self, references, predictions, task_data: List[Dict]):
        if self.sent_split_newline:
            predictions = [
                "\n".join(self.sent_tokenize(prediction.strip()))
                for prediction in predictions
            ]
            references = [
                ["\n".join(self.sent_tokenize(r.strip())) for r in reference]
                for reference in references
            ]
        return super().compute(references, predictions, task_data)

and then, easily produced both scores:

for metric in [OldRouge(), Rouge()]:
    print(type(metric))
    outputs = apply_metric(metric, predictions, references)
    print_dict(outputs[0]["score"])
    print("\n")

and received:

<class 'unitxt.metrics.OldRouge'>

global:
    rouge1 (float64):
        0.2873483299646091
    rouge2 (float64):
        0.08167020624584168
    rougeL (float64):
        0.20884075796928347
    rougeLsum (float64):
        0.20809322745400183
    score (float64):
        0.20884075796928347
    score_name (str):
        rougeL
    score_ci_low (float64):
        0.1643165941694017
    score_ci_high (float64):
        0.2660200915187788
    rougeL_ci_low (float64):
        0.1643165941694017
    rougeL_ci_high (float64):
        0.2660200915187788
instance:
    rouge1 (float64):
        0.42424242424242425
    rouge2 (float64):
        0.19354838709677422
    rougeL (float64):
        0.30303030303030304
    rougeLsum (float64):
        0.30303030303030304
    score (float64):
        0.30303030303030304
    score_name (str):
        rougeL

<class 'unitxt.metrics.Rouge'>

global:
    rouge1 (float64):
        0.28802664396739114
    rouge2 (float64):
        0.08172129913073843
    rougeLsum (float64):
        0.20879996542849094
    rougeL (float64):
        0.20879996542849094
    score (float64):
        0.20879996542849094
    score_name (str):
        rougeL
    rouge1_ci_low (float64):
        0.23669188547490233
    rouge1_ci_high (float64):
        0.34410005760392737
    rouge2_ci_low (float64):
        0.04442823342518798
    rouge2_ci_high (float64):
        0.13301823219319187
    rougeLsum_ci_low (float64):
        0.15700219128325088
    rougeLsum_ci_high (float64):
        0.2718109259051072
    rougeL_ci_low (float64):
        0.15700219128325088
    rougeL_ci_high (float64):
        0.2718109259051072
    score_ci_low (float64):
        0.15700219128325088
    score_ci_high (float64):
        0.2718109259051072
instance:
    rouge1 (float):
        0.42424242424242425
    rouge2 (float):
        0.19354838709677422
    rougeL (float):
        0.30303030303030304
    rougeLsum (float):
        0.30303030303030304
    score (float):
        0.30303030303030304
    score_name (str):
        rougeL

dafnapension commented 1 month ago

For the current implementation of Rouge I got identical results to yours. For the HF global score the results are slightly different, perhaps because of a different seed for the randomization in their bootstrap. In the current implementation we get identical rougeL and rougeLsum; with HF they are very close, but not identical (for both metrics, sent_split_newline=True). I think this has to do with their bootstrapping, which we avoid.

Put side by side:

[image]

and:

[image]

The differences do not look bigger than we expected. The instance scores are all identical (as expected, just a sanity check), and the global scores are not too surprising, I think. Do you see something exceptional?

yoavkatz commented 1 month ago

Hi Dafna. Can you look at all the instance scores and not only the first? Perhaps there is one instance with a big difference that affects the whole average. As I mentioned, in most runs the diff is small, but even in the example above rouge1 has a 0.7 point diff.

1.11.1 global:
    rouge1 (float64): 0.287309678405052
    rouge2 (float64): 0.08183079296761228
    rougeL (float64): 0.20875488491798488
    rougeLsum (float64): 0.20666857055062154

main:
    rouge1 (float64): 0.28802664396739114
    rouge2 (float64): 0.08172129913073843
    rougeLsum (float64): 0.20879996542849094

dafnapension commented 1 month ago

Hi @yoavkatz , of course! Here is the small script I used (over the given predictions and references above):

outputs_rouge = apply_metric(Rouge(), predictions, references)
outputs_old_rouge = apply_metric(OldRouge(), predictions, references)

print("\tCurrent Rouge\tHF Rouge")
for i, (current, old) in enumerate(zip(outputs_rouge, outputs_old_rouge)):
    print(f"instance {i}")
    for score_name in ["rouge1", "rouge2", "rougeL", "rougeLsum", "score", "score_name"]:
        cu_score = current["score"]["instance"][score_name]
        ol_score = old["score"]["instance"][score_name]
        print(f"{score_name};{cu_score};{ol_score}")

and got the table below. It seems that all instances have identical scores:


| | Current Rouge | HF Rouge |
| -- | -- | -- |
| **instance 0** | | |
| rouge1 | 0.424242424 | 0.424242424 |
| rouge2 | 0.193548387 | 0.193548387 |
| rougeL | 0.303030303 | 0.303030303 |
| rougeLsum | 0.303030303 | 0.303030303 |
| score | 0.303030303 | 0.303030303 |
| score_name | rougeL | rougeL |
| **instance 1** | | |
| rouge1 | 0.375 | 0.375 |
| rouge2 | 0.2 | 0.2 |
| rougeL | 0.375 | 0.375 |
| rougeLsum | 0.375 | 0.375 |
| score | 0.375 | 0.375 |
| score_name | rougeL | rougeL |
| **instance 2** | | |
| rouge1 | 0.372093023 | 0.372093023 |
| rouge2 | 0.097560976 | 0.097560976 |
| rougeL | 0.23255814 | 0.23255814 |
| rougeLsum | 0.23255814 | 0.23255814 |
| score | 0.23255814 | 0.23255814 |
| score_name | rougeL | rougeL |
| **instance 3** | | |
| rouge1 | 0.3125 | 0.3125 |
| rouge2 | 0.066666667 | 0.066666667 |
| rougeL | 0.3125 | 0.3125 |
| rougeLsum | 0.3125 | 0.3125 |
| score | 0.3125 | 0.3125 |
| score_name | rougeL | rougeL |
| **instance 4** | | |
| rouge1 | 0.358974359 | 0.358974359 |
| rouge2 | 0.054054054 | 0.054054054 |
| rougeL | 0.153846154 | 0.153846154 |
| rougeLsum | 0.153846154 | 0.153846154 |
| score | 0.153846154 | 0.153846154 |
| score_name | rougeL | rougeL |
| **instance 5** | | |
| rouge1 | 0.2 | 0.2 |
| rouge2 | 0 | 0 |
| rougeL | 0.1 | 0.1 |
| rougeLsum | 0.1 | 0.1 |
| score | 0.1 | 0.1 |
| score_name | rougeL | rougeL |
| **instance 6** | | |
| rouge1 | 0.222222222 | 0.222222222 |
| rouge2 | 0.117647059 | 0.117647059 |
| rougeL | 0.111111111 | 0.111111111 |
| rougeLsum | 0.111111111 | 0.111111111 |
| score | 0.111111111 | 0.111111111 |
| score_name | rougeL | rougeL |
| **instance 7** | | |
| rouge1 | 0.170212766 | 0.170212766 |
| rouge2 | 0 | 0 |
| rougeL | 0.127659574 | 0.127659574 |
| rougeLsum | 0.127659574 | 0.127659574 |
| score | 0.127659574 | 0.127659574 |
| score_name | rougeL | rougeL |
| **instance 8** | | |
| rouge1 | 0.254545455 | 0.254545455 |
| rouge2 | 0.037735849 | 0.037735849 |
| rougeL | 0.181818182 | 0.181818182 |
| rougeLsum | 0.181818182 | 0.181818182 |
| score | 0.181818182 | 0.181818182 |
| score_name | rougeL | rougeL |
| **instance 9** | | |
| rouge1 | 0.19047619 | 0.19047619 |
| rouge2 | 0.05 | 0.05 |
| rougeL | 0.19047619 | 0.19047619 |
| rougeLsum | 0.19047619 | 0.19047619 |
| score | 0.19047619 | 0.19047619 |
| score_name | rougeL | rougeL |

dafnapension commented 1 month ago

The difference you are looking at (which you bolded in https://github.com/IBM/unitxt/issues/1078#issuecomment-2257871344) is not 0.7, it is 0.0007.

yoavkatz commented 1 month ago

> The difference you are looking at (which you bolded in #1078 (comment)) is not 0.7, it is 0.0007.

You are right (I meant 0.07 points, which is 0.0007 in absolute numbers).

The fact that all the instance scores are the same, but the aggregation is different, is something to consider.

The reason is that in the OldRouge:

  1. Instance results - each instance was passed to the HF metric on its own.
  2. Global results - were calculated by passing all the predictions and references to the HF metric.

In the new code:

  1. Instance results - each instance is calculated on its own.
  2. Global result - is the average of the instance results.

So it seems there is some difference in the global result between the two approaches.
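
For illustration, a rough sketch of the two call patterns using the HF evaluate API directly (not the unitxt internals; predictions and references are the lists posted above):

import evaluate

rouge = evaluate.load("rouge")

# Old global result: one call over the full lists, aggregated by HF.
global_old = rouge.compute(
    predictions=predictions, references=references, use_aggregator=True
)["rougeL"]

# New global result: score each instance separately, then take the plain average.
per_instance = [
    rouge.compute(predictions=[p], references=[r], use_aggregator=False)["rougeL"][0]
    for p, r in zip(predictions, references)
]
global_new = sum(per_instance) / len(per_instance)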

yoavkatz commented 1 month ago

This is the rouge code:

https://huggingface.co/spaces/evaluate-metric/rouge/blob/e2671c0764b07f287918af2338dfbd162c14cd07/rouge.py#L121

dafnapension commented 1 month ago

Hi @yoavkatz, yes, in our implementation we average the instance scores to get the global result. HF, when use_aggregator=True, simply bootstraps the instance scores: it resamples from the list of instance scores many times, averages each resample to get the global score of that resample, and returns to us the median of these (resampled) global scores. So I think we can anticipate some difference.
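
To make the aggregation difference concrete, here is a minimal sketch using the rouge_score package that the HF metric wraps (single reference per prediction for simplicity; defaults such as stemming may differ from unitxt's configuration):

import numpy as np
from rouge_score import rouge_scorer, scoring

scorer = rouge_scorer.RougeScorer(["rougeL"])

# Our implementation: the global score is the plain mean of per-instance f-measures.
instance_scores = [
    scorer.score(refs[0], pred)["rougeL"].fmeasure
    for pred, refs in zip(predictions, references)
]
print("mean:", np.mean(instance_scores))

# HF with use_aggregator=True: bootstrap the per-instance scores and report
# the median (mid) of the resampled averages.
aggregator = scoring.BootstrapAggregator()
for pred, refs in zip(predictions, references):
    aggregator.add_scores(scorer.score(refs[0], pred))
print("bootstrap mid:", aggregator.aggregate()["rougeL"].mid.fmeasure)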

I will run another experiment: I will use OldRouge, but with use_aggregator=False, and average the returned list (of instance scores) myself. Coming up.

yoavkatz commented 1 month ago

Now I understand. Thank you. I also talked with Elron. Since people are used to the HF Rouge score, we need to be comparable with it. One way to do it is to actually run the same code, which uses their bootstrapping and not ours. This would require changing the Rouge code back to a GlobalMetric.

Another option is to use our bootstrapping, but allow overriding the score with the median of the bootstrapping, like they do (if a flag is set, not by default).

dafnapension commented 1 month ago

Thanks, @yoavkatz, just to complete the "proof": indeed, averaging the list returned from HF when use_aggregator=False yields the same results as we get with our implementation:


| | Current Rouge | HF Rouge |
| -- | -- | -- |
| rouge1 | 0.288026644 | 0.288026644 |
| rougeLsum | 0.208799965 | 0.208799965 |
| rougeL | 0.208799965 | 0.208799965 |
| score | 0.208799965 | 0.208799965 |
| score_name | rougeL | rougeL |
| rouge2 | 0.081721299 | 0.081721299 |
| rouge1_ci_low | 0.236691885 | not_computed |
| rouge1_ci_high | 0.344100058 | not_computed |
| rougeLsum_ci_low | 0.157002191 | not_computed |
| rougeLsum_ci_high | 0.271810926 | not_computed |
| rougeL_ci_low | 0.157002191 | not_computed |
| rougeL_ci_high | 0.271810926 | not_computed |
| score_ci_low | 0.157002191 | not_computed |
| score_ci_high | 0.271810926 | not_computed |
| rouge2_ci_low | 0.044428233 | not_computed |
| rouge2_ci_high | 0.133018232 | not_computed |

(generated by this piece of code, which uses OldRouge, now with use_aggregator=False and n_resamples=0, since we cannot compute confidence intervals over vectors):

import numpy as np  # needed for np.nanmean below

outputs_rouge = apply_metric(Rouge(), predictions, references)
outputs_old_rouge = apply_metric(OldRouge(), predictions, references)

print("*** global score of old_rouge:*****")
old_global = outputs_old_rouge[0]["score"]["global"]
print_dict(old_global)

print("*** averaging the list of scores using np.nanmean()*** ")
for score_name in old_global:
    if score_name == "score_name":
        continue
    old_global[score_name] = np.nanmean(old_global[score_name])
print_dict(old_global)

print("*** comparing averaged_old_global   against current_global*****")
current_global = outputs_rouge[0]["score"]["global"]
for score_name in current_global:
    cu_score = current_global[score_name]
    if score_name in old_global:
        ol_score = old_global[score_name]
    else:
        ol_score = 'not_computed'
    print(f"{score_name};{cu_score};{ol_score}")

yoavkatz commented 1 month ago

The second option may be simpler:

Just change the code here:

[image]

What do you think?

dafnapension commented 1 month ago

Hi @yoavkatz, I think we can also offer all the options we have, explain the pros and cons, and let the user choose whatever they want.

yoavkatz commented 1 month ago

We want to make it simple - and backward compatible. Later we can change. So we suggest:

1) Have a flag in the metric, override_score_with_ci_mid, which for now will only be set to true in Rouge.

2) Change the above code to:

result[f"{full_score_name}_ci_low"] = ci.low
result[f"{full_score_name}_ci_high"] = ci.high
if self.override_score_with_ci_mid:
    result[full_score_name] = ci.mid

if score_name == self.main_score:
    result["score_ci_low"] = ci.low
    result["score_ci_high"] = ci.high
    if self.override_score_with_ci_mid:
        result["score"] = ci.mid

3) Set n_resamples to 1000 in Rouge.
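
A rough sketch of how items 1 and 3 might look on the metric definition (the base class and the omitted fields are assumptions; override_score_with_ci_mid is the flag name proposed above, not an existing unitxt field):

from unitxt.metrics import InstanceMetric  # assumed base class of the current Rouge

class Rouge(InstanceMetric):
    main_score = "rougeL"
    override_score_with_ci_mid: bool = True  # item 1: the proposed flag
    n_resamples: int = 1000  # item 3: match the HF rouge default
    # ... remaining fields of the existing Rouge metric unchanged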

dafnapension commented 1 month ago

Coming up. Set n_resamples to 1000, the same as HF uses?

yoavkatz commented 1 month ago

Yes. That's the default there.

dafnapension commented 1 month ago

Hi @yoavkatz, I am pushing a PR for you to see. I am running the inference I used throughout the day to compare. Still not the same scores. Looking into it.