Closed (yoavkatz closed this issue 1 month ago)
@dafnapension - If possible, please give this priority, because we want to make a new release this week.
Hi @yoavkatz, I will gladly do so. A difference is expected, since the older version (the global HF metric) with use_aggregator=True returned a bootstrapped score, whereas now we return a simple average of the instance scores. But I will verify this and try to understand why the diff is so big.
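To make the distinction concrete, here is a minimal sketch of the two aggregation modes, calling the HF evaluate package directly (this is not the unitxt code; `predictions` and `references` are the usual lists of prediction strings and reference lists):

```python
# Minimal sketch of the two aggregation modes of the HF rouge metric.
# Not the unitxt implementation; only illustrates where the diff comes from.
import evaluate

rouge = evaluate.load("rouge")

# Old behaviour (global HF metric): bootstrap-aggregated score over the whole dataset.
aggregated = rouge.compute(
    predictions=predictions, references=references, use_aggregator=True
)

# New behaviour: per-instance scores, then a plain (deterministic) mean.
per_instance = rouge.compute(
    predictions=predictions, references=references, use_aggregator=False
)
# In recent versions of evaluate, each entry is a list of per-instance f-measures.
simple_average = sum(per_instance["rougeL"]) / len(per_instance["rougeL"])
```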
Hi Dafna. Thanks. Instead of calling the HF inference engine, you can just copy the "target" of instance i to the prediction of instance i+1.
This will simulate a model and should solve your problem.
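Something like this, for example (a hypothetical sketch; the dataset and field names are placeholders, not the actual unitxt objects):

```python
# Hypothetical sketch of the suggestion above: use each instance's target,
# shifted by one position, as the simulated model prediction.
targets = [instance["target"] for instance in dataset]  # "dataset" is a placeholder
predictions = targets[-1:] + targets[:-1]  # prediction of instance i+1 is the target of instance i
references = [[t] for t in targets]        # single reference per instance
```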
Thanks, @yoavkatz , I did manage to run something, and at least came out with some references and predictions:
predictions = ['Prisoners in Wales are facing a "desperate need" for one-bedroom flats, a charity has said.', 'A man has been charged with armed robbery after a man was arrested in Edinburgh.', 'Four teenagers have been charged with hate crimes after a white man was beaten and beaten in a Chicago court.', 'West Bromwich Albion have appointed former Arsenal goalkeeper Mark Hughes as their new director of football.', 'A fasting diet that mimics famine and famine has been shown to reverse the symptoms of type 1 and type 2 diabetes.', 'The merger between two major European manufacturers of spectacle frames and lenses is a big deal.', 'Wendy Houvenaghel has said she felt "vindicated" by British Cycling\'s failures in the World Class Programme.', 'The success of comedy clubs in the US is largely due to the fact that people are willing to laugh.', 'BT\'s shares were up 3% on Thursday after the company\'s chief executive, a former Ofcom executive, said the company was "not', 'Brendan Rodgers says he is looking forward to his first Old Firm derby with Celtic on Saturday.']
references = [['There is a "chronic" need for more housing for prison leavers in Wales, according to a charity.'], ['A man has appeared in court after firearms, ammunition and cash were seized by police in Edinburgh.'], ['Four people accused of kidnapping and torturing a mentally disabled man in a "racially motivated" attack streamed on Facebook have been denied bail.'], ['West Brom have appointed Nicky Hammond as technical director, ending his 20-year association with Reading.'], ['The pancreas can be triggered to regenerate itself through a type of fasting diet, say US researchers.'], ['Since their impending merger was announced in January, there has been remarkably little comment about the huge proposed deal to combine Essilor and Luxottica.'], ['A "medal at any cost" approach created a "culture of fear" at British Cycling, says former rider Wendy Houvenaghel.'], ['Have you heard the one about the computer programmer who bought a failing comedy club in Texas and turned it into a million dollar a year business?'], ["The reaction from BT's investors told us much about media regulator Ofcom's ruling on the fate of Openreach, the BT subsidiary that provides much of the UK's broadband infrastructure."], ["Manager Brendan Rodgers is sure Celtic can exploit the wide open spaces of Hampden when they meet Rangers in Sunday's League Cup semi-final."]]
I then copied Rouge from 1.11.1 and called it OldRouge for this examination:
```python
from typing import Dict, List


class OldRouge(HuggingfaceMetric):
    hf_metric_name = "rouge"
    main_score = "rougeL"
    scale = 1.0

    prediction_type = "str"
    single_reference_per_prediction = False  # multiple references allowed

    use_aggregator: bool = True
    rouge_types: List[str] = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
    sent_split_newline: bool = True

    _requirements_list: List[str] = ["nltk", "rouge_score"]

    def prepare(self):
        super().prepare()
        self.hf_compute_args.update(
            {"use_aggregator": self.use_aggregator, "rouge_types": self.rouge_types}
        )

        import nltk

        nltk.download("punkt")
        self.sent_tokenize = nltk.sent_tokenize

    def compute(self, references, predictions, task_data: List[Dict]):
        if self.sent_split_newline:
            predictions = [
                "\n".join(self.sent_tokenize(prediction.strip()))
                for prediction in predictions
            ]
            references = [
                ["\n".join(self.sent_tokenize(r.strip())) for r in reference]
                for reference in references
            ]
        return super().compute(references, predictions, task_data)
```
and then easily produced both scores:
```python
for metric in [OldRouge(), Rouge()]:
    print(type(metric))
    outputs = apply_metric(metric, predictions, references)
    print_dict(outputs[0]["score"])
    print("\n")
```
and received:
<class 'unitxt.metrics.OldRouge'>
global:
rouge1 (float64):
0.2873483299646091
rouge2 (float64):
0.08167020624584168
rougeL (float64):
0.20884075796928347
rougeLsum (float64):
0.20809322745400183
score (float64):
0.20884075796928347
score_name (str):
rougeL
score_ci_low (float64):
0.1643165941694017
score_ci_high (float64):
0.2660200915187788
rougeL_ci_low (float64):
0.1643165941694017
rougeL_ci_high (float64):
0.2660200915187788
instance:
rouge1 (float64):
0.42424242424242425
rouge2 (float64):
0.19354838709677422
rougeL (float64):
0.30303030303030304
rougeLsum (float64):
0.30303030303030304
score (float64):
0.30303030303030304
score_name (str):
rougeL
<class 'unitxt.metrics.Rouge'>
global:
rouge1 (float64):
0.28802664396739114
rouge2 (float64):
0.08172129913073843
rougeLsum (float64):
0.20879996542849094
rougeL (float64):
0.20879996542849094
score (float64):
0.20879996542849094
score_name (str):
rougeL
rouge1_ci_low (float64):
0.23669188547490233
rouge1_ci_high (float64):
0.34410005760392737
rouge2_ci_low (float64):
0.04442823342518798
rouge2_ci_high (float64):
0.13301823219319187
rougeLsum_ci_low (float64):
0.15700219128325088
rougeLsum_ci_high (float64):
0.2718109259051072
rougeL_ci_low (float64):
0.15700219128325088
rougeL_ci_high (float64):
0.2718109259051072
score_ci_low (float64):
0.15700219128325088
score_ci_high (float64):
0.2718109259051072
instance:
rouge1 (float):
0.42424242424242425
rouge2 (float):
0.19354838709677422
rougeL (float):
0.30303030303030304
rougeLsum (float):
0.30303030303030304
score (float):
0.30303030303030304
score_name (str):
rougeL
For the current implementation of Rouge I got results identical to yours. For the HF global scores, slightly different ones, perhaps due to a different seed for the randomization in generating their bootstrap. In the current implementation we get identical rougeL and rougeLsum; with HF they are very close but not identical (for both metrics, sent_split_newline: bool = True). I think this has to do with their bootstrapping, which we avoid.
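To illustrate the point, here is a sketch using the rouge_score package directly (not the unitxt code; it takes the single reference per instance and skips the newline sentence splitting, which does not matter for the illustration):

```python
# Sketch: plain mean vs. bootstrap aggregation of per-instance rougeL f-measures.
from rouge_score import rouge_scorer, scoring

scorer = rouge_scorer.RougeScorer(["rougeL"])
scores = [scorer.score(refs[0], pred) for pred, refs in zip(predictions, references)]

# Plain mean of the per-instance f-measures (deterministic).
plain_mean = sum(s["rougeL"].fmeasure for s in scores) / len(scores)

# Bootstrap aggregation, as the HF metric does with use_aggregator=True;
# the reported "mid" value depends on random resampling, hence the small,
# seed-dependent difference in the global scores.
aggregator = scoring.BootstrapAggregator()
for s in scores:
    aggregator.add_scores(s)
bootstrap_mid = aggregator.aggregate()["rougeL"].mid.fmeasure
```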
Put side by side (the two global listings above), the differences do not look larger than we expected. Instance scores are all identical (as expected, just a sanity check), and the global scores are not too surprising, I think. Do you see anything exceptional?
Hi Dafna. Can you look at all the instance scores and not only the first? Perhaps there is one instance with a big difference that affects the whole average. As I mentioned, in most runs the diff is small, but even in the example above rouge1 has a 0.7 point diff.
1.11.1 global:
  rouge1 (float64): 0.287309678405052
  rouge2 (float64): 0.08183079296761228
  rougeL (float64): 0.20875488491798488
  rougeLsum (float64): 0.20666857055062154

main global:
  rouge1 (float64): 0.28802664396739114
  rouge2 (float64): 0.08172129913073843
  rougeLsum (float64): 0.20879996542849094
Hi @yoavkatz, of course! Here is the small script I used (over the predictions and references given above):
```python
outputs_rouge = apply_metric(Rouge(), predictions, references)
outputs_old_rouge = apply_metric(OldRouge(), predictions, references)

print("\tCurrent Rouge\tHF Rouge")
for i, (current, old) in enumerate(zip(outputs_rouge, outputs_old_rouge)):
    print(f"instance {i}")
    for score_name in ["rouge1", "rouge2", "rougeL", "rougeLsum", "score", "score_name"]:
        cu_score = current["score"]["instance"][score_name]
        ol_score = old["score"]["instance"][score_name]
        print(f"{score_name};{cu_score};{ol_score}")
```
and got this Excel table. It seems that all instances have identical scores:
| | Current Rouge | HF Rouge |
| -- | -- | -- |
| instance 0 | | |
| rouge1 | 0.424242424 | 0.424242424 |
| rouge2 | 0.193548387 | 0.193548387 |
| rougeL | 0.303030303 | 0.303030303 |
| rougeLsum | 0.303030303 | 0.303030303 |
| score | 0.303030303 | 0.303030303 |
| score_name | rougeL | rougeL |
| instance 1 | | |
| rouge1 | 0.375 | 0.375 |
| rouge2 | 0.2 | 0.2 |
| rougeL | 0.375 | 0.375 |
| rougeLsum | 0.375 | 0.375 |
| score | 0.375 | 0.375 |
| score_name | rougeL | rougeL |
| instance 2 | | |
| rouge1 | 0.372093023 | 0.372093023 |
| rouge2 | 0.097560976 | 0.097560976 |
| rougeL | 0.23255814 | 0.23255814 |
| rougeLsum | 0.23255814 | 0.23255814 |
| score | 0.23255814 | 0.23255814 |
| score_name | rougeL | rougeL |
| instance 3 | | |
| rouge1 | 0.3125 | 0.3125 |
| rouge2 | 0.066666667 | 0.066666667 |
| rougeL | 0.3125 | 0.3125 |
| rougeLsum | 0.3125 | 0.3125 |
| score | 0.3125 | 0.3125 |
| score_name | rougeL | rougeL |
| instance 4 | | |
| rouge1 | 0.358974359 | 0.358974359 |
| rouge2 | 0.054054054 | 0.054054054 |
| rougeL | 0.153846154 | 0.153846154 |
| rougeLsum | 0.153846154 | 0.153846154 |
| score | 0.153846154 | 0.153846154 |
| score_name | rougeL | rougeL |
| instance 5 | | |
| rouge1 | 0.2 | 0.2 |
| rouge2 | 0 | 0 |
| rougeL | 0.1 | 0.1 |
| rougeLsum | 0.1 | 0.1 |
| score | 0.1 | 0.1 |
| score_name | rougeL | rougeL |
| instance 6 | | |
| rouge1 | 0.222222222 | 0.222222222 |
| rouge2 | 0.117647059 | 0.117647059 |
| rougeL | 0.111111111 | 0.111111111 |
| rougeLsum | 0.111111111 | 0.111111111 |
| score | 0.111111111 | 0.111111111 |
| score_name | rougeL | rougeL |
| instance 7 | | |
| rouge1 | 0.170212766 | 0.170212766 |
| rouge2 | 0 | 0 |
| rougeL | 0.127659574 | 0.127659574 |
| rougeLsum | 0.127659574 | 0.127659574 |
| score | 0.127659574 | 0.127659574 |
| score_name | rougeL | rougeL |
| instance 8 | | |
| rouge1 | 0.254545455 | 0.254545455 |
| rouge2 | 0.037735849 | 0.037735849 |
| rougeL | 0.181818182 | 0.181818182 |
| rougeLsum | 0.181818182 | 0.181818182 |
| score | 0.181818182 | 0.181818182 |
| score_name | rougeL | rougeL |
| instance 9 | | |
| rouge1 | 0.19047619 | 0.19047619 |
| rouge2 | 0.05 | 0.05 |
| rougeL | 0.19047619 | 0.19047619 |
| rougeLsum | 0.19047619 | 0.19047619 |
| score | 0.19047619 | 0.19047619 |
| score_name | rougeL | rougeL |
Following the Rouge updates, there are changes in rouge scores. In most cases the diff is less than 1 point, but it can be 1-2 points in extreme cases. Not sure why the new implementation should cause any diff.
Code:
1.11.1:
  global:
    rouge1 (float64): 0.287309678405052
    rouge2 (float64): 0.08183079296761228
    rougeL (float64): 0.20875488491798488
    rougeLsum (float64): 0.20666857055062154
    score (float64): 0.20875488491798488
    score_name (str): rougeL
  instance:
    rouge1 (float64): 0.19047619047619052
    rouge2 (float64): 0.05
    rougeL (float64): 0.19047619047619052
    rougeLsum (float64): 0.19047619047619052
    score (float64): 0.19047619047619052
    score_name (str): rougeL

main:
  score:
    global:
      rougeL (float64): 0.20879996542849094
      score (float64): 0.20879996542849094
      score_name (str): rougeL
      rouge1 (float64): 0.28802664396739114
      rouge2 (float64): 0.08172129913073843
      rougeLsum (float64): 0.20879996542849094
      rougeL_ci_low (float64): 0.15700219128325088
      rougeL_ci_high (float64): 0.2718109259051072
      score_ci_low (float64): 0.15700219128325088
      score_ci_high (float64): 0.2718109259051072
      rouge1_ci_low (float64): 0.23669188547490233
      rouge1_ci_high (float64): 0.34410005760392737
      rouge2_ci_low (float64): 0.04442823342518798
      rouge2_ci_high (float64): 0.13301823219319187
      rougeLsum_ci_low (float64): 0.15700219128325088
      rougeLsum_ci_high (float64): 0.2718109259051072
    instance:
      rouge1 (float): 0.19047619047619052
      rouge2 (float): 0.05
      rougeL (float): 0.19047619047619052
      rougeLsum (float): 0.19047619047619052
      score (float): 0.19047619047619052
      score_name (str): rougeL