ghaddarAbs opened this issue 2 years ago
Okay, I did a quick inspection, and my five-cent quick fix is to change the pattern in the regex functions below in rouge:
https://github.com/google-research/google-research/blob/94ef1c5992057967305cef6cbdd94ab995191279/rouge/tokenize.py#L28 https://github.com/google-research/google-research/blob/94ef1c5992057967305cef6cbdd94ab995191279/rouge/tokenize.py#L32
to [^a-z0-9\u0621-\u064a\ufb50-\ufdff\ufe70-\ufefc]+
In fact, the long-term solution would be for Google to add support for other languages in their rouge library and update the code, but that will be slow.
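For reference, here is a minimal sketch of what the patched tokenizer could look like. It is paraphrased from rouge/tokenize.py rather than copied verbatim, and it assumes the two linked lines are the non-alphanumeric substitution and the final token filter; the added Unicode ranges cover the Arabic letters and the Arabic presentation forms. Applying it means editing the installed rouge_score/tokenize.py in place.

import re

def tokenize(text, stemmer=None):
    """Tokenize text, keeping Latin alphanumerics and Arabic characters."""
    text = text.lower()
    # L28 equivalent: turn everything outside the allowed ranges into spaces.
    text = re.sub(r"[^a-z0-9\u0621-\u064a\ufb50-\ufdff\ufe70-\ufefc]+", " ", text)
    tokens = re.split(r"\s+", text)
    if stemmer:
        # Same rule as upstream: only stem words longer than 3 characters.
        tokens = [stemmer.stem(x) if len(x) > 3 else x for x in tokens]
    # L32 equivalent: drop empty tokens and tokens with disallowed characters.
    tokens = [x for x in tokens if re.match(r"^[a-z0-9\u0621-\u064a\ufb50-\ufdff\ufe70-\ufefc]+$", x)]
    return tokens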
With this fix, the ROUGE scores on the example below should be:
from rouge_score import rouge_scorer
# Assumes the tokenizer patch sketched above has been applied to rouge_score.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])
gold = "اختر العمر المستقبلي. كن عفويا."
pred = "ابحث عن العمر الذي تريد أن تكونه في المستقبل. تحدث عن نفسك الحالية. فكر في قيمك. فكر في الأشياء التي تجيدها."
gold_2 = gold.replace(". ", ".\n")  # put each sentence on its own line
pred_2 = pred.replace(". ", ".\n")
score = scorer.score(gold_2, pred_2)
print({key: value.fmeasure * 100 for key, value in score.items()})  # {'rouge1': 7.6923076923076925, 'rouge2': 0.0, 'rougeL': 7.6923076923076925}
Hello @ghaddarAbs, first, many thanks for your comment. I would like to use the ROUGE score with an Arabic text summarization model, but it gives zeros, so could you please explain how to use your method?
Hey @ghaddarAbs, thanks very much for catching that. I spent a lot of time trying to figure out why my ROUGE scores were almost always zero. I used your suggested modification to create a wrapper for it: https://github.com/ARBML/rouge_score_ar. By the way, which tables are you referring to when you say Tables 7 and B2? Are you referring to the arXiv or ACL Anthology version?
@ghaddarAbs and @zaidalyafeai , you can verify with more examples here in this notebook if you wish:
https://colab.research.google.com/drive/1WdDnfNOa9QRevMsvGfPGiYnnnQ7vnycv#scrollTo=W2PTrScMExLp
Hi,
Thanks for sharing the code and models of your great paper.
I think the ROUGE scores for the text summarization task in your paper are miscalculated. The bug lies in these lines:
https://github.com/UBC-NLP/araT5/blob/c80cbfa1f06891aced9b265476f9ca2c2b122622/examples/run_trainier_seq2seq_huggingface.py#L573-L574
Let me explain. First, rouge_score (which is embedded in the HF datasets library) doesn't work on Arabic text. Here is a simple example where the reference and prediction are exactly the same:
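Here is a minimal sketch of that example (the Arabic string is just the gold sentence from comment 2, reused as a placeholder): with the stock rouge_score tokenizer, even an identical reference and prediction score zero, because every Arabic character is stripped before matching.

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])
text = "اختر العمر المستقبلي. كن عفويا."  # reference and prediction are identical
score = scorer.score(text, text)
print({key: value.fmeasure * 100 for key, value in score.items()})
# {'rouge1': 0.0, 'rouge2': 0.0, 'rougeL': 0.0}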
This happens because the default tokenizer of the Google rouge wrapper deletes every character outside a-z0-9 (see comment 2 for a solution).
However, rouge works well on English:
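A quick sketch of the English counterpart (placeholder sentence, reusing the scorer built above): identical English strings get the expected perfect score.

text_en = "pick your future age. be spontaneous."  # placeholder English sentence
score = scorer.score(text_en, text_en)
print({key: value.fmeasure * 100 for key, value in score.items()})
# {'rouge1': 100.0, 'rouge2': 100.0, 'rougeL': 100.0}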
In your code, you commented out these 2 lines because they give scores around 1%-2% (I will explain why later):
https://github.com/UBC-NLP/araT5/blob/c80cbfa1f06891aced9b265476f9ca2c2b122622/examples/run_trainier_seq2seq_huggingface.py#L576-L577
and replaced them with these 2 lines:
https://github.com/UBC-NLP/araT5/blob/c80cbfa1f06891aced9b265476f9ca2c2b122622/examples/run_trainier_seq2seq_huggingface.py#L573-L574
What you actually did is effectively divide the number of "\\n \\n" spans in the reference by the number in the prediction. Here is a simple example where the gold reference has 2 sentences and the prediction has 4 sentences (a hedged reconstruction, <gold_3, pred_3>, is sketched below). As you can see in that example, the 50% ROUGE is because you predicted 4 sentences (hence 3 "\\n \\n" spans) while the reference contains only 2 (1 span), and the 33% is because you have 1 and 3 "\\n \\n" n-grams in the reference and prediction respectively (1/3).
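Below is a hedged reconstruction of that <gold_3, pred_3> example. It reuses the gold/pred strings from comment 2 and assumes the replacement lines join sentences with a literal "\\n \\n" separator; that join is my guess at what lines 573-574 produce, not a copy of them, and the scorer here uses the stock (unpatched) tokenizer.

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])
gold = "اختر العمر المستقبلي. كن عفويا."  # 2 sentences
pred = "ابحث عن العمر الذي تريد أن تكونه في المستقبل. تحدث عن نفسك الحالية. فكر في قيمك. فكر في الأشياء التي تجيدها."  # 4 sentences
gold_3 = gold.replace(". ", ". \\n \\n ")  # 1 literal "\n \n" separator
pred_3 = pred.replace(". ", ". \\n \\n ")  # 3 literal "\n \n" separators
score = scorer.score(gold_3, pred_3)
print({key: value.fmeasure * 100 for key, value in score.items()})
# With the stock [^a-z0-9]+ tokenizer only the "n" characters of the separators
# survive, so this prints roughly 50% for rouge1/rougeL and 33% for rouge2:
# a ratio of sentence counts, not a measure of content overlap.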
In fact, I re-ran the text summarization experiments internally, using both your models and mine, and found that the results are comparable to those in your paper when using your method. On the other hand, when adding \n correctly, the scores are between 1% and 2%. Non-zero ROUGE scores occur only when the reference and prediction contain the same English words, which rarely happens. In fact, the results in Tables 7 and B2 are just count(sent_num_ref) / count(sent_num_pred). Don't get me wrong, your models are good; they just need to be evaluated correctly (see comment 2).
It would be great if you could fix your code and adjust the numbers in Tables 7 and B2 in your paper.
Thanks