farinamhz opened 6 months ago
Hi @farinamhz, I've calculated the statistics for the twitter dataset and googletranslate's review files, which I have uploaded to the OneDrive paths under LADy0.2.0.1 > statistics.
Previous conversation:
[Thursday 9:16 PM] Christine Wong Hi Farinam Hemmati Zadeh, I unfortunately cannot make it to Friday's progress meeting this week, but I've added my progress to the issues pages and have made a PR so it can be reviewed. In addition to the questions there, I've also noticed that the exact match metrics calculated using LADy0.2.0.0's semeval datasets are different from the semeval+ statistics. Would this be something to be concerned about?
[Friday 11:59 AM] Farinam Hemmati Zadeh Hey Christine, no worries! Thank you very much for the update and your work! You mean the newly translated reviews are different from the previously translated ones, right? If so, that's expected, as the translated reviews for LADy0.2.0.1 come from a new translator (googletranslate).
[Friday 12:42 PM] Farinam Hemmati Zadeh Christine Wong I just realized that you said LADy0.2.0.0. LADy0.2.0.0 should be the same as before, as only twitter has been added to this version. However, I wanted you to calculate the metrics for LADy0.2.0.1, which is for the googletranslate results. I think there was some confusion between these two. Which one have you calculated?
[Friday 12:44 PM] Farinam Hemmati Zadeh In fact, the previous results should not differ significantly from LADy0.2.0.0. Did they?
[Friday 9:06 PM] Christine Wong Hi Farinam Hemmati Zadeh, I've calculated the twitter metrics based on LADy0.2.0.0, which were put into the readme.md. I've also calculated all the datasets (semeval + twitter) for the googletranslate results in LADy0.2.0.1, which are not in the readme.md but are uploaded to OneDrive.
[Friday 9:16 PM] Christine Wong The results aren't too different (around a 0.01 difference between LADy0.2.0.0 and the data in the readme.md), but I was wondering if it is alright to "mix" the results together, since the twitter metrics might look different if they were produced at the same time as the metrics in the readme.
[Friday 9:19 PM] Christine Wong I've also attached a run with the semeval15/16 results for better comparison: it seems the newer version has a higher exact match (em) metric, which might be a good thing, since it might suggest similar sentence structure, etc.
Hey @Lillliant, Let's continue here. I appreciate the updates you've provided. Everything is going well, but I'm facing some confusion regarding the calculation of BLEU and ROUGE scores. It appears that longer tweets with diverse contexts do not yield accurate exact match results compared with the semeval datasets. Perhaps we should consider exploring alternative metrics like BLEU and ROUGE in this context. So I want to confirm what the inputs to BLEU and ROUGE are.
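For reference, exact match is usually computed as the fraction of translated sentences that reproduce the reference string verbatim, which is why long, freeform tweets score poorly under it. A minimal sketch (the normalization LADy actually applies may differ):

```python
def exact_match(references, hypotheses):
    """Fraction of hypothesis sentences identical to their reference.

    Assumes one hypothesis per reference; only whitespace is normalized,
    which is a simplifying assumption for illustration.
    """
    assert len(references) == len(hypotheses)
    hits = sum(1 for r, h in zip(references, hypotheses) if r.strip() == h.strip())
    return hits / len(references)
```

Because a single word choice difference drops a sentence's contribution to zero, n-gram overlap metrics like BLEU and ROUGE give a smoother signal on diverse text.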
Hi @farinamhz, sure! I've attached my run of the twitter (LADy0.2.0.0) dataset here:
For bleu metrics:

| dataset | pes_Arab_bleu | zho_Hans_bleu | deu_Latn_bleu | arb_Arab_bleu | fra_Latn_bleu | spa_Latn_bleu |
|---|---|---|---|---|---|---|
| twitter | 0.2110 | 0.1892 | 0.4025 | 0.33383 | 0.3891 | 0.4439 |
| semeval-2016-restaurant | 0.3746 | 0.3065 | 0.5435 | 0.4465 | 0.5314 | 0.5864 |
| semeval-2015-restaurant | 0.3787 | 0.3080 | 0.5514 | 0.4523 | 0.5318 | 0.5895 |
For rouge metrics:

| dataset | pes_Arab_rouge_f | zho_Hans_rouge_f | deu_Latn_rouge_f | arb_Arab_rouge_f | fra_Latn_rouge_f | spa_Latn_rouge_f |
|---|---|---|---|---|---|---|
| twitter | 0.1889 | 0.1677 | 0.3307 | 0.2589 | 0.3117 | 0.3596 |
| semeval-2016-restaurant | 0.2802 | 0.2224 | 0.4258 | 0.3360 | 0.4089 | 0.4628 |
| semeval-2015-restaurant | 0.2783 | 0.2224 | 0.4332 | 0.3387 | 0.4076 | 0.4661 |
Here, the bleu metrics were obtained by averaging, over all sentences, the mean of each sentence's bleu scores computed with weights=[(1.0,), (0.5, 0.5), (0.3333, 0.3333, 0.3333)].
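To make the weighting scheme concrete, here is a self-contained sketch of that per-sentence BLEU average. It reimplements the standard modified n-gram precision with brevity penalty from scratch (a toolkit like NLTK's `sentence_bleu` would behave the same on these weight tuples, but may add smoothing; all function names here are illustrative, not LADy's actual code):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(reference, hypothesis, weights):
    # Modified n-gram precision: hypothesis counts clipped by reference counts.
    precisions = []
    for n, _ in enumerate(weights, start=1):
        hyp = Counter(ngrams(hypothesis, n))
        ref = Counter(ngrams(reference, n))
        overlap = sum(min(c, ref[g]) for g, c in hyp.items())
        precisions.append(overlap / max(sum(hyp.values()), 1))
    if min(precisions) == 0:
        return 0.0  # unsmoothed: any empty n-gram overlap zeroes the score
    # Brevity penalty: punish hypotheses shorter than the reference.
    bp = 1.0 if len(hypothesis) > len(reference) else math.exp(1 - len(reference) / len(hypothesis))
    return bp * math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)))

WEIGHTS = [(1.0,), (0.5, 0.5), (0.3333, 0.3333, 0.3333)]

def avg_bleu(reference, hypothesis):
    """Mean of the three weighted BLEU scores for one sentence pair."""
    return sum(sentence_bleu(reference, hypothesis, w) for w in WEIGHTS) / len(WEIGHTS)
```

The corpus-level figure in the table would then be the mean of `avg_bleu` over all sentence pairs in a dataset.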
The rouge metrics were obtained by averaging, over all sentences, the mean of each sentence's F1 scores from rouge-1 to rouge-5.
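A matching sketch of the ROUGE side, assuming rouge-1 through rouge-5 means n-gram overlap F1 for n = 1..5 (a library such as py-rouge computes these; the helper names below are illustrative):

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_f1(reference, hypothesis, n):
    """F1 of clipped n-gram overlap between one reference and one hypothesis."""
    ref = Counter(ngrams(reference, n))
    hyp = Counter(ngrams(hypothesis, n))
    overlap = sum(min(c, ref[g]) for g, c in hyp.items())
    if overlap == 0:
        return 0.0  # also covers sentences shorter than n
    recall = overlap / sum(ref.values())
    precision = overlap / sum(hyp.values())
    return 2 * precision * recall / (precision + recall)

def avg_rouge_f(reference, hypothesis, max_n=5):
    """Mean F1 over rouge-1 .. rouge-5 for one sentence pair."""
    return sum(rouge_n_f1(reference, hypothesis, n) for n in range(1, max_n + 1)) / max_n
```

As with BLEU, the table's per-dataset value would be this per-sentence mean averaged over the whole dataset.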
Hi @Lillliant,
As you move forward with updating the `README`, you will need to calculate the metrics that we previously had for translation between English and other languages, such as the `exact match` metric for the `twitter` dataset using the `nllb` translator, so you can add a row for `twitter` there. You can find its review files in our OneDrive files under this path: LADy>LADy0.2.0.0>twitter.

Also, we need to calculate the quality of our new translator (`googletranslate`) as well, and you can provide the results for all the datasets (`semeval-14-laptop`, `semeval-14-restaurant`, `semeval-15-restaurant`, `semeval-16-restaurant`, `twitter`) here. (In fact, this one should not be added to the README yet.) I have provided you with all the review files for `googletranslate` in our OneDrive files under this path: LADy>LADy0.2.0.1. (I am uploading the files right now, so they will be available ~2h after this issue is posted.)