fani-lab / LADy

LADy 💃: A Benchmark Toolkit for Latent Aspect Detection Enriched with Backtranslation Augmentation

Updating stats on quality of translation #68

Open farinamhz opened 6 months ago

farinamhz commented 6 months ago

Hi @Lillliant,

As you move forward with updating the README, you will need to calculate the metrics that we previously had for translation between English and other languages, such as the exact match metric for the twitter dataset using the nllb translator, so that a row for twitter can be added there. You can find its review files in our OneDrive under this path: LADy>LADy0.2.0.0>twitter.

Also, we need to calculate the translation quality of our new translator (googletranslate), and you can provide the results for all the datasets (semeval-14-laptop, semeval-14-restaurant, semeval-15-restaurant, semeval-16-restaurant, twitter) here. (In fact, this one should not be added to the README yet.) I have provided all the review files for googletranslate in our OneDrive under this path: LADy>LADy0.2.0.1. (I am uploading the files right now, so they will be available about ~2h after this issue has been posted.)
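For reference, a minimal sketch of how the exact match metric between original and back-translated reviews could be computed; the file layout, column names, and filename below are assumptions for illustration, not LADy's actual API:

```python
import pandas as pd

def exact_match(path: str) -> float:
    """Fraction of back-translated reviews identical (after stripping/lowercasing)
    to the original English review."""
    df = pd.read_csv(path)  # assumed columns: 'original', 'backtranslated'
    orig = df['original'].str.strip().str.lower()
    back = df['backtranslated'].str.strip().str.lower()
    return (orig == back).mean()

# hypothetical filename for the twitter review file
print(exact_match('twitter.nllb.reviews.csv'))
```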

Lillliant commented 5 months ago

Hi @farinamhz, I've calculated the statistics for the twitter dataset and googletranslate's review files, and I have uploaded them to the OneDrive paths under LADy0.2.0.1 > statistics.

farinamhz commented 5 months ago

Previous conversation:

[Thursday 9:16 PM] Christine Wong Hi Farinam Hemmati Zadeh, I unfortunately cannot make it to Friday's progress meeting this week, but I've added my progress to the issues pages and have made a PR so it can be reviewed. In addition to the questions there, I've also noticed that the exact match metrics calculated using LADy0.2.0.0's semeval datasets are different from the semeval+ statistics. Would this be something to be concerned about?

[Friday 11:59 AM] Farinam Hemmati Zadeh Hey Christine, no worries! Thank you very much for the update and your work! You mean the newly translated reviews are different from the previously translated ones, right? If so, that is expected, as the translated reviews for LADy0.2.0.1 are from a new translator (googletranslate).

[Friday 12:42 PM] Farinam Hemmati Zadeh Christine Wong I just realized that you said LADy0.2.0.0. LADy0.2.0.0 should be the same as before, as only twitter has been added to this version. However, I wanted you to calculate the metrics for LADy0.2.0.1, which is for the googletranslate results. I think there was some confusion between these two. Which one have you calculated now?

[Friday 12:44 PM] Farinam Hemmati Zadeh In fact, the previous results should not have any significant difference compared with LADy0.2.0.0. Did they, Christine?

[Friday 9:06 PM] Christine Wong Hi Farinam Hemmati Zadeh, I've calculated the twitter metrics based on LADy0.2.0.0, which was put into the readme.md. I've also calculated all the datasets (semeval + twitter) for the googletranslate results in LADy0.2.0.1, which are not in readme.md but are uploaded to OneDrive.

[Friday 9:16 PM] Christine Wong The results aren't too different (around a 0.01 difference between LADy0.2.0.0 and the data in the readme.md), but I was wondering if it was alright to "mix" the results together, since the twitter metrics might look different if they had been produced at the same time as the metrics from the readme.

[Friday 9:19 PM] Christine Wong I've also attached a run with the semeval15/16 results for better comparison: it seems like the newer version has a higher exact match metric, which might be a good thing since it might suggest similar sentence structure, etc.?

farinamhz commented 5 months ago

Hey @Lillliant, let's continue here. I appreciate the updates you've provided. Everything is going well, but I'm facing some confusion regarding the calculation of the BLEU and ROUGE scores. It appears that longer tweets with diverse contexts do not yield accurate exact match results compared with the semeval datasets. Perhaps we should consider exploring alternative metrics like BLEU and ROUGE in this context. So I want to make sure what the inputs of BLEU and ROUGE are.

Lillliant commented 5 months ago

Hi @farinamhz, sure! I've attached my run of the twitter (LADy0.2.0.0) dataset here:

For BLEU metrics:

| dataset | pes_Arab_bleu | zho_Hans_bleu | deu_Latn_bleu | arb_Arab_bleu | fra_Latn_bleu | spa_Latn_bleu |
|---|---|---|---|---|---|---|
| twitter | 0.2110 | 0.1892 | 0.4025 | 0.33383 | 0.3891 | 0.4439 |
| semeval-2016-restaurant | 0.3746 | 0.3065 | 0.5435 | 0.4465 | 0.5314 | 0.5864 |
| semeval-2015-restaurant | 0.3787 | 0.3080 | 0.5514 | 0.4523 | 0.5318 | 0.5895 |
For ROUGE metrics:

| dataset | pes_Arab_rouge_f | zho_Hans_rouge_f | deu_Latn_rouge_f | arb_Arab_rouge_f | fra_Latn_rouge_f | spa_Latn_rouge_f |
|---|---|---|---|---|---|---|
| twitter | 0.1889 | 0.1677 | 0.3307 | 0.2589 | 0.3117 | 0.3596 |
| semeval-2016-restaurant | 0.2802 | 0.2224 | 0.4258 | 0.3360 | 0.4089 | 0.4628 |
| semeval-2015-restaurant | 0.2783 | 0.2224 | 0.4332 | 0.3387 | 0.4076 | 0.4661 |

Here, the BLEU metrics were obtained by computing, for each sentence, the mean of the BLEU scores calculated with weights=[(1.0,), (0.5, 0.5), (0.3333, 0.3333, 0.3333)], and then averaging over all sentences.
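A minimal sketch of that BLEU computation, assuming NLTK's `sentence_bleu` with whitespace tokenization and a smoothing function (both assumptions, not necessarily what LADy uses):

```python
from statistics import mean
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

WEIGHTS = [(1.0,), (0.5, 0.5), (0.3333, 0.3333, 0.3333)]  # 1- to 3-gram BLEU
smooth = SmoothingFunction().method1  # assumed smoothing to avoid zero scores

def avg_bleu(originals, backtranslations):
    """Mean over sentences of the mean BLEU across the three weight settings."""
    scores = []
    for ref, hyp in zip(originals, backtranslations):
        per_weight = [sentence_bleu([ref.split()], hyp.split(),
                                    weights=w, smoothing_function=smooth)
                      for w in WEIGHTS]
        scores.append(mean(per_weight))
    return mean(scores)
```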

The ROUGE metrics are obtained by computing, for each sentence, the mean of the F1-scores from rouge-1 to rouge-5, and then averaging over all sentences.
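A minimal sketch of that ROUGE computation, assuming the `rouge-score` package (`pip install rouge-score`); the stemming option is an assumption, not necessarily what was used:

```python
from statistics import mean
from rouge_score import rouge_scorer

ROUGE_TYPES = [f'rouge{n}' for n in range(1, 6)]  # rouge-1 .. rouge-5
scorer = rouge_scorer.RougeScorer(ROUGE_TYPES, use_stemmer=True)

def avg_rouge_f(originals, backtranslations):
    """Mean over sentences of the mean F1 across rouge-1 .. rouge-5."""
    scores = []
    for ref, hyp in zip(originals, backtranslations):
        result = scorer.score(ref, hyp)
        scores.append(mean(result[t].fmeasure for t in ROUGE_TYPES))
    return mean(scores)
```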