fani-lab / LADy

LADy 💃: A Benchmark Toolkit for Latent Aspect Detection Enriched with Backtranslation Augmentation

Evaluating results of translation #33

Closed farinamhz closed 1 year ago

farinamhz commented 1 year ago

Hi @Lillliant,

We have implemented back-translation and obtained results on two datasets so far: the first is Semeval-Restaurant-2016, and the other is Semeval-Restaurant-2015.

Now we want to evaluate the translation and back-translation results using the metrics standard in this area; the most important examples are exact match, ROUGE, and BLEU. However, feel free to search and let me know if any other metrics have been used more recently.

You can find the results of the back-translation for Semeval-2016 in data/augmentation/back-translation and for Semeval-2015 in output/augmentation/back-translation-Semeval-15.

D represents the original dataset in English, D.L represents the translated dataset, and D_L represents the back-translated dataset. Now we compare D with D.L, then D.L with D_L, and finally D with D_L to find the values for those metrics.

All the texts or reviews you want to compare, whether in original, translated, or back-translated datasets, can be found in the column "sentences".

Please find the values for the metrics in these two datasets for the languages L in {fra, arb, deu, spa, zho}, which are French, Arabic, German, Spanish, and Chinese.
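Concretely, the comparison setup could look like this (a minimal sketch; the file names and CSV layout are hypothetical, only the "sentences" column is given above):

```python
# Sketch of loading the three datasets and forming the comparison pairs.
# File names are hypothetical; 'sentences' is the column described above.
import pandas as pd

lang = 'fra'  # one of {'fra', 'arb', 'deu', 'spa', 'zho'}
d      = pd.read_csv('D.csv')           # original dataset (English)
d_l    = pd.read_csv(f'D.{lang}.csv')   # translated dataset
d_back = pd.read_csv(f'D_{lang}.csv')   # back-translated dataset

pairs = {
    'D vs D.L':   zip(d['sentences'], d_l['sentences']),
    'D.L vs D_L': zip(d_l['sentences'], d_back['sentences']),
    'D vs D_L':   zip(d['sentences'], d_back['sentences']),
}
```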

Feel free to let me know if you have any concerns or questions about this task.

@hosseinfani

farinamhz commented 1 year ago

Also, the conference submission deadline is April 21st, so we need to gather the results asap. I would be grateful if you could let me know your time estimate for this task. @Lillliant

Lillliant commented 1 year ago

@farinamhz I have most of my exams and marking in the next three days. I'll work on the task and try to finish it by the 16th; if it takes longer than expected, I guarantee to finish it before the 20th.

farinamhz commented 1 year ago

No worries @Lillliant, take your time with the exams and marking. Good luck with the exams! Just let me know if there is any progress by April 17th.

Lillliant commented 1 year ago

Hi @farinamhz @hosseinfani I've added the code for calculating the metrics in metrics.py. The preliminary results for deu can be seen in the following txt file: score-deu.txt

In particular, I noticed that there are different kinds of ROUGE scores. For now I used ROUGE-L, since it measures similarity via the longest common subsequence between the two texts.

I will check to see if there are other metrics useful for this. For now, though, if the code seems logical, I'll generate the metrics for the remaining languages asap.
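For reference, a minimal sketch of how the three metrics could be computed for a sentence pair (not necessarily what metrics.py does; it assumes the `rouge-score` and `nltk` packages are installed):

```python
# Sketch: exact match, ROUGE-L, and BLEU between an original sentence and
# its (back-)translation. Assumes `pip install rouge-score nltk`.
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
smooth = SmoothingFunction().method1  # avoids zero scores on short sentences

def compare(original: str, candidate: str) -> dict:
    exact = float(original.strip().lower() == candidate.strip().lower())
    rouge_l = scorer.score(original, candidate)['rougeL'].fmeasure
    bleu = sentence_bleu([original.split()], candidate.split(),
                         smoothing_function=smooth)
    return {'exact_match': exact, 'rougeL': rouge_l, 'bleu': bleu}

print(compare('the food was great', 'the meal was great'))
```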

farinamhz commented 1 year ago

Thank you so much @Lillliant. Great! You can add the rest of the datasets and languages.

Lillliant commented 1 year ago

@farinamhz

I've run the code for the rest of the languages for the SemEval 2016 and 2015 datasets. Here are the results:

LADy back translation result SemEval 2016.csv
LADy back translation result SemEval 2015.csv

For SemEval 2015, however, I noticed that the D.arb file is not in the repo and the D.zho file is empty. So D->D.arb, D.arb->D_arb, D->D.zho, and D.zho->D_zho remain uncalculated in the previous results. Do you happen to have local copies of these files? If not, can they be reproduced using the settings and code at commit 97fbf402c2f48962bc33960176fab58edcc12589?

farinamhz commented 1 year ago

Hi @Lillliant, Unfortunately, we had some problems with the 2015 and 2016 versions of the datasets that I gave you. Could you please redo the process for these new datasets that I have recently pushed? Also, I have added the 2014 version.

You can find all of them in these directories:

Let me know if there is any problem with these new datasets like before.

Lillliant commented 1 year ago

Hi @farinamhz,

Sure, I'll redo the process with the new datasets after my exams tomorrow and post the results here. I checked the arb and zho datasets and they don't seem to have the same issue as the old sets.

farinamhz commented 1 year ago

That would be great! Thank you so much for your help, @Lillliant.

farinamhz commented 1 year ago

Also, if you have time after this, please find these:

Any other stats you could suggest would also be appreciated.

Please provide them in a table if you can.

@Lillliant

Lillliant commented 1 year ago

@farinamhz

I've added in the results, using the updated code in metrics.py.

average-sentences-tokens-R-['14', '15', '16'].csv
back-translation-metrics-R-14.csv
back-translation-metrics-R-15.csv
back-translation-metrics-R-16.csv

farinamhz commented 1 year ago

Hi @Lillliant Great! Thank you very much.

farinamhz commented 1 year ago

Could you please also add the avg number of tokens (I mean avg #tokens for each sentence) in each dataset? We only have the total number of tokens in each of them now. You can add a new column beside others for that.

@Lillliant

farinamhz commented 1 year ago

Also, please add the avg number of tokens in the sentences and the number of all sentences in the "All languages dataset" for each version. You can find ALL in these files:

@Lillliant

Lillliant commented 1 year ago

> Could you please also add the avg number of tokens (I mean avg #tokens for each sentence) in each dataset? We only have the total number of tokens in each of them now. You can add a new column beside others for that.
>
> @Lillliant

Hi @farinamhz, can you clarify the avg number of tokens? I thought the metric I had was the number of tokens each sentence has on average, not the number of tokens the dataset has in total.
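To be explicit about the two quantities (a minimal sketch with made-up sentences):

```python
# Two toy sentences to illustrate the difference between the two counts.
sentences = ['the food was great', 'service was slow but friendly']
tokens_per_sentence = [len(s.split()) for s in sentences]

total_tokens = sum(tokens_per_sentence)     # total #tokens in the dataset (9)
avg_tokens = total_tokens / len(sentences)  # avg #tokens per sentence (4.5)
```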

Lillliant commented 1 year ago

Also, I've calculated the numbers for the All languages dataset. Does the avg # of tokens look alright? All-lang-average-sentences-tokens-R-['14', '15', '16'].csv

farinamhz commented 1 year ago

Thank you! Yes, that is what I meant: the average number of tokens for the sentences in a dataset. Now we want this average for each of the datasets we have; you can add a new column in that file beside the total tokens and sentences (like what you did for all languages). @Lillliant

Lillliant commented 1 year ago

@farinamhz

I've added the calculations for the original, translated, and back-translated datasets for each of SemEval 2014/15/16 in the five languages (and also the average between the original and back-translated, in case it is needed).

Also, I found a minor bug in the past results for the Chinese translated dataset: because Chinese uniquely does not put spaces between words, the program was mistakenly counting entire sentences or sentence fragments as single tokens. This result reflects the more accurate approach of counting each character as a token. However, truly accurate counts would need some form of word segmentation to determine which characters group into words.

average-sentences-tokens-R-.14.15.16.csv
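For illustration, here is a sketch of the difference between character-level and word-level token counts, assuming the `jieba` segmentation library (one common choice; not something the repo currently uses):

```python
# Sketch: character-level vs. word-level tokenization for Chinese.
# Assumes `pip install jieba`.
import jieba

sentence = '服务很好'            # "the service is very good"
chars = list(sentence)          # character-level: ['服', '务', '很', '好']
words = jieba.lcut(sentence)    # word-level, e.g. ['服务', '很', '好']
print(len(chars), len(words))   # the token counts differ
```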

farinamhz commented 1 year ago

Thank you, @Lillliant! Interesting, I did not know that! You can look into Chinese tokenization if you have time for this case.

hosseinfani commented 1 year ago

Hi @farinamhz and @Lillliant

Thank you very much for your clean and readable code. I am getting close to the end of my code refactoring :D Just 2-3 more days and I'm done :DD

Regarding the stats and metrics code:

I did some refactoring on metrics.py and distribution.py. Basically, I removed them :D and merged them into the Review class in review.py. The Review class accepts a pickle of reviews and generates the stats about the reviews and their back-translated versions (we don't need the intermediate translated versions) as well as the distributions:

https://github.com/fani-lab/LADy/blob/a2661fc8e1a070f8c04a3bb0a92a96460f1e9d6a/src/cmn/review.py#L135

For the back-translation metrics, I created a method on Review. That is, we ask a review to give us its back-translation metrics, and it kindly returns a dictionary of values:

https://github.com/fani-lab/LADy/blob/a2661fc8e1a070f8c04a3bb0a92a96460f1e9d6a/src/cmn/review.py#L82
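A hypothetical sketch of that pattern (the class shape and method name here are made up for illustration; the real method is at the link above):

```python
# Hypothetical sketch: a review that scores itself against its
# back-translation and returns the metrics as a dictionary.
from rouge_score import rouge_scorer

class Review:
    def __init__(self, text: str, backtranslated: str):
        self.text = text                      # original English review
        self.backtranslated = backtranslated  # back-translated version

    def backtranslation_metrics(self) -> dict:
        scorer = rouge_scorer.RougeScorer(['rougeL'])
        rouge_l = scorer.score(self.text, self.backtranslated)['rougeL']
        return {'exact_match': float(self.text == self.backtranslated),
                'rougeL': rouge_l.fmeasure}

print(Review('the food was great', 'the meal was great').backtranslation_metrics())
```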

Also, I put a main_stat.py driver in place to get the stats on all datasets, followed by an aggregation into a ../output/semeval+/stats.csv file for easy presentation in the paper:

https://github.com/fani-lab/LADy/blob/main/src/main_stat.py
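The driver pattern looks roughly like this (a sketch with hypothetical dataset names and placeholder stats; the real code is in src/main_stat.py):

```python
# Sketch of the driver: compute per-dataset stats, then aggregate to one CSV.
import pandas as pd

def stats_for(dataset: str) -> dict:
    # In the real driver this loads the dataset's review pickle and computes
    # stats (#sentences, avg #tokens, ...); placeholder values here.
    return {'dataset': dataset, 'n_sentences': 0, 'avg_tokens': 0.0}

rows = [stats_for(d) for d in ('semeval-14', 'semeval-15', 'semeval-16')]
pd.DataFrame(rows).to_csv('../output/semeval+/stats.csv', index=False)
```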

I need @Lillliant to do the following, please:

https://github.com/fani-lab/LADy/blob/a2661fc8e1a070f8c04a3bb0a92a96460f1e9d6a/src/cmn/review.py#L145
https://github.com/fani-lab/LADy/blob/a2661fc8e1a070f8c04a3bb0a92a96460f1e9d6a/src/cmn/review.py#L171
https://github.com/fani-lab/LADy/blob/a2661fc8e1a070f8c04a3bb0a92a96460f1e9d6a/src/cmn/review.py#L172

Thank you.

hosseinfani commented 1 year ago

@farinamhz @Lillliant I think we can close this issue. Let me know otherwise.