cdli-gh / Semi-Supervised-NMT-for-Sumerian-English

Exploring the Limits of Low-Resource Neural Machine Translation
MIT License

consolidate evaluation #17

Closed chiarcos closed 2 years ago

chiarcos commented 4 years ago

At the moment, every MTAAC/CDLI MT system is evaluated independently, so it is impossible to track progress across systems.

For example, Rachit's (2020) line "mu usz-bar x 2(disz) tug2 usz-bar tur" seems to correspond to two independent (!) lines in Ravneet's (2019) system:

544, mu ucbar X
542, NUMB tug ucbar tur sumun

But it is likely that these are actually completely different texts (and that there is no overlap for the phrase "ucbar tur" / "usz-bar tur" in the two data sets), because "sumun" does not occur in Rachit's text. In that case, the systems are simply incomparable.

Establish a consistent train/test set and replicate.
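
As a first step, the consolidated evaluation could be as simple as scoring every system's detokenized output against the same held-out reference file, e.g. with sacreBLEU. A minimal sketch (the file names and system labels below are placeholders, not actual paths in this repository):

```python
# Hypothetical sketch: score every system's output against the SAME
# held-out test set so the BLEU numbers become comparable.
# File names and system labels are placeholders, not repository paths.
import sacrebleu

SYSTEMS = {
    "seq2seq_2019": "outputs/seq2seq_2019.en",
    "transformer_2020": "outputs/transformer_2020.en",
    "xlm_mlm_tlm": "outputs/xlm_mlm_tlm.en",
}

# One reference translation per line, aligned with the system outputs.
with open("data/test.en", encoding="utf-8") as f:
    references = [line.strip() for line in f]

for name, path in SYSTEMS.items():
    with open(path, encoding="utf-8") as f:
        hypotheses = [line.strip() for line in f]
    # sacrebleu expects a list of hypotheses and a list of reference streams
    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    print(f"{name}: BLEU = {bleu.score:.2f}")
```

With a fixed test set like this, the numbers for the 2019, the 2020 and the SMT baseline systems would at least be directly comparable.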

RachitBansal commented 4 years ago

Hi @chiarcos,

You are absolutely right that these systems are incomparable. @RavneetDTU and I discussed this in the initial phases of the project, and we decided that the change was necessary.

The data used in 2019 is quite different from what we prepared this time, and this change was required in order to build better systems. The preprocessing is different as well, and we have tested various configurations of sentence structure.

Thus, we are not attempting to compare them, and have not done so in the publication either (the draft of which I shared with you).

chiarcos commented 4 years ago

Hi Rachit,

Maybe the easiest thing is to run the old system over the new data.

In the overall context of CDLI, it would be good to compare, and a practical application would be to suggest multiple candidate translations to the editor of an ATF file. At the moment, it seems machine translations would have to be moderated anyway.

In this scenario, the old models remain relevant (including the 2015 SMT baseline). From the examples I looked at, they sometimes do quite well. For the example sentence I sent, the seq2seq baseline produced "year : “ year ”…” . 2 “ ušbar ” garments ( s )". This is effectively better than any other translation, because it simply kept the untranslatable part untranslated. (All systems are led astray by "mu" at the beginning, but some of the new models do pretty wild things: XLM_MLM_TLM just generated a full year name, "year the martu wall was erected usuen strong king", and 10xAugmentedPredsXLM got creative: "the weaver sorcerer an oath incantation".) (BTW: for this example, the btPreds models are OK-ish here, too.)

chiarcos commented 4 years ago

The official train/dev/test split for the parallel data is under
https://github.com/cdli-gh/mtaac_cdli_ur3_corpus/blob/master/ur3_corpus_data/corpus_split_translated_20180514-125709.json

The official train/dev/test split for all data is under
https://github.com/cdli-gh/mtaac_cdli_ur3_corpus/blob/master/ur3_corpus_data/corpus_split_20180418-225438.json
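
I don't know the exact layout of these JSON files offhand, but assuming they map split names to lists of CDLI text IDs, partitioning the corpus could look roughly like this (the keys "train"/"dev"/"test" are an assumption; please check against the actual files):

```python
# Hypothetical sketch for partitioning the corpus with the official split file.
# ASSUMPTION: the JSON maps split names ("train"/"dev"/"test") to lists of
# CDLI text IDs; the real keys/structure may differ, check the actual file.
import json

with open("corpus_split_translated_20180514-125709.json", encoding="utf-8") as f:
    split = json.load(f)

train_ids = set(split.get("train", []))
dev_ids = set(split.get("dev", []))
test_ids = set(split.get("test", []))

def split_of(text_id):
    """Return which official split a CDLI text ID belongs to, or None."""
    if text_id in train_ids:
        return "train"
    if text_id in dev_ids:
        return "dev"
    if text_id in test_ids:
        return "test"
    return None
```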

RavneetDTU commented 4 years ago

Dear all,

Last year we applied a Transformer model to the dataset and got a BLEU score of around 19.

This year we also used the Transformer model, with the same configuration (6 layers), but the dataset is quite different, so with the same architecture and configuration the results change a lot.

Overall, the architecture is the same this year too.
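
For reference, by "6 layer" I mean the standard Transformer base shape, roughly as in this generic PyTorch illustration (this is not our actual training code, just the configuration we refer to):

```python
# Generic illustration of a 6-layer encoder/decoder Transformer ("base" size).
# NOT the project's training code; only shows the model shape referred to above.
import torch.nn as nn

model = nn.Transformer(
    d_model=512,           # embedding / hidden size
    nhead=8,               # attention heads
    num_encoder_layers=6,  # 6 encoder layers
    num_decoder_layers=6,  # 6 decoder layers
    dim_feedforward=2048,  # feed-forward inner size
    dropout=0.1,
)
```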
