N-gram matching score across languages

Akella17 commented 5 years ago

@glample @aconneau Is there any empirical result that determines the quality of cross-lingual embeddings generated by the XLM model without estimation of additional parameters or finetuning? Like maybe some degree of correlation with a human-annotated score or a standard evaluation metric.

I tried to check the performance of XLM for evaluating text generation by following the steps described in BertScore (https://arxiv.org/abs/1904.09675). While the n-gram score of XLM15 is consistent with BERT's scores for monolingual case (mt_output and reference), the same experiment with cross-lingual (source and mt_output) n-gram scores has near zero correlation with the human-annotated score.

The success of XLM at various cross-lingual tasks like XNLI, UNMT, etc might mean that there must exist some kind of non-linear relationship between the XLM embeddings of different languages (since linear correlation is nearly 0). What is your opinion on this and is there any way to use XLM as a cross-lingual scoring metric?

aconneau commented 5 years ago

@Akella17 : I haven't read the paper you mention yet, but have you tried some Hungarian algorithm on the hidden states of the (say) the French hidden states and the English hidden states? It's possible that there exists a nonlinear mapping and the existence of such a mapping is an interesting question imo.. Having something similar to Procrustes (linear mapping from source to target) that we had for word embeddings is to me highly unlikely.

aconneau commented 5 years ago

There was a bug in your other issue #99, so it's also possible that the zero-correlation comes from that bug.

Akella17 commented 5 years ago

Even I feel that it is highly unlikely for a linear mapping to exist between cross-lingual embeddings. That is exactly why I wanted to try finetuning in the first place. Since regression-based fine-tuning allows us to learn a non-linear CLS representation that is a function of all the layer representation (except the final layer), it is likely that this model can represent such cross-lingual relationships (at least BERT regression gives good results for monolingual (mt_output and reference) case).

Hey, can you mention what was the bug is in #99 issue? I have shared the code for the #99 issue so that you can reproduce the error. Setting --ref_flag to True signals that you are passing the reference instead of the source sentence.

aconneau commented 5 years ago

I'll look at the #99 issue asap!

facebookresearch / XLM

N-gram matching score across languages #111