facebookresearch / flores

Facebook Low Resource (FLoRes) MT Benchmark

FLORES-101 benchmark and Alternative Spelling rules in some languages, using FLORES-101 for benchmarking embeddings models #26

Open Fikavec opened 3 years ago

Fikavec commented 3 years ago

Thanks for open-sourcing the FLORES-101 data set. While working with it, I noticed a feature that I wanted to share here. Some languages have alternative spelling rules, so some words have more than one accepted spelling. This is a well-known feature of German, Danish, and Swedish, and of the Traditional-to-Simplified Chinese conversion rules, etc. For example, the German alternative spelling rules: [image: alternative spelling rules] Consider sentence № 991 from the FLORES-101 dev set (deu.dev and eng.dev):
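For illustration, such a rule can be applied mechanically; here is a minimal sketch (the function name `normalize_de` and the restriction to the single ß → ss rule are my assumptions, not part of FLORES):

```python
# Minimal sketch: only the officially sanctioned "ss" for "ß" substitution
# is applied; real alternative-spelling tables cover more cases.
def normalize_de(text: str) -> str:
    # Handle the lowercase eszett and the capital eszett (added in 2017).
    return text.replace("ß", "ss").replace("ẞ", "SS")

print(normalize_de("Straße"))  # -> Strasse
```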

In this sentence: [image: alternative spelling example] Therefore the sentence:

is fully equivalent to sentence (1), but not for AI (I see this with many AI translation and embedding models):

  1. words != words in alternative spelling (Straße != Strasse) for AI [image: LASER word similarities]

  2. during training, the meaning of the context for the words Straße and Strasse gets distorted or lost

  3. for AI, after training on a CC + Wiki dataset that is unbalanced with respect to alternative spelling: Straße = street and Strasse != street, OR Straße != street and Strasse = street, OR Straße != street and Strasse != street [image: LASER word similarity variants]
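Point (1) can be illustrated without any pretrained model; a toy character-trigram bag embedding (my stand-in here, not LASER) already treats the two spellings as merely similar rather than identical:

```python
# Toy illustration: a character-trigram bag "embedding" scores the two
# spellings as similar but not identical, just as surface-form models do.
from collections import Counter
import math

def char_ngrams(word: str, n: int = 3) -> Counter:
    padded = f"<{word}>"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in set(a) | set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

sim = cosine(char_ngrams("Straße"), char_ngrams("Strasse"))
print(round(sim, 3))  # well below 1.0, i.e. "similar", not "the same word"
```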

How about extending FLORES-101 (or creating an additional dataset) with sentences in alternative spellings for the languages that have such rules, for benchmarking (and creating a metric to measure quality) in two cases:

  1. How equal words/sentences in alternative spellings are within one language (is Straße == Strasse for the model) for languages with alternative spelling rules
  2. How equal words/sentences in alternative spellings are in the cross-language case (as I showed above): German sentence == English sentence vs. alternatively spelled German sentence == English sentence
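A hedged sketch of what a metric for case (2) could look like; the name `spelling_consistency_gap` and the idea of an absolute gap are my proposal, and the vectors are assumed to come from any multilingual sentence encoder (LASER, LaBSE, ...):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def spelling_consistency_gap(vec_de, vec_de_alt, vec_en) -> float:
    """0.0 means the model scores both German spellings identically
    against the English reference; larger values mean worse alignment."""
    return abs(cosine(vec_de, vec_en) - cosine(vec_de_alt, vec_en))
```

Averaged over all FLORES sentences containing alternatively spellable words, this would quantify exactly the asymmetry shown in the screenshots above.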

How about extending FLORES-101 (or creating an additional dataset) with a metric or test cases to measure the quality of alignment of language spaces for embedding models like LASER, USE, etc., for tasks other than machine translation: multilingual AI tasks (scientific problems) like classification, similarity measurement, BUCC, and few-shot multilingual learning, as discussed there?

guzmanhe commented 3 years ago

Hi @Fikavec, thank you for such extensive and detailed feedback. I'm not an expert in alternative spelling, but I will consult some of our linguists to get a more informed opinion. However, do you have a sense of what the expectation of a reader is? For example, in a formal context, will people consider Straße as acceptable as Strasse? From your analysis it seems that one is more common than the other: the first appears to be more prevalent than the second (e.g. LASER trained on a parallel corpus prefers the first over the second).

For your specific proposals:

Fikavec commented 3 years ago

Thanks for your reply and your work, @guzmanhe. My experiments with LASER, USE, LaBSE, distiluse, XLM-R, and M2M100 show that the problem of alternative spelling rules has not yet been solved in AI training and evaluation. Alternative spelling rules are, in fact, just the replacement of some characters by others according to tables officially accepted in some countries, and no more than that. Words with an alternative spelling and without it are not synonyms or merely close words; they are 100% the same words. But modern multilingual AIs do not know this (and do not learn it at training time) and therefore make mistakes (lose quality) on sentences containing such words, whether they are embedding models, pre-trained language models, or translators, because they treat them as synonymous or similar words rather than 100% identical words obtained by replacing characters according to a table of rules. For people, however, the situation is different and no less interesting.

> However, do you have a sense of what is the expectation of a reader?

Using German as an example: is the reader a native speaker, or did they study it in another country? If a native speaker, are they German or Swiss, and are they old or young? Are they reading an old book or a modern newspaper? All of this matters when it comes to the expectations of the reader. From Wikipedia:

> In Swiss Standard German, "ss" usually replaces every "ß". This is officially sanctioned by the reformed German orthography rules, which state in §25 E2: "In der Schweiz kann man immer „ss“ schreiben" ("In Switzerland, one may always write 'ss'"). Liechtenstein follows the same practice.

Thus, a Swiss reader of a modern newspaper in German certainly does not expect to see Straße, etc. It seems to me that solving this problem for every language, together with linguists, will be the next step in the development of multilingual AI, and the first step could be creating a specialized test for this problem. In your opinion, should this problem be solved at the preprocessing/tokenization stage (replace all alternative spellings so that the AI model always receives words in only one spelling, both during training and during inference), via augmentation (to balance the training sample between words with and without alternative spellings), via methods related to the learning process (fine-tuning on alternative spelling rules), or via the network architecture (so that a model with special "alternative spelling layers" can learn the replacement tables and give the same outputs)?
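Of the options listed, the augmentation one is easy to sketch (the pair format and the restriction to the single ß → ss rule are my simplifications; a real implementation would use the full official tables):

```python
# Sketch: balance a parallel training sample by emitting both spellings
# of each German source sentence (only the ß -> ss rule is handled here).
def augment_spellings(pairs):
    out = []
    for src, tgt in pairs:
        out.append((src, tgt))
        if "ß" in src:
            out.append((src.replace("ß", "ss"), tgt))
    return out

augment_spellings([("Die Straße ist lang.", "The street is long.")])
```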

> You can use FLORES for evaluation of LASER, MUSE etc.

How does one use FLORES to evaluate multilingual embedding models (LASER, USE, LaBSE, distiluse)? Could you suggest a suitable metric for this?
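One plausible option (my suggestion, not an official FLORES protocol): since FLORES-101 is sentence-aligned across languages, you can embed both sides and measure a cross-lingual retrieval error rate, where for each source sentence the nearest target-language neighbour by cosine similarity should be its aligned translation:

```python
import numpy as np

def retrieval_error_rate(src_embs: np.ndarray, tgt_embs: np.ndarray) -> float:
    """Fraction of source sentences whose nearest target-language
    neighbour (by cosine similarity) is NOT the aligned translation.
    Rows of both matrices are assumed to be sentence-aligned."""
    src = src_embs / np.linalg.norm(src_embs, axis=1, keepdims=True)
    tgt = tgt_embs / np.linalg.norm(tgt_embs, axis=1, keepdims=True)
    nearest = np.argmax(src @ tgt.T, axis=1)
    return float(np.mean(nearest != np.arange(len(src_embs))))
```

Running this once on the original German side and once on the alternatively spelled German side (against the same English embeddings) would give exactly the comparison proposed earlier in this thread.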