This repository explores different methods for similarity search across multiple languages.
We compare the models below on the publicly available newstest2012 test set from WMT 2012.
For each sentence in the source language, we find the closest sentence in the target language in the joint embedding space. If that sentence has the same index in the file, the match counts as correct; otherwise it counts as an error. The N-way parallel corpus should therefore not contain duplicates.
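The evaluation described above can be sketched in a few lines of NumPy. This is only an illustration: the function name `error_rate`, the toy arrays, and the use of cosine similarity as the notion of "closest" are assumptions, not code taken from this repository.

```python
import numpy as np


def error_rate(src_emb: np.ndarray, tgt_emb: np.ndarray) -> float:
    """Fraction of source sentences whose nearest target has a different index."""
    # L2-normalise so that the dot product equals cosine similarity (assumed metric).
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T                    # (n_src, n_tgt) similarity matrix
    nearest = sim.argmax(axis=1)         # index of the closest target sentence
    expected = np.arange(src.shape[0])   # parallel corpus: same line index
    return float((nearest != expected).mean())


# Toy data: 3 "sentences" per language, 3-dimensional embeddings (made up).
src = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
tgt = np.array([[0.9, 0.1, 0.0], [0.1, 0.9, 0.5], [1.0, 0.0, 0.2]])
print(f"{error_rate(src, tgt):.2%}")
```

In the toy data, the third source sentence is closer to the second target sentence than to its own translation, so one of the three sentences counts as an error.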
langs | cs | de | en | es | fr | avg |
---|---|---|---|---|---|---|
cs | 0.00% | 0.70% | 0.90% | 0.67% | 0.77% | 0.76% |
de | 0.83% | 0.00% | 1.17% | 0.90% | 1.03% | 0.98% |
en | 0.93% | 1.27% | 0.00% | 0.83% | 1.07% | 1.02% |
es | 0.53% | 0.77% | 0.97% | 0.00% | 0.57% | 0.71% |
fr | 0.50% | 0.90% | 1.13% | 0.60% | 0.00% | 0.78% |
avg | 0.70% | 0.91% | 1.04% | 0.75% | 0.86% | 1.06% |
Czech is currently not covered by sentence-transformers, and more experiments are planned for a thorough comparison. The following table therefore gives a better view of the performance sentence-transformers can achieve with 'distiluse-base-multilingual-cased' as the base model:
langs | cs | de | en | es | fr | avg |
---|---|---|---|---|---|---|
cs | NA | NA | NA | NA | NA | NA |
de | NA | 0.00% | 1.40% | 1.60% | 1.53% | TODO |
en | NA | 1.47% | 0.00% | 1.33% | 0.97% | TODO |
es | NA | 1.80% | 1.10% | 0.00% | 1.03% | TODO |
fr | NA | 1.73% | 0.97% | 1.07% | 0.00% | TODO |
avg | NA | TODO | TODO | TODO | TODO | TODO |
Sentence-transformers outperforms LASER on the pairs (French, English) and (English, French).
On all other pairs, LASER performs better.
Russian is also included in our analysis. LASER has not published results for Russian in its repository. We plan to reproduce their experiments and include Russian for a rigorous comparison.
langs | de | en | es | fr | ru | avg |
---|---|---|---|---|---|---|
de | 0.00% | 1.40% | 1.60% | 1.53% | 2.60% | 1.78% |
en | 1.47% | 0.00% | 1.33% | 0.97% | 0.50% | 1.07% |
es | 1.80% | 1.10% | 0.00% | 1.03% | 2.13% | 1.52% |
fr | 1.73% | 0.97% | 1.07% | 0.00% | 2.00% | 1.44% |
ru | 2.76% | 0.50% | 2.23% | 2.03% | 0.00% | 1.88% |
avg | 1.94% | 0.99% | 1.56% | 1.39% | 1.81% | 1.54% |
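The avg entries in the table above are consistent with averaging each row or column over its off-diagonal cells, since a language is never compared with itself. This reading is inferred from the numbers rather than from the scripts, and the helper name `off_diagonal_averages` is hypothetical. A minimal sketch:

```python
import numpy as np


def off_diagonal_averages(errors: np.ndarray):
    """Row, column and overall averages of a square matrix, ignoring the diagonal."""
    n = errors.shape[0]
    mask = ~np.eye(n, dtype=bool)             # True everywhere except the diagonal
    masked = np.where(mask, errors, 0.0)
    row_avg = masked.sum(axis=1) / (n - 1)
    col_avg = masked.sum(axis=0) / (n - 1)
    overall = float(row_avg.mean())           # mean of the per-language row averages
    return row_avg, col_avg, overall


# Error rates (in %) copied from the table above; order: de, en, es, fr, ru.
errors = np.array([
    [0.00, 1.40, 1.60, 1.53, 2.60],
    [1.47, 0.00, 1.33, 0.97, 0.50],
    [1.80, 1.10, 0.00, 1.03, 2.13],
    [1.73, 0.97, 1.07, 0.00, 2.00],
    [2.76, 0.50, 2.23, 2.03, 0.00],
])
row_avg, col_avg, overall = off_diagonal_averages(errors)
print(np.round(row_avg, 2), np.round(col_avg, 2), round(overall, 2))
```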
We can also use zero-shot encoding to include Czech (cs).
The results are shown below:
langs | cs | de | en | es | fr | ru | avg |
---|---|---|---|---|---|---|---|
cs | 0.00% | 39.46% | 37.63% | 38.20% | 38.99% | 41.69% | 39.19% |
de | 36.10% | 0.00% | 1.40% | 1.60% | 1.53% | 2.60% | 8.64% |
en | 34.17% | 1.47% | 0.00% | 1.33% | 0.97% | 0.50% | 7.69% |
es | 35.36% | 1.80% | 1.10% | 0.00% | 1.03% | 2.13% | 8.29% |
fr | 35.96% | 1.73% | 0.97% | 1.07% | 0.00% | 2.00% | 8.34% |
ru | 37.96% | 2.76% | 0.50% | 2.23% | 2.03% | 0.00% | 9.10% |
avg | 35.91% | 9.44% | 8.32% | 8.88% | 8.91% | 9.78% | 13.54% |
With mean pooling, MAX_LEN = 100:
Confusion matrix:
langs | cs | de | en | es | fr | avg |
---|---|---|---|---|---|---|
cs | 0.00% | 84.85% | 87.71% | 86.85% | 88.71% | 87.03% |
de | 75.06% | 0.00% | 64.20% | 70.36% | 69.13% | 69.69% |
en | 74.16% | 54.28% | 0.00% | 24.84% | 29.14% | 45.60% |
es | 75.99% | 81.62% | 40.86% | 0.00% | 50.95% | 62.35% |
fr | 90.08% | 78.16% | 37.53% | 56.88% | 0.00% | 65.66% |
avg | 78.82% | 74.73% | 57.58% | 59.73% | 59.48% | 66.07% |
With mean pooling, MAX_LEN = 50:
Confusion matrix:
langs | cs | de | en | es | fr | avg |
---|---|---|---|---|---|---|
cs | 0.00% | 84.18% | 85.98% | 82.65% | 89.08% | 85.47% |
de | 74.79% | 0.00% | 61.51% | 65.23% | 67.27% | 67.20% |
en | 75.72% | 55.04% | 0.00% | 26.71% | 31.60% | 47.27% |
es | 78.39% | 81.75% | 37.40% | 0.00% | 55.54% | 63.27% |
fr | 75.36% | 63.20% | 35.16% | 34.33% | 0.00% | 52.01% |
avg | 76.07% | 71.05% | 55.01% | 52.23% | 60.87% | 63.05% |
With cls pooling, MAX_LEN = 100:
Confusion matrix:
langs | cs | de | en | es | fr | avg |
---|---|---|---|---|---|---|
cs | 0.00% | 89.84% | 98.07% | 89.88% | 91.94% | 92.43% |
de | 72.86% | 0.00% | 88.68% | 71.33% | 73.63% | 76.62% |
en | 81.55% | 64.47% | 0.00% | 42.99% | 51.85% | 60.21% |
es | 77.46% | 78.82% | 85.55% | 0.00% | 60.11% | 75.48% |
fr | 75.42% | 76.76% | 83.82% | 50.18% | 0.00% | 71.55% |
avg | 76.82% | 77.47% | 89.03% | 63.59% | 69.38% | 75.26% |
With cls pooling, MAX_LEN = 50:
Confusion matrix:
langs | cs | de | en | es | fr | avg |
---|---|---|---|---|---|---|
cs | 0.00% | 91.51% | 98.30% | 91.04% | 93.81% | 93.66% |
de | 75.02% | 0.00% | 89.68% | 73.16% | 76.92% | 78.70% |
en | 82.78% | 67.87% | 0.00% | 49.32% | 56.51% | 64.12% |
es | 78.69% | 81.68% | 86.38% | 0.00% | 64.84% | 77.90% |
fr | 76.72% | 80.22% | 85.18% | 54.18% | 0.00% | 74.08% |
avg | 78.31% | 80.32% | 89.89% | 66.92% | 73.02% | 77.69% |
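The two pooling strategies compared above differ in how per-token vectors are collapsed into one sentence vector. The sketch below shows the usual definitions with NumPy. Whether the scripts in this repository mask padding tokens in exactly this way is an assumption, and the function names are illustrative.

```python
import numpy as np


def mean_pooling(token_emb: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors, counting only real (non-padding) tokens."""
    mask = attention_mask[:, :, None].astype(token_emb.dtype)  # (batch, seq, 1)
    summed = (token_emb * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)             # avoid division by zero
    return summed / counts


def cls_pooling(token_emb: np.ndarray) -> np.ndarray:
    """Use the vector of the first token as the sentence embedding."""
    return token_emb[:, 0, :]


# Toy batch: 1 sentence, 3 token positions (the last one is padding), 2 dimensions.
token_emb = np.array([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]])
attention_mask = np.array([[1, 1, 0]])
print(mean_pooling(token_emb, attention_mask))  # averages the first two tokens only
print(cls_pooling(token_emb))                   # first token only
```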
With MAX_LEN = 100 (in our experiments, higher values gave better results):
langs | cs | de | en | es | fr | ru | avg |
---|---|---|---|---|---|---|---|
cs | 0.00% | 74.46% | 74.63% | 71.86% | 47.45% | 42.72% | 62.22% |
de | 48.42% | 0.00% | 32.83% | 35.23% | 35.30% | 40.93% | 38.54% |
en | 53.28% | 37.70% | 0.00% | 22.91% | 25.57% | 24.18% | 32.73% |
es | 40.26% | 48.68% | 19.35% | 0.00% | 25.87% | 36.56% | 34.15% |
fr | 65.73% | 50.35% | 20.85% | 37.30% | 0.00% | 31.73% | 41.19% |
ru | 41.56% | 42.09% | 21.51% | 34.23% | 34.73% | 0.00% | 34.83% |
avg | 49.85% | 50.66% | 33.83% | 40.31% | 33.79% | 35.22% | 40.61% |
Without Russian (for easier comparison with the baseline):
langs | cs | de | en | es | fr | avg |
---|---|---|---|---|---|---|
cs | 0.00% | 74.46% | 74.63% | 71.86% | 47.45% | 67.10% |
de | 48.42% | 0.00% | 32.83% | 35.23% | 35.30% | 37.95% |
en | 53.28% | 37.70% | 0.00% | 22.91% | 25.57% | 34.87% |
es | 40.26% | 48.68% | 19.35% | 0.00% | 25.87% | 33.54% |
fr | 65.73% | 50.35% | 20.85% | 37.30% | 0.00% | 43.56% |
avg | 51.92% | 52.80% | 36.91% | 41.82% | 33.55% | 43.40% |
@TODO: CHECK WHY XLM-R shows such poor performance
Mean Pooling Strategy is the strategy with the best performance. Refer to this issue for a comparison with the CLS Pooling Strategy.
Input: MAX_LEN = 40, Mean Pooling Strategy
sh similarity_XLM-R_batch.sh 40 mean True
Output:
Confusion matrix:
langs | cs | de | en | es | fr | avg |
---|---|---|---|---|---|---|
cs | 0.00% | 91.31% | 97.64% | 98.37% | 94.34% | 95.41% |
de | 93.84% | 0.00% | 88.05% | 92.87% | 95.40% | 92.54% |
en | 91.21% | 77.79% | 0.00% | 72.56% | 94.77% | 84.08% |
es | 95.47% | 93.14% | 61.84% | 0.00% | 90.28% | 85.18% |
fr | 91.97% | 81.25% | 77.12% | 71.96% | 0.00% | 80.58% |
avg | 93.12% | 85.87% | 81.16% | 83.94% | 93.70% | 87.56% |
Input: MAX_LEN = 100, Mean Pooling Strategy
sh similarity_XLM-R_batch.sh 100 mean True
Output:
Confusion matrix:
langs | cs | de | en | es | fr | avg |
---|---|---|---|---|---|---|
cs | 0.00% | 91.24% | 97.77% | 97.80% | 92.11% | 94.73% |
de | 94.31% | 0.00% | 86.55% | 92.44% | 94.67% | 91.99% |
en | 91.08% | 74.93% | 0.00% | 71.56% | 77.69% | 78.81% |
es | 95.37% | 88.38% | 58.54% | 0.00% | 90.38% | 83.17% |
fr | 92.71% | 91.58% | 87.41% | 89.14% | 0.00% | 90.21% |
avg | 93.36% | 86.53% | 82.57% | 87.74% | 88.71% | 87.78% |
In the following, we show how to replicate such results.
For Linux users: Use script wmt.sh
sh wmt.sh
We assume that tools such as torch and tqdm are already installed, which is the case on Google Colab (cf. examples, coming soon).
Use script install-tools.sh
sh install-tools.sh
Preprocessing the data prior to encoding lowered performance in our experiments, so we do not recommend it. If you still want to preprocess the data, use the prepare-data-wmt bash script. Some small changes are then needed so that the input file can be found in the proper directory. We may add an --input_file_name argument in the future.
Use script prepare-data-wmt.sh
sh prepare-data-wmt.sh
At the moment, sentence-transformers with "distiluse-base-multilingual-cased" only covers Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, Turkish.
Czech is therefore not covered. We still report the errors for Czech for the sake of experimenting with pure zero-shot learning. However, for comparison with the baseline, we focus on the other language pairs.
sh similarity_sentenceBERT.sh
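For readers who prefer to see the encoding step in Python, here is a minimal sketch using the public sentence-transformers API (`SentenceTransformer(...).encode`). It is not the content of similarity_sentenceBERT.sh. The helper names and the file names in the usage comment are placeholders.

```python
from pathlib import Path
from typing import List


def read_sentences(path: str) -> List[str]:
    """Read one sentence per line, keeping line order (the index is the label)."""
    return Path(path).read_text(encoding="utf-8").splitlines()


def encode_file(path: str, model_name: str = "distiluse-base-multilingual-cased"):
    """Encode every line of `path`; returns an array of shape (n_sentences, dim)."""
    # Imported lazily so read_sentences works without the library installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)   # downloads the model on first use
    return model.encode(read_sentences(path))


# Hypothetical usage (file names are placeholders, not the repository's layout):
# en = encode_file("newstest2012.en")
# fr = encode_file("newstest2012.fr")
```

The two resulting arrays can then be compared line by line, since sentence i in one file is the translation of sentence i in the other.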
In this section, we still need to update the source code to make it faster. At the moment, we iterate over each line and build the encoding on the fly, which takes too much time.
MAX_LEN is an integer parameter giving the number of tokens to consider when encoding.
Sentences with fewer than MAX_LEN tokens are zero-padded.
For sentences with more than MAX_LEN tokens, all tokens after index MAX_LEN are ignored.
Default value is 40. Based on prior statistics on preprocessed data for French, we have:
We recommend increasing MAX_LEN for experiments. A larger MAX_LEN gives better performance but slower computation.
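The effect of MAX_LEN on a single tokenized sentence can be sketched as follows. The function name `pad_or_truncate` is hypothetical, and `pad_id=0` simply follows the "zero-padding" wording above. The actual padding index used by the model's tokenizer may differ.

```python
from typing import List


def pad_or_truncate(token_ids: List[int], max_len: int = 40, pad_id: int = 0) -> List[int]:
    """Return exactly `max_len` ids: cut long inputs, zero-pad short ones."""
    clipped = token_ids[:max_len]                          # drop tokens beyond MAX_LEN
    return clipped + [pad_id] * (max_len - len(clipped))   # fill the rest with padding


print(pad_or_truncate([5, 8, 13], max_len=5))              # short sentence is padded
print(pad_or_truncate([5, 8, 13, 21, 34, 55], max_len=5))  # long sentence is cut
```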
sh similarity_XLM.sh MAX_LEN
In this section, we still need to update the source code to make it faster. At the moment, we iterate over each line and build the encoding on the fly, which takes too much time.
Parameters:
sh similarity_XLM-R.sh MAX_LEN POOLING_STRAT
Batch version:
sh similarity_XLM-R_batch.sh MAX_LEN POOLING_STRAT GPU
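Encoding line by line is slow mainly because the model is called once per sentence. A generic way to avoid this is to group sentences into fixed-size batches and encode each batch in one call. The sketch below shows only the grouping step. It is not taken from similarity_XLM-R_batch.sh, and the name `batches` and the default batch size are assumptions.

```python
from typing import Iterable, Iterator, List


def batches(sentences: Iterable[str], batch_size: int = 32) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `batch_size` sentences."""
    batch: List[str] = []
    for sentence in sentences:
        batch.append(sentence)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:                       # last, possibly smaller, batch
        yield batch


for group in batches(["a", "b", "c", "d", "e"], batch_size=2):
    print(group)
```

Because the groups are yielded in order, the encoded rows can be concatenated and still line up with the original line indices, which the evaluation relies on.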