JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0

LaBSE Sentence Embeddings model output vectors are not equal to original #2846

Closed. Fikavec closed this issue 2 years ago.

Fikavec commented 3 years ago

I'm using the embeddings from the example at https://nlp.johnsnowlabs.com/2020/09/23/labse.html, and the output vectors, although close, are not equal to the original vectors from https://tfhub.dev/google/LaBSE/1. Why? How were the original vectors converted into this model? Maybe the original model was modified or fine-tuned, or Spark NLP uses a different normalization? On example multiclass/multilabel tasks, the Spark NLP LaBSE embeddings behave differently from the original vectors; how can I get the original model's vectors from Spark NLP? The original model is case sensitive, but in the Spark NLP config this model is case insensitive; the original model has max_seq_length = 64, but in Spark NLP it is 128. Are there any other differences?

To reproduce the problem: run the code from the original model's page with ["I love NLP", "Many thanks"] and compare the outputs to those from the Spark NLP model page. Also, how can I open the Spark NLP model with tf.saved_model.load after unzipping the pb file, and run inference in TensorFlow, for a "low-level" comparison of the outputs to analyze the problem (TensorFlow 2.x can't load the unzipped bert_sentence_tensorflow model)? Maybe the difference/problem occurs in the converted TF model?

[Image: difference between the two models' output vectors]

Notebook to reproduce: Compare_outputs_of_Spark_nlp_LaBSE_embeddings_and_Original_TF_hub_LaBSE_embeddings.zip
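
For the "low-level" comparison, something along these lines may work (a hedged sketch: the ./labse_tf path and the "serve" tag are assumptions about the unzipped Spark NLP model; saved_model_cli can confirm the actual tags and signatures):

```python
import tensorflow as tf
import tensorflow_hub as hub

# Original TF-Hub LaBSE v1: loadable directly under TF 2.x.
original = hub.load("https://tfhub.dev/google/LaBSE/1")
print(list(original.signatures.keys()))  # may be empty for a Keras-style SavedModel

# The converted Spark NLP graph is a TF1-style SavedModel, which is why a
# plain tf.saved_model.load fails; the TF1 loader still works under TF 2.x.
# Run `saved_model_cli show --dir ./labse_tf --all` first to see the real tags.
with tf.compat.v1.Session(graph=tf.Graph()) as sess:
    meta_graph = tf.compat.v1.saved_model.loader.load(
        sess, ["serve"], "./labse_tf")
    # Print input/output tensor names so both graphs can be fed identical
    # token ids and their raw outputs compared with np.allclose.
    for name, sig in meta_graph.signature_def.items():
        print(name, list(sig.inputs), list(sig.outputs))
```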

Fikavec commented 3 years ago

We can see differences in the embeddings for example sentences that all mean the same thing, taken from the Spark NLP 20-language multilingual classification demo:

  1. Science has advanced rapidly over the last century
  2. Die Wissenschaft hat im letzten Jahrhundert rasante Fortschritte gemacht
  3. 在上个世纪,科学发展迅速
  4. Die wetenskap het die afgelope eeu vinnig gevorder
  5. Khoa học đã phát triển nhanh chóng trong thế kỷ qua
  6. 科学は前世紀にわたって急速に進歩しました
  7. Isayensi ithuthuke ngokushesha ngekhulu leminyaka elidlule
  8. Bilim, geçen yüzyılda hızla ilerledi
  9. המדע התקדם במהירות במהלך המאה האחרונה
  10. గత శతాబ్దంలో సైన్స్ వేగంగా అభివృద్ధి చెందింది
  11. Наука стремительно развивалась за последнее столетие
  12. سائنس گذشتہ صدی کے دوران تیزی سے ترقی کرچکی ہے
  13. विज्ञान पिछली सदी में तेजी से आगे बढ़ा है
  14. Соңгы гасырда фән тиз үсә
  15. La science a progressé rapidement au cours du siècle dernier
  16. วิทยาศาสตร์ก้าวหน้าอย่างรวดเร็วในช่วงศตวรรษที่ผ่านมา
  17. វិទ្យាសាស្ត្របានជឿនលឿនយ៉ាងលឿនក្នុងរយៈពេលមួយសតវត្សចុងក្រោយនេះ
  18. וויסנשאַפֿט איז ראַפּאַדלי אַוואַנסירטע איבער די לעצטע יאָרהונדערט
  19. Илим акыркы кылымда тездик менен өнүккөн
  20. கடந்த நூற்றாண்டில் அறிவியல் வேகமாக முன்னேறியுள்ளது

Spark NLP LaBSE embeddings cosine similarity heatmap:
[Image: spark_cossim]

Original TF-Hub LaBSE embeddings cosine similarity heatmap, with do_lower_case = False and max_seq_length = 128:
[Image: labse_cossim]

Colab code to reproduce: visual_multi_lingual_compare_outputs_of_Spark_nlp_LaBSE_embeddings_and_Original_TF_hub_LaBSE_embeddings.zip
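
For reference, either heatmap can be regenerated from an embedding matrix with a few lines (a sketch assuming the embeddings land in plain numpy arrays of shape [20, 768]; the variable names are hypothetical):

```python
import numpy as np
import matplotlib.pyplot as plt

def cosine_heatmap(emb, title):
    """Plot the pairwise cosine-similarity matrix of an [n, dim] array."""
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = normed @ normed.T
    plt.imshow(sim, vmin=0.0, vmax=1.0)
    plt.title(title)
    plt.colorbar()
    plt.show()
    return sim

# spark_emb and tfhub_emb are the [20, 768] matrices from the two pipelines:
# cosine_heatmap(spark_emb, "Spark NLP LaBSE")
# cosine_heatmap(tfhub_emb, "Original TF-Hub LaBSE")
```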

As we can see, the maximum difference between Spark NLP and the original TF-Hub model is on the Vietnamese sentence (index 4 in the heatmap). This means that the difference between the Spark NLP and original models is not only in max_seq_length and lowercasing, but also in the tokenization process and elsewhere (use of tf.float64, zero-padding of short sequences, ...)?

Fikavec commented 3 years ago

To simplify debugging, attached are a CSV of sentences sorted by the distance between the Spark NLP embeddings and the original TF-Hub embeddings, and a Colab notebook to reproduce it: compute_sentence_difference_distance.zip
[Image: diff]
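
The CSV can be regenerated roughly like this (a sketch; spark_emb, tfhub_emb, and sentences are hypothetical names for the outputs of the notebooks above):

```python
import numpy as np
import pandas as pd

# L2-normalize both sets, measure how far each Spark NLP vector is from its
# TF-Hub counterpart, and sort the sentences by that distance (worst first).
a = spark_emb / np.linalg.norm(spark_emb, axis=1, keepdims=True)
b = tfhub_emb / np.linalg.norm(tfhub_emb, axis=1, keepdims=True)
dist = np.linalg.norm(a - b, axis=1)

pd.DataFrame({"sentence": sentences, "distance": dist}) \
  .sort_values("distance", ascending=False) \
  .to_csv("sentence_difference_distance.csv", index=False)
```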

Fikavec commented 3 years ago

With tf.float64 enabled, max_seq_length = 128, and input sentences lowercased before being fed to the original TF-Hub model (lowercasing the input sentences themselves; don't touch the do_lower_case option in the original model's example code, because with do_lower_case=True the model produces much worse embeddings than with pre-lowercased input), on the tasks from multi_class_text_classification and the test examples from the "Model with 20 languages!" section, the model can achieve the following results (results are not stable from run to run because of train_test_split without a fixed random seed, model initialization, etc.):

Maybe the main difference from the original model is in Spark NLP's multilingual tokenization procedure, and the embedding difference appears only when input sentences are in certain languages: Vietnamese, Japanese, Urdu, ...?

maziyarpanahi commented 3 years ago

These are very interesting findings, thanks for sharing them. A couple of things come to my mind:

Some explanations:

The case sensitivity in Spark NLP is defined during runtime, depending on the param being set by the user it will lowercase already tokenized text or leave it as it is. Then it tries to tokenize the tokens by BERT tokenizer to encode the piece ids. So you can change it depending on the model if it supports cased or uncased. (you can test both these scenarios)

Some strong possibilities:

I suspect the +2/-2 swing in the English F1-score is caused by the custom tokenization (maybe use RegexTokenizer for simple whitespace tokenization instead of Tokenizer), and the big difference in the multilingual F1-score is caused by poor multilingual tokenization (to be confirmed, if you can compute the metrics per language, to compare how good or bad they are in Spark NLP).
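
If someone wants to test the whitespace hypothesis, the swap would look roughly like this in the classification pipeline (a sketch; the column names are assumptions):

```python
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import RegexTokenizer

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Plain whitespace splitting instead of the default rule-based Tokenizer,
# to see whether the custom tokenization rules cause the F1 swing.
whitespace_tokenizer = RegexTokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token") \
    .setPattern("\\s+")
```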

Will continue debugging since this is a very useful multi-lingual embedding.

maziyarpanahi commented 3 years ago

Just tagging you @C-K-Loan as an FYI, to confirm whether LaBSE's poor performance is due to bad BERT multilingual tokenization. (We always used English, so we might never have noticed.)

Fikavec commented 3 years ago

Thanks for your reply. Regarding this point:

to be confirmed if you can do the metrics by language to compare how good and bad they are compared to Spark NLP

In order to avoid such hard-to-find differences and inaccuracies when integrating multilingual models in the future, and to assess the quality of multilingual embeddings, I propose two methods:

1) Carefully prepare a few groups of sentences that mean the same thing in 100+ languages (the sentences should be simple and unambiguous, without modern vocabulary, hand-picked from texts or from native speakers, not taken from an automatic translator). For each group, draw a cosine heatmap for visual comparison, as I did here. The heatmaps of the original and integrated models can be compared visually to detect differences in particular languages. At the same time, for a quality multilingual embedding of a simple, unambiguous sentence, all squares should be close to 1 (and lighter).

2) Build automated tests that compute the distances from the original model's embeddings to the converted/integrated model's embeddings on sentences in 100+ languages (those supported by the integrated model), as I did here. For example, see the sketch below:
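
A sketch of such an automated check (the threshold and function are illustrative, not a proposed API):

```python
import numpy as np

def check_integration(langs, original_emb, converted_emb, max_dist=0.05):
    """Flag languages where the converted model drifts from the original.

    langs:         list of language codes, one per row
    original_emb:  [n_langs, dim] embeddings from the reference model
    converted_emb: [n_langs, dim] embeddings from the integrated model
    """
    a = original_emb / np.linalg.norm(original_emb, axis=1, keepdims=True)
    b = converted_emb / np.linalg.norm(converted_emb, axis=1, keepdims=True)
    dist = np.linalg.norm(a - b, axis=1)
    failures = sorted(
        [(lang, d) for lang, d in zip(langs, dist) if d > max_dist],
        key=lambda pair: -pair[1])
    for lang, d in failures:
        print(f"FAIL {lang}: distance {d:.4f} exceeds {max_dist}")
    return not failures
```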

I think these techniques will be useful not only for resolving the differences in this LaBSE model, but for integrating any multilingual model in the future (USE, LASER, Sentence Transformers models, etc.) as simple, fast, standard tests of the correctness of the integration process. All F1 scores presented here were obtained on sentences in 20 languages (not English-only sentences) from the datasets: test_df[["test_sentences","y"]].iloc[:100].

Fikavec commented 3 years ago

For the test examples from the "Model with 20 languages!" section of the 2 Class Stock Market Sentiment Classifier Training, I got the following results with the original TF-Hub model embeddings (with max_seq_length = 128):

[Image: stock]

And in the Spark NLP example:

[Image: stock_spark]

maziyarpanahi commented 3 years ago

Thanks, we will investigate this further. The big difference between the macro and micro scores in Spark NLP indicates that some classes did well but some did very badly, which means the tokenizer may not be doing so well on some languages. That being said, some languages don't use whitespace and require segmentation, such as Chinese, Korean, Japanese, etc. For those we have the WordSegmenter annotator, so another issue is that these languages are not really being tokenized at all when a normal Tokenizer is used instead of WordSegmenter. (That will hurt the overall accuracy.)

@C-K-Loan Could you please run the test with those languages separated into two different groups? (Some with Tokenizer and the rest with WordSegmenter; see the sketch below.)
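
A rough sketch of the two-group setup (the word-segmenter model name and language code below are illustrative assumptions; the Models Hub lists the real ones):

```python
from sparknlp.annotator import Tokenizer, WordSegmenterModel

# Group 1: whitespace-delimited languages keep the normal Tokenizer.
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# Group 2: unsegmented languages (zh, ja, ko, th, ...) go through a
# pretrained WordSegmenter; "wordseg_large"/"zh" is a placeholder name.
word_segmenter = WordSegmenterModel.pretrained("wordseg_large", "zh") \
    .setInputCols(["document"]) \
    .setOutputCol("token")
```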

Fikavec commented 3 years ago

This option is very useful:

The case sensitivity in Spark NLP is defined during runtime, depending on the param being set by the user it will lowercase already tokenized text or leave it as it is. Then it tries to tokenize the tokens by BERT tokenizer to encode the piece ids. So you can change it depending on the model if it supports cased or uncased. (you can test both these scenarios)

but as I understand it, by default in the Spark NLP model config this model is not caseSensitive:

{"class":"com.johnsnowlabs.nlp.embeddings.BertSentenceEmbeddings", "timestamp":1600858075694, "sparkVersion":"2.4.4", "uid":"BERT_SENTENCE_EMBEDDINGS_f12be4ed4fef", "paramMap":{ "storageRef":"labse", "inputCols":["sentence"], "caseSensitive":false, "dimension":768, "outputCol":"bert"}, "defaultParamMap":{ "maxSentenceLength":128, "storageRef":"BERT_SENTENCE_EMBEDDINGS_f12be4ed4fef", "lazyAnnotator":false, "caseSensitive":false, "batchSize":32," dimension":768} }

As far as I know, my testing shows that in this model, sentences meaning the same thing in many languages are closer to each other without lowercasing, and the distance between a sentence and its lowercased version is sometimes not so small:

[Image: lowercase]

I think caseSensitive:false as the default for this model may give better results on small train/test datasets (if so, that could be shown separately in the usage examples for this model), but in general it may not be so good, since it disables an advantage of the original model and may lead to the inaccuracies described above. Or am I wrong?

maziyarpanahi commented 3 years ago

That's true; I think that param was set to false by mistake when the model was saved. Since the current version is v1 for TF v1, we will make a new one for TF v2 from version 2 of the model and make sure that param is set to true when we upload it.

Fikavec commented 3 years ago

Using the power of Wikipedia interlanguage links, I have prepared a multilingual test for the future, with sentences about one thing in 171 languages; 108 of those languages are supported by LaBSE (I could not find a sentence in the wo (Wolof) language).

Colab notebook to reproduce: visual_multi_lingual_models_tester.zip
CSV with the sentence "World Health Organization" in 171 languages: multilingual_World_Health_Organization.zip

*The test sentence's language is indicated at the top of the images.

Original TF-Hub LaBSE model, cosine distance heatmap between sentences about one thing in 88 languages:
[Image: Original_TF-HUB_LaBSE_88langs_same_thing_embeddings]

Spark NLP LaBSE model, cosine distance heatmap between sentences about one thing in 88 languages:
[Image: SparkNLP_88langs_same_thing_embeddings]

Fikavec commented 3 years ago

Updated the above multilingual dataset with code for debugging and an idea for future auto-testing. Thank you for the great job on this amazing project and the many interesting workshops. Looking forward to Multilingual T5 (mT5) and future releases.

Fikavec commented 3 years ago

Facebook AI has open-sourced the FLORES-101 dataset, consisting of 3001 sentences translated into 101 languages by professional translators (more info). It may be a great dataset for measuring the quality of multilingual embeddings and a starting point for creating the auto-tests I described earlier.
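
If the FLORES-101 release keeps its published layout of one plain-text dev file per language (e.g. flores101_dataset/dev/eng.dev; an assumption worth verifying against the download), loading it for the per-language check sketched earlier is straightforward:

```python
from pathlib import Path

def load_flores_dev(root="flores101_dataset/dev", n_sentences=100):
    """Read the first n sentences of each language's dev file.

    Returns {language_code: [sentence, ...]}, aligned across languages,
    since FLORES-101 is a parallel corpus.
    """
    corpora = {}
    for path in sorted(Path(root).glob("*.dev")):
        lines = path.read_text(encoding="utf-8").splitlines()
        corpora[path.stem] = lines[:n_sentences]
    return corpora

# corpora["eng"][i], corpora["vie"][i], ... translate the same sentence, so
# the check_integration() sketch above can compare the original and the
# converted model language by language on aligned inputs.
```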

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 5 days