JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0

LaBSE Sentence Embeddings model output vectors are not equal to original #2846

Closed. Fikavec closed this issue 2 years ago.

Fikavec commented 3 years ago

I'm using the embeddings from the example at https://nlp.johnsnowlabs.com/2020/09/23/labse.html, and the output vectors, although close, are not equal to the original vectors from https://tfhub.dev/google/LaBSE/1. Why? How were the original vectors converted into this model? Maybe the original model was modified or fine-tuned, or Spark NLP uses a different normalization? On example multiclass/multilabel tasks, the Spark NLP LaBSE embeddings behave differently from the original vectors; how can I get the original model's vectors from Spark NLP? The original model is case sensitive, but in the Spark NLP config this model is case insensitive; the original model has max_seq_length = 64, but in Spark NLP it is 128. Are there any other differences?

To reproduce the problem: run the code from the original model's page with ["I love NLP", "Many thanks"] and compare the outputs to those from the Spark NLP model page. Also, how can I open the Spark NLP model with tf.saved_model.load after unzipping the pb file, and run inference in TensorFlow, for a "low-level" comparison of the outputs to analyze the problem (TensorFlow 2.x can't load the unzipped bert_sentence_tensorflow model)? Maybe the difference/problem occurs in the converted TF model?

[Image: difference between the two models' output vectors]

Notebook to reproduce: Compare_outputs_of_Spark_nlp_LaBSE_embeddings_and_Original_TF_hub_LaBSE_embeddings.zip
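
For the "low-level" comparison, something along these lines may work (a hedged sketch: the ./labse_tf path and the "serve" tag are assumptions about the unzipped Spark NLP model; saved_model_cli can confirm the actual tags and signatures):

```python
import tensorflow as tf
import tensorflow_hub as hub

# Original TF-Hub LaBSE v1: loadable directly under TF 2.x.
original = hub.load("https://tfhub.dev/google/LaBSE/1")
print(list(original.signatures.keys()))  # may be empty for a Keras-style SavedModel

# The converted Spark NLP graph is a TF1-style SavedModel, which is why a
# plain tf.saved_model.load fails; the TF1 loader still works under TF 2.x.
# Run `saved_model_cli show --dir ./labse_tf --all` first to see the real tags.
with tf.compat.v1.Session(graph=tf.Graph()) as sess:
    meta_graph = tf.compat.v1.saved_model.loader.load(
        sess, ["serve"], "./labse_tf")
    # Print input/output tensor names so both graphs can be fed identical
    # token ids and their raw outputs compared with np.allclose.
    for name, sig in meta_graph.signature_def.items():
        print(name, list(sig.inputs), list(sig.outputs))
```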

Fikavec commented 3 years ago

We can see differences in the embeddings for example sentences that all mean the same thing, taken from the Spark NLP 20-language multilingual classification demo:

  1. Science has advanced rapidly over the last century
  2. Die Wissenschaft hat im letzten Jahrhundert rasante Fortschritte gemacht
  3. 在上个世纪,科学发展迅速
  4. Die wetenskap het die afgelope eeu vinnig gevorder
  5. Khoa học đã phát triển nhanh chóng trong thế kỷ qua
  6. 科学は前世紀にわたって急速に進歩しました
  7. Isayensi ithuthuke ngokushesha ngekhulu leminyaka elidlule
  8. Bilim, geçen yüzyılda hızla ilerledi
  9. המדע התקדם במהירות במהלך המאה האחרונה
  10. గత శతాబ్దంలో సైన్స్ వేగంగా అభివృద్ధి చెందింది
  11. Наука стремительно развивалась за последнее столетие
  12. سائنس گذشتہ صدی کے دوران تیزی سے ترقی کرچکی ہے
  13. विज्ञान पिछली सदी में तेजी से आगे बढ़ा है
  14. Соңгы гасырда фән тиз үсә
  15. La science a progressé rapidement au cours du siècle dernier
  16. วิทยาศาสตร์ก้าวหน้าอย่างรวดเร็วในช่วงศตวรรษที่ผ่านมา
  17. វិទ្យាសាស្ត្របានជឿនលឿនយ៉ាងលឿនក្នុងរយៈពេលមួយសតវត្សចុងក្រោយនេះ
  18. וויסנשאַפֿט איז ראַפּאַדלי אַוואַנסירטע איבער די לעצטע יאָרהונדערט
  19. Илим акыркы кылымда тездик менен өнүккөн
  20. கடந்த நூற்றாண்டில் அறிவியல் வேகமாக முன்னேறியுள்ளது

Spark NLP LaBSE embeddings cosine similarity heatmap:
[Image: spark_cossim]

Original TF-Hub LaBSE embeddings cosine similarity heatmap, with do_lower_case = False and max_seq_length = 128:
[Image: labse_cossim]

Colab code to reproduce: visual_multi_lingual_compare_outputs_of_Spark_nlp_LaBSE_embeddings_and_Original_TF_hub_LaBSE_embeddings.zip
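
For reference, either heatmap can be regenerated from an embedding matrix with a few lines (a sketch assuming the embeddings land in plain numpy arrays of shape [20, 768]; the variable names are hypothetical):

```python
import numpy as np
import matplotlib.pyplot as plt

def cosine_heatmap(emb, title):
    """Plot the pairwise cosine-similarity matrix of an [n, dim] array."""
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = normed @ normed.T
    plt.imshow(sim, vmin=0.0, vmax=1.0)
    plt.title(title)
    plt.colorbar()
    plt.show()
    return sim

# spark_emb and tfhub_emb are the [20, 768] matrices from the two pipelines:
# cosine_heatmap(spark_emb, "Spark NLP LaBSE")
# cosine_heatmap(tfhub_emb, "Original TF-Hub LaBSE")
```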

As we can see, the maximum difference between Spark NLP and the original TF-Hub model is on the Vietnamese sentence (index 4 in the heatmap). This means that the difference between the Spark NLP and original models is not only in max_seq_length and lowercasing, but also in the tokenization process and elsewhere (use of tf.float64, zero-padding of short sequences, ...)?

Fikavec commented 3 years ago

To simplify debugging, attached are a CSV of sentences sorted by the distance between the Spark NLP embeddings and the original TF-Hub embeddings, and a Colab notebook to reproduce it: compute_sentence_difference_distance.zip
[Image: diff]
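
The CSV can be regenerated roughly like this (a sketch; spark_emb, tfhub_emb, and sentences are hypothetical names for the outputs of the notebooks above):

```python
import numpy as np
import pandas as pd

# L2-normalize both sets, measure how far each Spark NLP vector is from its
# TF-Hub counterpart, and sort the sentences by that distance (worst first).
a = spark_emb / np.linalg.norm(spark_emb, axis=1, keepdims=True)
b = tfhub_emb / np.linalg.norm(tfhub_emb, axis=1, keepdims=True)
dist = np.linalg.norm(a - b, axis=1)

pd.DataFrame({"sentence": sentences, "distance": dist}) \
  .sort_values("distance", ascending=False) \
  .to_csv("sentence_difference_distance.csv", index=False)
```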

Fikavec commented 3 years ago

With tf.float64 enabled, max_seq_length = 128, and input sentences lowercased before being fed to the original TF-Hub model (lowercasing the input sentences themselves; don't touch the do_lower_case option in the original model's example code, because with do_lower_case=True the model produces much worse embeddings than with pre-lowercased input), on the tasks from multi_class_text_classification and the test examples from the "Model with 20 languages!" section, the model can achieve the following results (results are not stable from run to run because of train_test_split without a fixed random seed, model initialization, etc.):

Maybe the main difference from the original model is in Spark NLP's multilingual tokenization procedure, and the embedding difference appears only when input sentences are in certain languages: Vietnamese, Japanese, Urdu, ...?

maziyarpanahi commented 3 years ago

These are very interesting findings, thanks for sharing them. A couple of things come to my mind:

Some explanations:

The case sensitivity in Spark NLP is defined during runtime, depending on the param being set by the user it will lowercase already tokenized text or leave it as it is. Then it tries to tokenize the tokens by BERT tokenizer to encode the piece ids. So you can change it depending on the model if it supports cased or uncased. (you can test both these scenarios)

Some strong possibilities:

I suspect the +2/-2 swing in the English F1-score is caused by the custom tokenization (maybe use RegexTokenizer for simple whitespace tokenization instead of Tokenizer), and the big difference in the multilingual F1-score is caused by poor multilingual tokenization (to be confirmed, if you can compute the metrics per language, to compare how good or bad they are in Spark NLP).
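
If someone wants to test the whitespace hypothesis, the swap would look roughly like this in the classification pipeline (a sketch; the column names are assumptions):

```python
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import RegexTokenizer

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Plain whitespace splitting instead of the default rule-based Tokenizer,
# to see whether the custom tokenization rules cause the F1 swing.
whitespace_tokenizer = RegexTokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token") \
    .setPattern("\\s+")
```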

Will continue debugging since this is a very useful multi-lingual embedding.

maziyarpanahi commented 3 years ago

Just tagging you @C-K-Loan as an FYI, to confirm whether LaBSE's poor performance is due to bad BERT multilingual tokenization. (We always used English, so we might never have noticed.)

Fikavec commented 3 years ago

Thanks for your reply. Regarding this point:

to be confirmed if you can do the metrics by language to compare how good and bad they are compared to Spark NLP

In order to avoid such hard-to-find differences and inaccuracies when integrating multilingual models in the future, and to assess the quality of multilingual embeddings, I propose two methods:

1) Carefully prepare a few groups of sentences that mean the same thing in 100+ languages (the sentences should be simple and unambiguous, without modern vocabulary, hand-picked from texts or from native speakers, not taken from an automatic translator). For each group, draw a cosine heatmap for visual comparison, as I did here. The heatmaps of the original and integrated models can be compared visually to detect differences in particular languages. At the same time, for a quality multilingual embedding of a simple, unambiguous sentence, all squares should be close to 1 (and lighter).

2) Build automated tests that compute the distances from the original model's embeddings to the converted/integrated model's embeddings on sentences in 100+ languages (those supported by the integrated model), as I did here. For example, see the sketch below:
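
A sketch of such an automated check (the threshold and function are illustrative, not a proposed API):

```python
import numpy as np

def check_integration(langs, original_emb, converted_emb, max_dist=0.05):
    """Flag languages where the converted model drifts from the original.

    langs:         list of language codes, one per row
    original_emb:  [n_langs, dim] embeddings from the reference model
    converted_emb: [n_langs, dim] embeddings from the integrated model
    """
    a = original_emb / np.linalg.norm(original_emb, axis=1, keepdims=True)
    b = converted_emb / np.linalg.norm(converted_emb, axis=1, keepdims=True)
    dist = np.linalg.norm(a - b, axis=1)
    failures = sorted(
        [(lang, d) for lang, d in zip(langs, dist) if d > max_dist],
        key=lambda pair: -pair[1])
    for lang, d in failures:
        print(f"FAIL {lang}: distance {d:.4f} exceeds {max_dist}")
    return not failures
```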

I think these techniques will be useful not only for resolving the differences in this LaBSE model, but for integrating any multilingual model in the future (USE, LASER, Sentence Transformers models, etc.) as simple, fast, standard tests of the correctness of the integration process. All F1 scores presented here were obtained on sentences in 20 languages (not English-only sentences) from the datasets: test_df[["test_sentences","y"]].iloc[:100].

Fikavec commented 3 years ago

For the test examples from the "Model with 20 languages!" section of the 2 Class Stock Market Sentiment Classifier Training, I got the following results with the original TF-Hub model embeddings (with max_seq_length = 128):

[Image: stock]

And in the Spark NLP example:

[Image: stock_spark]

maziyarpanahi commented 3 years ago

Thanks, we will investigate this further. The big difference between the macro and micro scores in Spark NLP indicates that some classes did well but some did very badly, which means the tokenizer may not be doing so well on some languages. That being said, some languages don't use whitespace and require segmentation, such as Chinese, Korean, Japanese, etc. For those we have the WordSegmenter annotator, so another issue is that these languages are not really being tokenized at all when a normal Tokenizer is used instead of WordSegmenter. (That will hurt the overall accuracy.)

@C-K-Loan Could you please run the test with those languages separated into two different groups? (Some with Tokenizer and the rest with WordSegmenter; see the sketch below.)
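
A rough sketch of the two-group setup (the word-segmenter model name and language code below are illustrative assumptions; the Models Hub lists the real ones):

```python
from sparknlp.annotator import Tokenizer, WordSegmenterModel

# Group 1: whitespace-delimited languages keep the normal Tokenizer.
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# Group 2: unsegmented languages (zh, ja, ko, th, ...) go through a
# pretrained WordSegmenter; "wordseg_large"/"zh" is a placeholder name.
word_segmenter = WordSegmenterModel.pretrained("wordseg_large", "zh") \
    .setInputCols(["document"]) \
    .setOutputCol("token")
```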

Fikavec commented 3 years ago

This option is very useful:

The case sensitivity in Spark NLP is defined during runtime, depending on the param being set by the user it will lowercase already tokenized text or leave it as it is. Then it tries to tokenize the tokens by BERT tokenizer to encode the piece ids. So you can change it depending on the model if it supports cased or uncased. (you can test both these scenarios)

but as I understand it, by default in the Spark NLP model config this model is not caseSensitive:

{"class":"com.johnsnowlabs.nlp.embeddings.BertSentenceEmbeddings", "timestamp":1600858075694, "sparkVersion":"2.4.4", "uid":"BERT_SENTENCE_EMBEDDINGS_f12be4ed4fef", "paramMap":{ "storageRef":"labse", "inputCols":["sentence"], "caseSensitive":false, "dimension":768, "outputCol":"bert"}, "defaultParamMap":{ "maxSentenceLength":128, "storageRef":"BERT_SENTENCE_EMBEDDINGS_f12be4ed4fef", "lazyAnnotator":false, "caseSensitive":false, "batchSize":32," dimension":768} }

As far as I know, my testing shows that in this model, sentences meaning the same thing in many languages are closer to each other without lowercasing, and the distance between a sentence and its lowercased version is sometimes not so small:

[Image: lowercase]

I think caseSensitive:false as the default for this model may give better results on small train/test datasets (if so, that could be shown separately in the usage examples for this model), but in general it may not be so good, since it disables an advantage of the original model and may lead to the inaccuracies described above. Or am I wrong?

maziyarpanahi commented 3 years ago

That's true; I think that param was set to false by mistake when the model was saved. Since the current version is v1 for TF v1, we will make a new one for TF v2 from version 2 of the model and make sure that param is set to true when we upload it.

Fikavec commented 3 years ago

Using the power of Wikipedia interlanguage links, I have prepared a multilingual test for the future, with sentences about one thing in 171 languages; 108 of those languages are supported by LaBSE (I could not find a sentence in the wo (Wolof) language).

Colab notebook to reproduce: visual_multi_lingual_models_tester.zip
CSV with the sentence "World Health Organization" in 171 languages: multilingual_World_Health_Organization.zip

*The test sentence's language is indicated at the top of the images.

Original TF-Hub LaBSE model, cosine distance heatmap between sentences about one thing in 88 languages:
[Image: Original_TF-HUB_LaBSE_88langs_same_thing_embeddings]

Spark NLP LaBSE model, cosine distance heatmap between sentences about one thing in 88 languages:
[Image: SparkNLP_88langs_same_thing_embeddings]

Fikavec commented 3 years ago

Updated the above multilingual dataset with code for debugging and an idea for future auto-testing. Thank you for the great job on this amazing project and the many interesting workshops. Looking forward to Multilingual T5 (mT5) and future releases.

Fikavec commented 3 years ago

Facebook AI has open-sourced the FLORES-101 dataset, consisting of 3001 sentences translated into 101 languages by professional translators (more info). It may be a great dataset for measuring the quality of multilingual embeddings and a starting point for creating the auto-tests I described earlier.
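
If the FLORES-101 release keeps its published layout of one plain-text dev file per language (e.g. flores101_dataset/dev/eng.dev; an assumption worth verifying against the download), loading it for the per-language check sketched earlier is straightforward:

```python
from pathlib import Path

def load_flores_dev(root="flores101_dataset/dev", n_sentences=100):
    """Read the first n sentences of each language's dev file.

    Returns {language_code: [sentence, ...]}, aligned across languages,
    since FLORES-101 is a parallel corpus.
    """
    corpora = {}
    for path in sorted(Path(root).glob("*.dev")):
        lines = path.read_text(encoding="utf-8").splitlines()
        corpora[path.stem] = lines[:n_sentences]
    return corpora

# corpora["eng"][i], corpora["vie"][i], ... translate the same sentence, so
# the check_integration() sketch above can compare the original and the
# converted model language by language on aligned inputs.
```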

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 5 days