We can see differences in embeddings on examples that mean the same thing, taken from the Spark NLP 20-language multilingual classification demo:
Spark NLP LaBSE embeddings cosine similarity heatmap.
Original TF-Hub LaBSE embeddings cosine similarity heatmap (with do_lower_case = False and max_seq_length = 128).
Colab code to reproduce: visual_multi_lingual_compare_outputs_of_Spark_nlp_LaBSE_embeddings_and_Original_TF_hub_LaBSE_embeddings.zip
As we can see, the maximum difference between Spark NLP and the original TF-Hub model is in sentence 4, in Vietnamese. Does this mean the difference between the Spark NLP and original models lies not only in max_seq_length and lowercased sentences, but also in the tokenization process and elsewhere (use of tf.float64, padding short sequences with 0f, ...)?
To simplify debugging the issue, attached are a CSV with sentences sorted by the distance from the Spark NLP embeddings to the original TF-Hub model's, and a Colab to reproduce: compute_sentence_difference_distance.zip
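For reference, a minimal sketch of the comparison itself, assuming the two pipelines' outputs were already dumped to .npy files (the file names are placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt

def cosine_sim_matrix(emb):
    # L2-normalize rows; the dot product then gives pairwise cosine similarity
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return normed @ normed.T

# Placeholder dumps of the two pipelines' outputs, one row per sentence
spark_emb = np.load("spark_nlp_labse_embeddings.npy")
tfhub_emb = np.load("tf_hub_labse_embeddings.npy")

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, (name, emb) in zip(axes, [("Spark NLP", spark_emb), ("TF-Hub", tfhub_emb)]):
    ax.imshow(cosine_sim_matrix(emb), vmin=0, vmax=1)
    ax.set_title(f"{name} LaBSE cosine similarity")
plt.show()

# Per-sentence distance between the two models, worst offenders first
dist = np.linalg.norm(spark_emb - tfhub_emb, axis=1)
for i in np.argsort(-dist):
    print(i, round(float(dist[i]), 4))
```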
With tf.float64 activated, max_seq_length = 128, and lowercased input sentences fed to the original TF-Hub model (note: lowercase the input sentences yourself, but don't touch the do_lower_case option in the original model's example code, because with do_lower_case=True this model produces much worse embeddings than with pre-lowercased input), on tasks from multi_class_text_classification and the test examples from the "Model with 20 languages!" sections, the model can achieve the following results (results are not stable from run to run because of train_test_split without a fixed random seed, model initialization, etc.):
Amazon, in the Spark NLP example:
TripAdvisor, in the Spark NLP example:
News, in the Spark NLP example:
Maybe the main difference from the original model is in Spark NLP's multilingual tokenization procedure, and the embeddings difference appears only when input sentences are in certain languages: Vietnamese, Japanese, Urdu, ...?
These are very interesting findings, thanks for sharing them. A couple of things come to my mind:
Some explanations:
`max_seq_length`: you can change that during runtime as well in Spark NLP via `setMaxSentenceLength`. The BERT models support up to 512, so you can choose the length depending on your needs, e.g. matching the two setups for comparison.
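For example, a minimal sketch of setting it at load time (the `labse` model name comes from the Spark NLP models page; the rest is a standard pipeline setup):

```python
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, BertSentenceEmbeddings

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentence = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")

# Match the original TF-Hub setting (64) instead of the serialized default (128)
embeddings = (BertSentenceEmbeddings.pretrained("labse", "xx")
              .setInputCols(["sentence"])
              .setOutputCol("bert")
              .setMaxSentenceLength(64))
```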
Some strong possibilities:
I suspect the ±2 F1-score swing in English is caused by the custom tokenization (maybe use RegexTokenizer for simple tokenization by whitespace instead of Tokenizer; a sketch follows below), and the big F1-score difference in the multilingual setting is caused by bad multilingual tokenization (to be confirmed if you can compute the metrics by language to compare how good or bad they are relative to Spark NLP).
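A minimal sketch of that swap, for pipelines where the embeddings annotator consumes pre-tokenized text (the whitespace pattern is the point of the suggestion):

```python
from sparknlp.annotator import RegexTokenizer

# Split on whitespace only, leaving punctuation attached to words,
# so no custom tokenization rules run before the BERT wordpiece step
whitespace_tokenizer = (RegexTokenizer()
                        .setInputCols(["document"])
                        .setOutputCol("token")
                        .setPattern("\\s+"))
```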
Will continue debugging since this is a very useful multi-lingual embedding.
Just tagging you @C-K-Loan as an FYI, to confirm whether LaBSE's bad performance is due to bad BERT multilingual tokenization. (We always used English, so we might never have noticed.)
Thanks for your reply. For this reason:
> to be confirmed if you can compute the metrics by language to compare how good or bad they are relative to Spark NLP
In order to avoid such hard-to-find differences/inaccuracies when integrating multilingual models in the future, and to assess the quality of multilingual embeddings, I propose two methods:
1) Carefully prepare a few groups of sentences that mean the same thing in 100+ languages (the sentences should be simple and unambiguous, without modern vocabulary, hand-picked from texts or from native speakers, not from an automatic translator). For each group, draw a cosine heatmap for visual comparison, as I do here. The heatmaps of the original and integrated models can be compared visually to detect differences in particular languages. At the same time, in a quality multilingual embedding, all squares for a simple unambiguous concept should be close to 1 (and lighter).
2) Make automated tests that compute distances from the original models to the converted/integrated ones on sentences in 100+ languages (supported by the integrated models), like in 1), as I do here. For example:
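A minimal sketch of what such an automated check could look like, assuming each model's embeddings for one sentence group are available as arrays (the names and the 0.99 threshold are illustrative):

```python
import numpy as np

def integration_drift(original, integrated, langs, top_k=10):
    """Cosine similarity between each language's embedding from the original
    model and from the integration; low values flag problem languages."""
    a = original / np.linalg.norm(original, axis=1, keepdims=True)
    b = integrated / np.linalg.norm(integrated, axis=1, keepdims=True)
    sims = np.sum(a * b, axis=1)
    order = np.argsort(sims)  # worst languages first
    return [(langs[i], float(sims[i])) for i in order[:top_k]]

# Usage (arrays are placeholders for the two models' outputs):
# worst = integration_drift(original_emb, spark_emb, lang_codes)
# assert worst[0][1] > 0.99, f"integration drift detected: {worst[:3]}"
```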
I think these techniques will be useful not only for resolving the differences in this LaBSE model, but also for integrating any multilingual models in the future (like USE, LASER, Sentence Transformers models, etc.) as simple, fast, standard tests of the correctness of the integration process. All F1 scores presented here I got for sentences in 20 languages (not English-only sentences) from the datasets: test_df[["test_sentences","y"]].iloc[:100].
For the test examples from the "Model with 20 languages!" section of the 2 Class Stock Market Sentiment Classifier Training, I got the following results with the original TF-Hub model embeddings (plus max_seq_length = 128):
In the Spark NLP example:
Thanks, we will investigate this further. The big difference between macro and micro in Spark NLP indicates some classes did well but some did very badly, which means the tokenizer may not be doing so well on some languages. That being said, some languages don't use whitespace and require segmentation, such as Chinese, Korean, Japanese, etc. For those we have the WordSegmenter annotator, so part of the issue is that these languages are not being tokenized properly when a normal Tokenizer is used instead of WordSegmenter. (That will hurt the accuracy overall.)
@C-K-Loan Could you please run the test separating those languages into two different groups? (Some with Tokenizer and the rest using WordSegmenter; see the sketch below.)
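A sketch of the two-group setup (the pretrained segmenter name is an assumed example; check the Models Hub for the right model per language):

```python
from sparknlp.annotator import Tokenizer, WordSegmenterModel

# Group 1: whitespace-delimited languages keep the normal Tokenizer
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")

# Group 2: languages without whitespace (zh/ja/ko, ...) use a segmenter;
# "wordseg_gsd_ud" / "ja" is an assumed example model name
segmenter = (WordSegmenterModel.pretrained("wordseg_gsd_ud", "ja")
             .setInputCols(["document"])
             .setOutputCol("token"))
```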
This option is very useful:
> The case sensitivity in Spark NLP is defined during runtime; depending on the param set by the user, it will lowercase the already-tokenized text or leave it as it is. Then it tries to tokenize the tokens with the BERT tokenizer to encode the piece ids. So you can change it depending on whether the model supports cased or uncased. (You can test both of these scenarios.)

but as I understand, by default in the Spark NLP model config this model is not caseSensitive:
{"class":"com.johnsnowlabs.nlp.embeddings.BertSentenceEmbeddings", "timestamp":1600858075694, "sparkVersion":"2.4.4", "uid":"BERT_SENTENCE_EMBEDDINGS_f12be4ed4fef", "paramMap":{ "storageRef":"labse", "inputCols":["sentence"], "caseSensitive":false, "dimension":768, "outputCol":"bert"}, "defaultParamMap":{ "maxSentenceLength":128, "storageRef":"BERT_SENTENCE_EMBEDDINGS_f12be4ed4fef", "lazyAnnotator":false, "caseSensitive":false, "batchSize":32," dimension":768} }
As far as I know, the original model is case sensitive. And I got test results showing that in this model, same-meaning sentences in many languages are closer to each other without lowercasing, and the distance between a sentence and its lowercased version is sometimes not that small:
I think caseSensitive:false by default for this model may give better results on small train/test datasets (if so, that could be shown separately in the usage examples for this model), but in general this may not be so good, as it disables an advantage of the original model and may lead to the inaccuracies above. Or am I wrong?
That's true; I think that param was set to false by mistake when it was saved. Since the current version is 1 (for TF v1), we will make a new one for TF v2 from version 2 and make sure that param is set to true when we upload it.
Using the power of Wikipedia interlanguage links, I prepared for the future a multilingual test with sentences about one thing in 171 languages; 108 of those languages are supported by LaBSE (I could not find a sentence in the wo (Wolof) language).
Colab notebook to reproduce: visual_multi_lingual_models_tester.zip
CSV with the sentence "World Health Organization" in 171 languages: multilingual_World_Health_Organization.zip
*The test sentence language is indicated at the top of the images.
Original TF-Hub LaBSE model: cosine distance heatmap between sentences about one thing in 88 languages.
Spark NLP LaBSE model: cosine distance heatmap between sentences about one thing in 88 languages.
Updated the above multilingual dataset with code for debugging and an idea for future autotesting. Thank you for the great job on this amazing project and the many interesting workshops. Looking forward to Multilingual T5 (mT5) and future releases.
Facebook AI open-sourced the FLORES-101 dataset, consisting of 3001 sentences translated into 101 languages by professional translators (more info). It may be a great dataset for measuring the quality of multilingual embeddings and a starting point for creating the auto tests I described earlier.
This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 5 days
I'm using embeddings from the example https://nlp.johnsnowlabs.com/2020/09/23/labse.html, and the output vectors, although close, are not equal to the original vectors from https://tfhub.dev/google/LaBSE/1. Why? How were the original vectors converted for this model; was the original model modified or fine-tuned, or does Spark NLP use a different normalization? On example multiclass/multilabel tasks, the Spark NLP LaBSE embeddings work differently from the original vectors; how can I get those vectors from the original model? The original vectors are case sensitive, but in the Spark NLP config these vectors are case insensitive; the original max_seq_length = 64, but in Spark NLP it is 128... any other differences?

To reproduce the problem: run the code from the original vectors page with ["I love NLP", "Many thanks"] and compare the outputs to the outputs from the Spark NLP model page. Also, how can I open the Spark NLP model with tf.saved_model.load after unzipping the pb file, and run inference in TensorFlow for a "low-level" comparison of outputs to analyze the problem (TensorFlow 2.x can't load the unzipped bert_sentence_tensorflow model)? Maybe the difference/problem occurs in the converted TF model?
Notebook to reproduce: Compare_outputs_of_Spark_nlp_LaBSE_embeddings_and_Original_TF_hub_LaBSE_embeddings.zip
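On the "low-level" comparison, a generic sketch for inspecting a TF SavedModel once the graph is in a loadable format (the path is a placeholder; as noted above, the unzipped bert_sentence_tensorflow graph may not load directly in TF 2.x):

```python
import tensorflow as tf

# Path to an unzipped SavedModel directory (placeholder)
loaded = tf.saved_model.load("unzipped_labse_model/")

# List the available signatures with their input/output tensor specs
for name, fn in loaded.signatures.items():
    print(name)
    print("  inputs: ", fn.structured_input_signature)
    print("  outputs:", fn.structured_outputs)
```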