Closed Dirkster99 closed 3 years ago
Hi,
Please make sure you have the following Spark Configs in your SparkSession when you are launching your cluster:
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
--conf spark.kryoserializer.buffer.max=2000M
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master('local[*]') \
    .appName('NLP Test') \
    .config("spark.driver.memory", "6g") \
    .config("spark.executor.memory", "6g") \
    .config("spark.jars", "gs://lab-dev/notebooks/jupyter/test/SparkNLP/init/SparkNLP_version_3_3_1/spark-nlp-assembly-3.3.1.jar") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .getOrCreate()
sc = spark.sparkContext
I've added the configuration as requested but the errors remain exactly the same as listed above :-(
Update: I've tried to instantiate the above pipeline with the Multi LG UniversalEncoder from here, and this seems to work (the pipeline.fit() statement is being evaluated as we speak).
So there seems to be a difference between the embeddings from the Multi LG UniversalEncoder and the:
Any idea, why the last 2 models throw the above exceptions while the multi-Lingual appears to work?
Yes, the multi-lingual USE models use a different SentencePiece implementation whose support is limited by operating system.
However, I would totally recommend using these models for both English and Multi-lingual as a replacement for UniversalSentenceEncoder:
https://nlp.johnsnowlabs.com/models?q=cmlm
These are new USE models based on the BERT architecture, which is why they use BertSentenceEmbeddings. The team behind the Universal Sentence Encoder introduced these in their latest papers, and they all outperform the previous models we use in UniversalSentenceEncoder. (The multi-lingual model you are using doesn't support that many languages, requires specific SentencePiece ops which may fail on some operating systems, and is less accurate than the new CMLM models I referenced here.)
PS: This BERT Sentence Embeddings German (Base Cased) model doesn't use any SentencePiece, so it cannot fail with the SentencepieceEncodeSparse error. (All BERT models use a text vocab, so they are compatible everywhere.)
So the bug seems to be in the UniversalSentenceEncoder (USE) itself? I assumed they are compatible (e.g., using USE version 2.4 in Spark NLP 3.3), but they are in fact not compatible?
Also, I have no idea what SentencepieceEncodeSparse refers to. Maybe the usage of that particular class could be improved with some sanity checks and a human-readable error message instead of exceptions?
Sorry, it's not a bug: the team who trained those multi-lingual USE models decided to use SentencePiece for tokenization, which adds extra ops on top of a normal UniversalSentenceEncoder. (It's out of our hands; the SentencePiece ops are not compatible with some operating systems, but they were the only multi-lingual sentence-embeddings models at the time.)
Now that we have better, more compatible, and more accurate multi-lingual sentence-embeddings models published by the same team, we won't suggest any multi-lingual USE models. Please use the CMLM models I shared earlier.
PS: we have human-readable errors for unsupported operating systems, like Windows, for USE multi-lingual models, but some operating systems raise exceptions in places over which we have no control.
I am trying to implement a multi-label, multi-class classification as described in the docs for MultiClassifierDLApproach.
Description
I am running in an air-gapped environment on GCP, so I have to download/upload pretrained models manually in order to use them in my scenario. For some reason, using a UniversalSentenceEncoder as shown in the sample code always throws an error (I've tried 2 different models, with 2 different errors I cannot seem to resolve).
Expected Behavior
The pipeline.fit() statement should create a model from the dataframe shown in the sample code.
Current Behavior
Different exceptions (see details below) occur either at pipeline definition time or when executing pipeline.fit() on labeled training data.
Possible Solution
I've tried different models and parameters, but I am not sure how to proceed here.
Steps to Reproduce
Defining Pipeline with UniversalSentenceEncoder throws Exception at Pipeline Definition-Time
Exception
Defining Pipeline with BertSentenceEmbeddings throws Exception at pipeline.fit()
Exception
Context
I cannot do MultiLabel predictions using SparkNLP in the GCP.
Your Environment
sparknlp.version(): 3.3.1
spark.version: 3.1.1
java -version: