JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0
3.88k stars 712 forks

Cannot use MultiClassifierDLApproach Pipeline in GCP #6346

Closed Dirkster99 closed 3 years ago

Dirkster99 commented 3 years ago

I am trying to implement a MultiLabel - MultiClass Classification as described in the docs for MultiClassifierDLApproach.

Description

I am running in an air-gapped environment on GCP, so I have to download and upload pretrained models manually in order to use them. Loading a UniversalSentenceEncoder as shown in the sample code always throws an error. I've tried 2 different models and get 2 different errors, neither of which I can resolve.

Expected Behavior

The pipeline.fit() statement should create a model from the dataframe shown in the sample code.

Current Behavior

Different exceptions (see details below) occur either at pipeline definition time or when executing pipeline.fit() on labeled training data.

Possible Solution

I've tried different models and parameters but am not sure how to proceed here.

Steps to Reproduce

Defining Pipeline with UniversalSentenceEncoder throws Exception at Pipeline Definition-Time

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document") \
    .setCleanupMode("shrink")

# Download Source: https://nlp.johnsnowlabs.com/2020/12/08/tfhub_use_xling_en_de_xx.html
remoteUnsentEncModelPath = 'gs://lab-dev/notebooks/jupyter/test/SparkNLP/SentenceEmbedding/tfhub_use_xling_en_de_xx_2.7.0_2.4_1607440247381/'
embeddings = UniversalSentenceEncoder.load(remoteUnsentEncModelPath) \
    .setInputCols("document") \
    .setOutputCol("embeddings")

docClassifier = MultiClassifierDLApproach() \
    .setInputCols("embeddings") \
    .setOutputCol("category") \
    .setLabelColumn("labels") \
    .setBatchSize(128) \
    .setMaxEpochs(10) \
    .setLr(1e-3) \
    .setThreshold(0.5) \
    .setValidationSplit(0.1)

pipeline = Pipeline().setStages([
    documentAssembler,
    embeddings,
    docClassifier
])

Exception

Py4JJavaError                             Traceback (most recent call last)
<ipython-input-25-b6793882aaea> in <module>
     17 
     18 remoteUnsentEncModelPath = 'gs://lab-dev/notebooks/jupyter/test/SparkNLP/SentenceEmbedding/tfhub_use_xling_en_de_xx_2.7.0_2.4_1607440247381/'
---> 19 embeddings = UniversalSentenceEncoder.load(remoteUnsentEncModelPath) \
     20     .setInputCols("document") \
     21     .setOutputCol("embeddings")

/usr/lib/spark/python/pyspark/ml/util.py in load(cls, path)
    330     def load(cls, path):
    331         """Reads an ML instance from the input path, a shortcut of `read().load(path)`."""
--> 332         return cls.read().load(path)
    333 
    334 

/usr/lib/spark/python/pyspark/ml/util.py in load(self, path)
    280         if not isinstance(path, str):
    281             raise TypeError("path should be a string, got type %s" % type(path))
--> 282         java_obj = self._jread.load(path)
    283         if not hasattr(self._clazz, "_from_java"):
    284             raise NotImplementedError("This Java ML type cannot be loaded into Python currently: %r"

/opt/conda/miniconda3/lib/python3.8/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1302 
   1303         answer = self.gateway_client.send_command(command)
-> 1304         return_value = get_return_value(
   1305             answer, self.gateway_client, self.target_id, self.name)
   1306 

/usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    109     def deco(*a, **kw):
    110         try:
--> 111             return f(*a, **kw)
    112         except py4j.protocol.Py4JJavaError as e:
    113             converted = convert_exception(e.java_exception)

/opt/conda/miniconda3/lib/python3.8/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    324             value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325             if answer[1] == REFERENCE_TYPE:
--> 326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
    328                     format(target_id, ".", name), value)

Py4JJavaError: An error occurred while calling o348.load.
: org.tensorflow.exceptions.TensorFlowException: Op type not registered 'SentencepieceEncodeSparse' in binary running on test-m. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.
    at org.tensorflow.internal.c_api.AbstractTF_Status.throwExceptionIfNotOK(AbstractTF_Status.java:101)
    at org.tensorflow.Graph.importGraphDef(Graph.java:630)
    at org.tensorflow.Graph.importGraphDef(Graph.java:201)
    at org.tensorflow.Graph.importGraphDef(Graph.java:185)
    at com.johnsnowlabs.ml.tensorflow.TensorflowWrapper$.readGraph(TensorflowWrapper.scala:374)
    at com.johnsnowlabs.ml.tensorflow.TensorflowWrapper$.unpackWithoutBundle(TensorflowWrapper.scala:301)
    at com.johnsnowlabs.ml.tensorflow.TensorflowWrapper$.readWithSP(TensorflowWrapper.scala:479)
    at com.johnsnowlabs.ml.tensorflow.ReadTensorflowModel.readTensorflowWithSPModel(TensorflowSerializeModel.scala:180)
    at com.johnsnowlabs.ml.tensorflow.ReadTensorflowModel.readTensorflowWithSPModel$(TensorflowSerializeModel.scala:153)
    at com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder$.readTensorflowWithSPModel(UniversalSentenceEncoder.scala:310)
    at com.johnsnowlabs.nlp.embeddings.ReadUSETensorflowModel.readTensorflow(UniversalSentenceEncoder.scala:281)
    at com.johnsnowlabs.nlp.embeddings.ReadUSETensorflowModel.readTensorflow$(UniversalSentenceEncoder.scala:279)
    at com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder$.readTensorflow(UniversalSentenceEncoder.scala:310)
    at com.johnsnowlabs.nlp.embeddings.ReadUSETensorflowModel.$anonfun$$init$$1(UniversalSentenceEncoder.scala:285)
    at com.johnsnowlabs.nlp.embeddings.ReadUSETensorflowModel.$anonfun$$init$$1$adapted(UniversalSentenceEncoder.scala:285)
    at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$onRead$1(ParamsAndFeaturesReadable.scala:47)
    at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$onRead$1$adapted(ParamsAndFeaturesReadable.scala:46)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.onRead(ParamsAndFeaturesReadable.scala:46)
    at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$read$1(ParamsAndFeaturesReadable.scala:57)
    at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$read$1$adapted(ParamsAndFeaturesReadable.scala:57)
    at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:35)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

Defining Pipeline with BertSentenceEmbeddings throws Exception at pipeline.fit()

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document") \
    .setCleanupMode("shrink")

# Download Source: https://nlp.johnsnowlabs.com/2021/09/15/sent_bert_base_cased_de.html
remoteBertModelPath = 'gs://lab-dev/notebooks/jupyter/test/SparkNLP/SentenceEmbedding/sent_bert_base_cased_de_3.2.2_3.0_1631706255661/'
useEmbeddings = BertSentenceEmbeddings.load(remoteBertModelPath) \
    .setBatchSize(128) \
    .setInputCols("document") \
    .setOutputCol("embeddings")

docClassifier = MultiClassifierDLApproach() \
    .setInputCols("embeddings") \
    .setOutputCol("category") \
    .setLabelColumn("labels") \
    .setBatchSize(128) \
    .setMaxEpochs(10) \
    .setLr(1e-3) \
    .setThreshold(0.5) \
    .setValidationSplit(0.1)

pipeline = Pipeline().setStages([
    documentAssembler,
    embeddings,
    docClassifier
])

# Loading the Spark data frame is not shown, but the relevant columns look like this
trainDF.show(6, truncate=False)
+--------+-----+------+---------------------------------+----------------------------------------------------------+
|DocId   |RowId|SentId|labels                           |text                                                      |
+--------+-----+------+---------------------------------+----------------------------------------------------------+
|A_1     |1    |2     |[ConfigSelfinstall]              |Installationsanleitung ist nicht vorhanden                |
|A_1     |2    |3     |[ConfigSelfinstall]              |Lösung: Installation durchgeführt                         |
|A_10    |5    |2     |[Vertrag, Rechnung]              |Vertraglicher Preisunterschied wurde ermittelt            |
|A_10    |7    |4     |[DTVFernsehen]                   |bezahlt TV für 2 Personen                                 |
|A_100   |12   |2     |[Rechnung]                       |Ich möchte keine Mehrkosten                               |
+--------+-----+------+---------------------------------+----------------------------------------------------------+
pipelineModel = pipeline.fit(trainDF)

Exception

TypeError                                 Traceback (most recent call last)
<ipython-input-27-eab7fdcd013e> in <module>
----> 1 pipelineModel = pipeline.fit(trainDF)

/usr/lib/spark/python/pyspark/ml/base.py in fit(self, dataset, params)
    159                 return self.copy(params)._fit(dataset)
    160             else:
--> 161                 return self._fit(dataset)
    162         else:
    163             raise ValueError("Params must be either a param map or a list/tuple of param maps, "

/usr/lib/spark/python/pyspark/ml/pipeline.py in _fit(self, dataset)
     99         for stage in stages:
    100             if not (isinstance(stage, Estimator) or isinstance(stage, Transformer)):
--> 101                 raise TypeError(
    102                     "Cannot recognize a pipeline stage of type %s." % type(stage))
    103         indexOfLastEstimator = -1

TypeError: Cannot recognize a pipeline stage of type <class 'module'>.
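
A possible reading of this traceback, not confirmed anywhere in the thread: the snippet assigns the loaded model to `useEmbeddings`, but the stage list passes `embeddings`. If that name resolves to a module in the notebook session (the wildcard imports are one plausible source), `Pipeline` receives a module rather than an annotator and raises exactly this `TypeError`. The sketch below uses a hypothetical pre-flight helper, `find_module_stages`, to flag such entries with the standard library only. It stands in for the check in `pyspark.ml.pipeline` and is not part of Spark NLP; the stage objects are placeholders so the example runs without Spark.

```python
import types


def find_module_stages(stages):
    """Return (index, module name) for every stage that is a module object."""
    return [
        (i, stage.__name__)
        for i, stage in enumerate(stages)
        if isinstance(stage, types.ModuleType)
    ]


# Placeholder for real annotators, so the sketch runs without Spark.
class FakeStage:
    pass


document_assembler = FakeStage()
use_embeddings = FakeStage()  # the loaded model was assigned to this name ...
embeddings = types            # ... while this name points at a module (assumed scenario)
doc_classifier = FakeStage()

# Stage list as written in the report: the module slips in at index 1.
bad = find_module_stages([document_assembler, embeddings, doc_classifier])
# Stage list using the variable that actually holds the model.
good = find_module_stages([document_assembler, use_embeddings, doc_classifier])
print(bad)
print(good)
```

If this reading is right, passing `useEmbeddings` in `setStages` would be the change to try first.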

Context

I cannot do MultiLabel predictions using SparkNLP in the GCP.

Your Environment


* Setup and installation (Pypi, Conda, Maven, etc.):
* Operating System and version:
* Link to your project (if any):

maziyarpanahi commented 3 years ago

Hi,

Please make sure you have the following Spark Configs in your SparkSession when you are launching your cluster:

--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
--conf spark.kryoserializer.buffer.max=2000M

Dirkster99 commented 3 years ago

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master('local[*]') \
    .appName('NLP Test') \
    .config("spark.driver.memory", "6g") \
    .config("spark.executor.memory", "6g") \
    .config("spark.jars", "gs://lab-dev/notebooks/jupyter/test/SparkNLP/init/SparkNLP_version_3_3_1/spark-nlp-assembly-3.3.1.jar") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .getOrCreate()

sc = spark.sparkContext

I've added the configuration as requested but the errors remain exactly the same as listed above :-(

Dirkster99 commented 3 years ago

Update: I've tried to instantiate the above pipeline with the Multi LG UniversalEncoder from here and this seems to work (the pipeline.fit() statement is being evaluated as we speak).

So there seems to be a difference between the embeddings from the Multi LG UniversalEncoder and the two models I tried above:

* tfhub_use_xling_en_de_xx
* sent_bert_base_cased_de

Any idea why the last 2 models throw the above exceptions while the multi-lingual one appears to work?

maziyarpanahi commented 3 years ago

Yes, the multi-lingual USE models depend on SentencePiece ops, which are only supported on some operating systems.

However, I would totally recommend using these models for both English and Multi-lingual as a replacement for UniversalSentenceEncoder:

https://nlp.johnsnowlabs.com/models?q=cmlm

These are new USE models based on the BERT architecture, which is why they are loaded with BertSentenceEmbeddings. The team behind Universal Sentence Encoder introduced them in their latest papers, and they all outperform the previous models we use in UniversalSentenceEncoder. (The multi-lingual model you are using doesn't support that many languages, requires specific SentencePiece ops that may fail on some operating systems, and is less accurate than the new CMLM models I referenced here.)

PS: This BERT Sentence Embeddings German (Base Cased) model doesn't use any SentencePiece so it's not possible to fail with the SentencepieceEncodeSparse error. (all BERT models use a text vocab so they all are compatible with everything)

Dirkster99 commented 3 years ago

So the bug seems to be in the UniversalSentenceEncoder (USE) itself? I assumed the versions were compatible (e.g. a USE model built for 2.4 used in SparkNLP 3.3), but they are in fact not?

Also, I have no idea what SentencePieceEncodeSparse refers to. Maybe the usage of that particular class could be improved by making some sanity checks and human readable error message instead of exceptions...?

maziyarpanahi commented 3 years ago

Sorry, it's not a bug, the team who trained those multi-lingual USE models decided to use SentencePiece for tokenization which is extra ops on top of a normal UniversalSentenceEncoder. (it's out of our hands, and the SentencePiece ops are not compatible on some operating systems, but they were the only multi-lingual sentence-embeddings models at the time)

Now that we have better, more compatible, and accurate multi-lingual sentence embeddings models published by the same team we won't suggest any multi-lingual USE models. Please use the CMLM models I shared earlier.

PS: we have human-readable errors for unsupported operating systems like Windows for USE multi-lingual models, but some operating systems have exceptions in places that we don't have any control over.
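
The sanity check requested earlier in the thread could be sketched roughly as follows. This is a hypothetical helper, `explain_missing_op`, and not something Spark NLP provides: it only inspects the text of an exception message such as the one in the first traceback and turns an "Op type not registered" failure into a short hint. The wording of the hint and the SentencePiece heuristic are assumptions based on the maintainer's explanation above.

```python
import re

# Matches the op name in TensorFlow's "Op type not registered '<name>'" message.
_OP_PATTERN = re.compile(r"Op type not registered '([^']+)'")


def explain_missing_op(message):
    """Return a readable hint for an unregistered-op error, or None if it is a different error."""
    match = _OP_PATTERN.search(message)
    if match is None:
        return None
    op_name = match.group(1)
    hint = (
        f"The saved model needs the TensorFlow op '{op_name}', "
        "which is not available in this runtime."
    )
    # Heuristic (assumption): SentencePiece ops are the ones the thread reports as OS-dependent.
    if "sentencepiece" in op_name.lower():
        hint += (
            " The op name suggests a SentencePiece-based model; the thread recommends "
            "the CMLM models loaded with BertSentenceEmbeddings instead."
        )
    return hint


raw = (
    "org.tensorflow.exceptions.TensorFlowException: Op type not registered "
    "'SentencepieceEncodeSparse' in binary running on test-m."
)
print(explain_missing_op(raw))
```

In a notebook, such a helper could be applied to `str(e)` inside an `except Exception as e:` block around the `load` call, falling back to re-raising when it returns `None`.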