JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0

TypeError("Params must be a param map but got %s." % type(params)) #8049

Closed · jmcmt87 closed 2 years ago

jmcmt87 commented 2 years ago

When I use the model 'bert_sequence_classifier_emotion' or DistilBERT Sequence Classification - Emotion, it throws a TypeError; any other emotion classifier works fine.

Description

When I use 'bert_sequence_classifier_emotion', I get the following error message: TypeError("Params must be a param map but got %s." % type(params)). The error appears to be an incompatibility between the model and pyspark.ml.pipeline.

I consider it a bug because this model worked in my app for days and then suddenly stopped. I thought it might have something to do with my dataframe, but I get the same problem with the other dataframes I tried.

Also, this only happens when I use a BERT model that originally came from Hugging Face; as I said, any other classifier works fine.

I use the pretrained model as shown on the Spark NLP page:

from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, BertForSequenceClassification

document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

sequenceClassifier = BertForSequenceClassification \
      .pretrained('bert_sequence_classifier_emotion', 'en') \
      .setInputCols(['token', 'document']) \
      .setOutputCol('class')

emotion_pipeline = Pipeline(stages=[document_assembler, tokenizer, sequenceClassifier])
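
As a quick sanity check, the pipeline can be smoke-tested on a one-row dataframe before running it on real data (a minimal sketch, assuming an active SparkSession named spark):

# Minimal smoke test; `spark` is assumed to be an active SparkSession
sample_df = spark.createDataFrame([("I am so happy today!",)], ["text"])
result = emotion_pipeline.fit(sample_df).transform(sample_df)
result.select('class.result').show(truncate=False)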

Expected Behavior

To be able to process the dataframe with the pipeline, as I could before the error appeared.

Current Behavior

It throws the following error message:

Input In [2], in process_df(sentiment_df, emotion_pipeline)
     75 except Exception:
     76     pass
---> 77 return emotion_pipeline.fit(sentiment_df).transform(sentiment_df) \
     78     .select('created_at', 'text', 'topic', 'sentiment',
     79     F.element_at(F.col('class.result'), 1).alias('emotion'))

File ~/.local/share/virtualenvs/spark_home_lab-iuwyZNhT/lib/python3.9/site-packages/pyspark/ml/base.py:217, in Transformer.transform(self, dataset, params)
    215         return self.copy(params)._transform(dataset)
    216     else:
--> 217         return self._transform(dataset)
    218 else:
    219     raise TypeError("Params must be a param map but got %s." % type(params))

File ~/.local/share/virtualenvs/spark_home_lab-iuwyZNhT/lib/python3.9/site-packages/pyspark/ml/pipeline.py:278, in PipelineModel._transform(self, dataset)
    276 def _transform(self, dataset):
    277     for t in self.stages:
--> 278         dataset = t.transform(dataset)
    279     return dataset

File ~/.local/share/virtualenvs/spark_home_lab-iuwyZNhT/lib/python3.9/site-packages/pyspark/ml/base.py:217, in Transformer.transform(self, dataset, params)
    215         return self.copy(params)._transform(dataset)
    216     else:
--> 217         return self._transform(dataset)
    218 else:
    219     raise TypeError("Params must be a param map but got %s." % type(params))

File ~/.local/share/virtualenvs/spark_home_lab-iuwyZNhT/lib/python3.9/site-packages/pyspark/ml/wrapper.py:349, in JavaTransformer._transform(self, dataset)
    348 def _transform(self, dataset):
--> 349     self._transfer_params_to_java()
    350     return DataFrame(self._java_obj.transform(dataset._jdf), dataset.sql_ctx)

File ~/.local/share/virtualenvs/spark_home_lab-iuwyZNhT/lib/python3.9/site-packages/pyspark/ml/wrapper.py:146, in JavaParams._transfer_params_to_java(self)
    144         self._java_obj.set(pair)
    145     if self.hasDefault(param):
--> 146         pair = self._make_java_param_pair(param, self._defaultParamMap[param])
    147         pair_defaults.append(pair)
    148 if len(pair_defaults) > 0:

File ~/.local/share/virtualenvs/spark_home_lab-iuwyZNhT/lib/python3.9/site-packages/pyspark/ml/wrapper.py:132, in JavaParams._make_java_param_pair(self, param, value)
    130 sc = SparkContext._active_spark_context
    131 param = self._resolveParam(param)
--> 132 java_param = self._java_obj.getParam(param.name)
    133 java_value = _py2java(sc, value)
    134 return java_param.w(java_value)

File ~/.local/share/virtualenvs/spark_home_lab-iuwyZNhT/lib/python3.9/site-packages/py4j/java_gateway.py:1321, in JavaMember.__call__(self, *args)
   1315 command = proto.CALL_COMMAND_NAME +\
   1316     self.command_header +\
   1317     args_command +\
   1318     proto.END_COMMAND_PART
   1320 answer = self.gateway_client.send_command(command)
-> 1321 return_value = get_return_value(
   1322     answer, self.gateway_client, self.target_id, self.name)
   1324 for temp_arg in temp_args:
   1325     temp_arg._detach()

File ~/.local/share/virtualenvs/spark_home_lab-iuwyZNhT/lib/python3.9/site-packages/pyspark/sql/utils.py:111, in capture_sql_exception.<locals>.deco(*a, **kw)
    109 def deco(*a, **kw):
    110     try:
--> 111         return f(*a, **kw)
    112     except py4j.protocol.Py4JJavaError as e:
    113         converted = convert_exception(e.java_exception)

File ~/.local/share/virtualenvs/spark_home_lab-iuwyZNhT/lib/python3.9/site-packages/py4j/protocol.py:326, in get_return_value(answer, gateway_client, target_id, name)
    324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325 if answer[1] == REFERENCE_TYPE:
--> 326     raise Py4JJavaError(
    327         "An error occurred while calling {0}{1}{2}.\n".
    328         format(target_id, ".", name), value)
    329 else:
    330     raise Py4JError(
    331         "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
    332         format(target_id, ".", name, value))

Py4JJavaError: An error occurred while calling o208.getParam.
: java.util.NoSuchElementException: Param activation does not exist.
    at org.apache.spark.ml.param.Params.$anonfun$getParam$2(params.scala:705)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.ml.param.Params.getParam(params.scala:705)
    at org.apache.spark.ml.param.Params.getParam$(params.scala:703)
    at org.apache.spark.ml.PipelineStage.getParam(Pipeline.scala:41)
    at sun.reflect.GeneratedMethodAccessor54.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.lang.Thread.run(Thread.java:748)

Possible Solution

Steps to Reproduce

This is how I'm using it:

from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, BertForSequenceClassification

document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

sequenceClassifier = BertForSequenceClassification \
      .pretrained('bert_sequence_classifier_emotion', 'en') \
      .setInputCols(['token', 'document']) \
      .setOutputCol('class')

emotion_pipeline = Pipeline(stages=[document_assembler, tokenizer, sequenceClassifier])

df = spark.read.format("parquet").load(paths)

emotion_pipeline.fit(df).transform(df)
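
The predicted label can then be pulled out of the class annotation column, mirroring the select in the traceback above (a sketch):

from pyspark.sql import functions as F

result = emotion_pipeline.fit(df).transform(df)
result.select(F.element_at(F.col('class.result'), 1).alias('emotion')).show()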

I tried another pretrained model from Spark NLP instead of this one and it works. I also tried this model on other dataframes and it still fails, so I'm sure it's the model that is not working.

Context

I cannot continue unless I use another pretrained model ('classifierdl_use_emotion'), but its results are much worse.

Your Environment

This is my PySpark configuration:

packages = ','.join(
        [
            'org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1',
            'com.amazonaws:aws-java-sdk:1.11.563',
            'org.apache.hadoop:hadoop-aws:3.2.2',
            'org.apache.hadoop:hadoop-client-api:3.2.2',
            'org.apache.hadoop:hadoop-client-runtime:3.2.2',
            'org.apache.hadoop:hadoop-yarn-server-web-proxy:3.2.2',
            'com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.2',
            'org.mongodb.spark:mongo-spark-connector_2.12:3.0.1'
        ]
    )

spark = SparkSession.builder.appName('twitter_app_nlp')\
        .master("local[*]")\
        .config('spark.jars.packages', packages) \
        .config('spark.streaming.stopGracefullyOnShutdown', 'true')\
        .config('spark.hadoop.fs.s3a.aws.credentials.provider', 
                'org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider') \
        .config('spark.hadoop.fs.s3a.access.key', 
                config.get("ACCESS_KEY")) \
        .config('spark.hadoop.fs.s3a.secret.key', 
                config.get("SECRET_ACCESS_KEY")) \
        .config("spark.hadoop.fs.s3a.impl",
                "org.apache.hadoop.fs.s3a.S3AFileSystem") \
        .config('spark.sql.shuffle.partitions', 3) \
        .config("spark.driver.memory","8G")\
        .config("spark.driver.maxResultSize", "0") \
        .config("spark.kryoserializer.buffer.max", "2000M")\
        .config("spark.mongodb.input.uri", config.get("mongoDB")) \
        .config("spark.mongodb.output.uri", config.get("mongoDB")) \
        .getOrCreate()

maziyarpanahi commented 2 years ago

Hi @jmcmt87

java.util.NoSuchElementException: Param activation does not exist.

The activation param was introduced in spark-nlp==3.4.3.

Your Spark NLP Maven version does not match your spark-nlp PyPI version (one of them has this param, the other one doesn't). You could also be in a different Python/Conda env where one of them is 3.4.2, which is why you didn't get an error before.
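
A quick way to confirm both sides of that pair from the running session (a minimal sketch, assuming the SparkSession is named spark):

import sparknlp

print(sparknlp.version())  # version of the spark-nlp PyPI package
# Maven coordinates requested at session startup; the spark-nlp entry
# should carry the same version as the line above:
print(spark.sparkContext.getConf().get("spark.jars.packages"))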

A simple working example: https://colab.research.google.com/drive/1K01lkBEIE4dGTqIIj9CwG3Iu0wtgFWYz?usp=sharing

jmcmt87 commented 2 years ago

It works! Thanks!

jmcmt87 commented 2 years ago

Oh no, it's not working; same issue, even after upgrading to 3.4.3 and using the package spark-nlp-spark32_2.12:3.4.3.

maziyarpanahi commented 2 years ago

There must be a mismatch somewhere in your setup for sure; as you saw in the Colab, there is no issue with the release or the model. (Just declaring a dependency doesn't mean you actually have that version: something can be cached, or it can pull something else from somewhere else. I would start looking closely there.)
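
One way to rule out a stale artifact is to clear the locally cached Spark NLP jars so the next session resolves them fresh, and to reinstall the Python package pinned to the same version in the exact interpreter the session uses (a sketch, assuming the default Ivy cache location and that spark.jars.ivy has not been overridden):

import shutil
from pathlib import Path

# spark.jars.packages resolves artifacts into the local Ivy cache; deleting
# the Spark NLP entry forces a fresh download on the next session start.
cache = Path.home() / ".ivy2" / "cache" / "com.johnsnowlabs.nlp"
if cache.exists():
    shutil.rmtree(cache)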

jmcmt87 commented 2 years ago

As for the working example in Colab, I can also get the classes there, so that doesn't seem to be the issue. I will check it further.

jmcmt87 commented 2 years ago

It works, you were right