JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0

Pretrained models from disk #7036

Closed GonzaloRuizGit closed 2 years ago

GonzaloRuizGit commented 2 years ago

I'm trying to run a simple example using a pre-trained pipeline from the Spark NLP library. I get an error when I'm downloading the pipeline:

code

import os

from pyspark.sql import SparkSession
from pyspark import SparkConf

def load_spark():
    os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars /usr/lib/hadoop-lzo/lib/*,/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar,/usr/share/aws/aws-java-sdk/*,/usr/share/aws/emr/emrfs/conf,/usr/share/aws/emr/emrfs/lib/*,/usr/share/aws/emr/emrfs/auxlib/*,/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar,/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar,/home/hadoop/libs/* --packages org.apache.hadoop:hadoop-aws:3.2.2,com.johnsnowlabs.nlp:spark-nlp_2.12:3.1.3 pyspark-shell'

    conf = SparkConf()

    conf.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
    conf.set("fs.s3a.access.key", "")
    conf.set("fs.s3a.secret.key", "")
    conf.set("spark.hadoop.fs.s3a.committer.name", "magic")

    spark = SparkSession.builder \
        .appName("Spark NLP") \
        .config("spark.executor.cores", "13") \
        .config("spark.executor.memory", "47G") \
        .config("spark.driver.memory", "47G") \
        .config("spark.driver.cores", "13") \
        .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:3.1.2") \
        .config(conf=conf) \
        .getOrCreate()
    spark.sparkContext.setLogLevel("WARN")

    return spark

# usage (in a separate module):
from configuration.load_pyspark import load_spark
from sparknlp.pretrained import PretrainedPipeline

spark = load_spark()
pipeline_en_es = PretrainedPipeline("translate_en_es", lang="xx")

error

Traceback (most recent call last):
  File "/usr/lib64/python3.7/code.py", line 90, in runcode
    exec(code, self.locals)
  File "<input>", line 1, in <module>
  File "/home/hadoop/.local/lib/python3.7/site-packages/sparknlp/pretrained.py", line 141, in __init__
    self.model = ResourceDownloader().downloadPipeline(name, lang, remote_loc)
  File "/home/hadoop/.local/lib/python3.7/site-packages/sparknlp/pretrained.py", line 72, in downloadPipeline
    file_size = _internal._GetResourceSize(name, language, remote_loc).apply()
  File "/home/hadoop/.local/lib/python3.7/site-packages/sparknlp/internal.py", line 232, in __init__
    "com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize", name, language, remote_loc)
  File "/home/hadoop/.local/lib/python3.7/site-packages/sparknlp/internal.py", line 165, in __init__
    self._java_obj = self.new_java_obj(java_obj, *args)
  File "/home/hadoop/.local/lib/python3.7/site-packages/sparknlp/internal.py", line 175, in new_java_obj
    return self._new_java_obj(java_class, *args)
  File "/home/hadoop/.local/lib/python3.7/site-packages/pyspark/ml/wrapper.py", line 66, in _new_java_obj
    return java_obj(*java_args)
  File "/home/hadoop/.local/lib/python3.7/site-packages/py4j/java_gateway.py", line 1322, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/home/hadoop/.local/lib/python3.7/site-packages/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/home/hadoop/.local/lib/python3.7/site-packages/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize.
: java.lang.NoClassDefFoundError: org/json4s/package$MappingException
    at org.json4s.ext.EnumNameSerializer.deserialize(EnumSerializer.scala:53)
    at org.json4s.Formats$$anonfun$customDeserializer$1.applyOrElse(Formats.scala:66)
    at org.json4s.Formats$$anonfun$customDeserializer$1.applyOrElse(Formats.scala:66)
    at scala.collection.TraversableOnce.collectFirst(TraversableOnce.scala:180)
    at scala.collection.TraversableOnce.collectFirst$(TraversableOnce.scala:167)
    at scala.collection.AbstractTraversable.collectFirst(Traversable.scala:108)
    at org.json4s.Formats$.customDeserializer(Formats.scala:66)
    at org.json4s.Extraction$.customOrElse(Extraction.scala:775)
    at org.json4s.Extraction$.extract(Extraction.scala:454)
    at org.json4s.Extraction$.extract(Extraction.scala:56)
    at org.json4s.ExtractableJsonAstNode.extract(ExtractableJsonAstNode.scala:22)
    at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$.parseJson(ResourceMetadata.scala:80)
    at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$$anonfun$readResources$1.applyOrElse(ResourceMetadata.scala:108)
    at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$$anonfun$readResources$1.applyOrElse(ResourceMetadata.scala:107)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
    at scala.collection.Iterator$$anon$13.next(Iterator.scala:593)
    at scala.collection.Iterator.foreach(Iterator.scala:943)
    at scala.collection.Iterator.foreach$(Iterator.scala:943)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
    at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
    at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
    at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:184)
    at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:47)
    at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
    at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
    at scala.collection.AbstractIterator.to(Iterator.scala:1431)
    at scala.collection.TraversableOnce.toList(TraversableOnce.scala:350)
    at scala.collection.TraversableOnce.toList$(TraversableOnce.scala:350)
    at scala.collection.AbstractIterator.toList(Iterator.scala:1431)
    at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$.readResources(ResourceMetadata.scala:107)
    at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$.readResources(ResourceMetadata.scala:102)
    at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.downloadMetadataIfNeed(S3ResourceDownloader.scala:85)
    at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.resolveLink(S3ResourceDownloader.scala:96)
    at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.getDownloadSize(S3ResourceDownloader.scala:174)
    at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.getDownloadSize(ResourceDownloader.scala:420)
    at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.getDownloadSize(ResourceDownloader.scala:526)
    at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize(ResourceDownloader.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.ClassNotFoundException: org.json4s.package$MappingException
    at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
    ... 49 more

Then I tried downloading the model and loading it from disk using sparknlp.start(), which turned out to work properly:

code

import tempfile

import sparknlp
from sparknlp.pretrained import PretrainedPipeline
from cloudpathlib import CloudPath

spark = sparknlp.start()

cp = CloudPath("s3://path/translate_en_es_xx_3.1.0_2.4_1622841779094")
with tempfile.NamedTemporaryFile() as tf:
    cp.download_to(tf.name+"test")

pipeline_en_es = PretrainedPipeline.from_disk(tf.name+"test")

The problem with this method, however, is that I am unable to read a .parquet file from AWS S3.
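
A possible workaround (not tried here) is to set the S3A options on the Hadoop configuration of the session that sparknlp.start() creates. This is only a sketch: it assumes org.apache.hadoop:hadoop-aws is already on the classpath, and the bucket path and credential values are placeholders.

import sparknlp

spark = sparknlp.start()

# Configure S3A at runtime on the session's Hadoop configuration
# (assumes hadoop-aws is already on the classpath).
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.aws.credentials.provider",
                "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
hadoop_conf.set("fs.s3a.access.key", "<ACCESS_KEY>")
hadoop_conf.set("fs.s3a.secret.key", "<SECRET_KEY>")

# Placeholder path: read a parquet file from S3 via the s3a scheme.
df = spark.read.parquet("s3a://my-bucket/path/to/data.parquet")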

The final method I tried also fails with an error:

code

import tempfile

from cloudpathlib import CloudPath
from configuration.load_pyspark import load_spark
from sparknlp.pretrained import PretrainedPipeline

spark = load_spark()
cp = CloudPath("s3://path/translate_en_es_xx_3.1.0_2.4_1622841779094")
with tempfile.NamedTemporaryFile() as tf:
    cp.download_to(tf.name+"test")

pipeline_en_es = PretrainedPipeline.from_disk(tf.name+"test")

error

Traceback (most recent call last):
  File "/usr/lib64/python3.7/code.py", line 90, in runcode
    exec(code, self.locals)
  File "<input>", line 5, in <module>
  File "/home/hadoop/.local/lib/python3.7/site-packages/sparknlp/pretrained.py", line 148, in from_disk
    return PretrainedPipeline(None, None, None, parse_embeddings, path)
  File "/home/hadoop/.local/lib/python3.7/site-packages/sparknlp/pretrained.py", line 144, in __init__
    self.light_model = LightPipeline(self.model, parse_embeddings)
  File "/home/hadoop/.local/lib/python3.7/site-packages/sparknlp/base.py", line 79, in __init__
    self._lightPipeline = _internal._LightPipeline(pipelineModel, parse_embeddings).apply()
  File "/home/hadoop/.local/lib/python3.7/site-packages/sparknlp/internal.py", line 267, in __init__
    super(_LightPipeline, self).__init__("com.johnsnowlabs.nlp.LightPipeline", pipelineModel._to_java(),
  File "/home/hadoop/.local/lib/python3.7/site-packages/pyspark/ml/pipeline.py", line 333, in _to_java
    java_stages[idx] = stage._to_java()
  File "/home/hadoop/.local/lib/python3.7/site-packages/pyspark/ml/wrapper.py", line 226, in _to_java
    self._transfer_params_to_java()
  File "/home/hadoop/.local/lib/python3.7/site-packages/pyspark/ml/wrapper.py", line 146, in _transfer_params_to_java
    pair = self._make_java_param_pair(param, self._defaultParamMap[param])
  File "/home/hadoop/.local/lib/python3.7/site-packages/pyspark/ml/wrapper.py", line 132, in _make_java_param_pair
    java_param = self._java_obj.getParam(param.name)
  File "/home/hadoop/.local/lib/python3.7/site-packages/py4j/java_gateway.py", line 1322, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/home/hadoop/.local/lib/python3.7/site-packages/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/home/hadoop/.local/lib/python3.7/site-packages/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o143.getParam.
: java.util.NoSuchElementException: Param ignoreTokenIds does not exist.
    at org.apache.spark.ml.param.Params.$anonfun$getParam$2(params.scala:705)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.ml.param.Params.getParam(params.scala:705)
    at org.apache.spark.ml.param.Params.getParam$(params.scala:703)
    at org.apache.spark.ml.PipelineStage.getParam(Pipeline.scala:41)
    at sun.reflect.GeneratedMethodAccessor55.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.lang.Thread.run(Thread.java:750)
maziyarpanahi commented 2 years ago

Hi, could you please properly format the code and error messages in your first comment, and also share your environment details?

GonzaloRuizGit commented 2 years ago

Hi maziyarpanahi, thanks for your quick reply. I have edited the first comment. What do you mean by environment?

maziyarpanahi commented 2 years ago

Thank you, that information is very important. (Your first error seems to be that you used a Spark NLP package built for Spark 3.0/3.1 on Spark 3.2; the second one seems to be a mismatch between your spark-nlp from PyPI and the one from Maven. So knowing the following, and everything else about your setup, would be very helpful.)

Your Environment

GonzaloRuizGit commented 2 years ago

Since I am using AWS, I am not sure of the right way to check the Java version or the setup version. So if you see something strange in the versions I gave you, I can double-check them.
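
For reference, a quick way to check the versions from a running PySpark session (a minimal sketch; sparknlp.version() and spark.version are the standard calls, and the Java check simply shells out on the driver):

import subprocess

import sparknlp

spark = sparknlp.start()

print("Spark NLP (PyPI):", sparknlp.version())  # version of the Python package
print("Apache Spark:", spark.version)           # version of the running Spark
subprocess.run(["java", "-version"])            # Java version prints to stderr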

Your Environment

- Spark NLP version: 3.4.1 (installed from PyPI)
- Apache Spark version: 3.2.x
- Setup: AWS EMR

maziyarpanahi commented 2 years ago

Very nice! It's enough to resolve the first issue for sure. For Apache Spark 3.2.x you need a different Spark NLP package from Maven:

Reference:

For the second issue, please make sure the Maven and PyPI versions are the same (you mentioned 3.4.1, but your code says 3.1.2; let's make sure they are both 3.4.1).
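
For illustration, aligning both sides at 3.4.1 on Spark 3.2.x would look roughly like this. This is a sketch: it assumes the Spark 3.2 Maven artifact referenced above is com.johnsnowlabs.nlp:spark-nlp-spark32_2.12, with the matching Python package installed via pip.

# PyPI side, same version: pip install spark-nlp==3.4.1
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Spark NLP") \
    .config("spark.jars.packages",
            "com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.1") \
    .getOrCreate()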

GonzaloRuizGit commented 2 years ago

Thank you very much. As you said, the problem was the spark-nlp version.