JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0

Using a pretrained Universal Sentence Encoder model saved locally #847

Closed ntaherkhani closed 4 years ago

ntaherkhani commented 4 years ago

I am trying to do embedding using a pre-trained Universal Sentence Encoder model (universal-sentence-encoder-multilingual-large). How can I load the locally saved model? When I set a locally saved folder for the "remote_loc" arg in UniversalSentenceEncoder.pretrained, I get the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize.
: java.lang.NoClassDefFoundError: com/amazonaws/AmazonClientException

I don't want to use AWS for loading the pretrained model. Could you please help me with this issue? Thanks.

maziyarpanahi commented 4 years ago

Please check the list of available models and pipelines here; if you don't want to download from S3 (which only happens once), you can manually download a model and load() it, as we discussed before: https://github.com/JohnSnowLabs/spark-nlp-models

PS: there is no such thing as universal-sentence-encoder-multilingual-large among our models; only the models/pipelines available in that list can be used.

Check the documentation for more info: https://nlp.johnsnowlabs.com/docs/en/models
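
For instance, a minimal sketch of the offline workflow (the local path below is illustrative): download the model ZIP from the models repository, unzip it, and point load() at the extracted folder.

# Sketch of loading a manually downloaded model from a local folder
# (the path is illustrative; use wherever you extracted the ZIP)
from sparknlp.annotator import UniversalSentenceEncoder

use = UniversalSentenceEncoder.load("/path/to/tfhub_use_lg_en_2.4.0_2.4_1580583670712") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

Note that load() still needs a SparkSession that carries the Spark NLP JAR or package.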

ntaherkhani commented 4 years ago

I downloaded the "UniversalSentenceEncoder" model "tfhub_use_lg" and tried to load it using the following line:

use = UniversalSentenceEncoder.load("/........./Downloads/tfhub_use_lg_en_2.4.0_2.4_1580583670712/")

but I am still getting an error:

py4j.protocol.Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder.read.
: java.lang.NoClassDefFoundError: com/amazonaws/AmazonClientException

Did I miss something?

maziyarpanahi commented 4 years ago

Please copy the entire error, along with the versions of Spark NLP and Apache Spark.

I moved this to another repo, it’s not about our examples.

ntaherkhani commented 4 years ago

Thanks for your reply.

Spark NLP version: 2.4.5, PySpark version: 2.4.5, and I use spark-2.4.3-bin-hadoop2.7.

I just tried a very simple sample.

This is the entire simple code:

from pyspark.context import SparkContext
from pyspark.conf import SparkConf
from pyspark.sql import SQLContext

sparkContext = SparkContext(conf=SparkConf()
    .set("spark.driver.extraClassPath", "/Users/....../Downloads/spark-nlp_2.11-2.4.5.jar")
    .set("spark.executor.extraClassPath", "/Users/....../Downloads/spark-nlp_2.11-2.4.5.jar"))

sqlContext = SQLContext(sparkContext)

from sparknlp.annotator import *

use = UniversalSentenceEncoder.load("/Users/..../Downloads/tfhub_use_lg_en_2.4.0_2.4_1580583670712/") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

And this is the entire error:

Traceback (most recent call last):
  File "/Users/...../Library/Application Support/IntelliJIdea2019.1/python/helpers/pydev/pydevd.py", line 1741, in <module>
    main()
  File "/Users/......../Library/Application Support/IntelliJIdea2019.1/python/helpers/pydev/pydevd.py", line 1735, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/Users/......../Library/Application Support/IntelliJIdea2019.1/python/helpers/pydev/pydevd.py", line 1135, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/Users/......../Library/Application Support/IntelliJIdea2019.1/python/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/Users/........../test.py", line 76, in <module>
    use = UniversalSentenceEncoder.load("/Users/....../Downloads/tfhub_use_lg_en_2.4.0_2.4_1580583670712/") \
  File "/Users/......./tmp/pythonML3.6/lib/python3.6/site-packages/pyspark/ml/util.py", line 362, in load
    return cls.read().load(path)
  File "/Users/........./tmp/pythonML3.6/lib/python3.6/site-packages/sparknlp/internal.py", line 51, in read
    return AnnotatorJavaMLReader(cls())
  File "/Users/......./tmp/pythonML3.6/lib/python3.6/site-packages/pyspark/ml/util.py", line 294, in __init__
    self._jread = self._load_java_obj(clazz).read()
  File "/Users/......../tmp/pythonML3.6/lib/python3.6/site-packages/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/Users/......./tmp/pythonML3.6/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/Users/......../tmp/pythonML3.6/lib/python3.6/site-packages/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder.read.
: java.lang.NoClassDefFoundError: com/amazonaws/AmazonClientException
    at com.johnsnowlabs.nlp.HasPretrained$class.$init$(HasPretrained.scala:16)
    at com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder$.<init>(UniversalSentenceEncoder.scala:146)
    at com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder$.<clinit>(UniversalSentenceEncoder.scala)
    at com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder.read(UniversalSentenceEncoder.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: com.amazonaws.AmazonClientException
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 15 more

maziyarpanahi commented 4 years ago

Thanks for sharing the code and error. The code looks good, but how are you starting the SparkSession? More precisely, where are you executing this code: in a Python console, in the pyspark shell, etc.? I don't see sparknlp.start() to start the session, so I assume you either start it yourself or it is already started, as in the pyspark shell.

ntaherkhani commented 4 years ago

Thanks for the quick response. I added sparknlp.start() to the code, but it didn't help. I am using IntelliJ and debug the sample code from this IDE.

I added aws-java-sdk-core.jar and aws-java-sdk-s3 to the $SPARK_HOME/jars folder, and the AmazonClientException class-not-found error is gone. However, another error comes up:

py4j.protocol.Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder.read.
: java.lang.NoClassDefFoundError: com/typesafe/config/ConfigMergeable

So I think Spark NLP's requirements didn't install correctly. Do you have any clue for fixing this issue? Thanks in advance.

maziyarpanahi commented 4 years ago

IntelliJ starts the SparkSession before the actual sparknlp.start(). The way to test this is:

$ java -version
# should be Java 8 (Oracle or OpenJDK)
$ conda create -n sparknlp python=3.6 -y
$ conda activate sparknlp
$ pip install spark-nlp==2.4.5 pyspark==2.4.4

Then go to a Python console and run the code with sparknlp.start():

# Import Spark NLP
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.pretrained import PretrainedPipeline
import sparknlp

# Start Spark Session with Spark NLP
spark = sparknlp.start()

# the rest of the code

If this works, then the issue is how you are importing Spark NLP in your project. You are writing Python in IntelliJ, so you have pip-installed spark-nlp, but you are missing the fat JAR inside the SparkSession. You can get the fat JAR from here: https://github.com/JohnSnowLabs/spark-nlp/releases/tag/2.4.5

The code you have works perfectly when the SparkSession has either the Spark NLP package or the JAR, so you need to see how to add it to your application.
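
For example, a minimal sketch of building such a session yourself (the JAR path is illustrative, and the Maven coordinates assume the Scala 2.11 build of 2.4.5):

from pyspark.sql import SparkSession

# Sketch: attach the Spark NLP fat JAR to both the driver and the executors
# (the path is illustrative; download the JAR from the release page above)
spark = SparkSession.builder \
    .appName("Spark NLP") \
    .master("local[*]") \
    .config("spark.jars", "/path/to/spark-nlp-assembly-2.4.5.jar") \
    .getOrCreate()

# Alternatively, let Spark resolve the package and its transitive
# dependencies (Typesafe Config, AWS SDK, etc.) from Maven:
#   .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.4.5")

The spark.jars.packages route also avoids ConfigMergeable-style errors like the one above, since the transitive dependencies come along automatically.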

ntaherkhani commented 4 years ago

Thanks for your response. After adding the fat JAR, the USE model can be loaded and used to transform in IntelliJ, but when I try to use spark-submit it is not working. I am working on a Mac with 16 GB of RAM, and in IntelliJ I set 'spark.driver.memory' to '15g' in the SparkContext's config. With spark-submit I played with driver-memory and executor-memory, but it didn't help. If I set small values for these two parameters, I get:

Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

If I set a big value for driver-memory:

[ WARN] Lost task 0.0 in stage 1.0 (TID 1, 192.168.0.3, executor 0): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$dfAnnotate$1: (array<array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>>) => array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_2_piece62 of broadcast_2
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1333)
    at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207)
    at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
    at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
    at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
    at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
    at com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder.getModelIfNotSet(UniversalSentenceEncoder.scala:45)
    at com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder.annotate(UniversalSentenceEncoder.scala:75)
    at com.johnsnowlabs.nlp.AnnotatorModel$$anonfun$dfAnnotate$1.apply(AnnotatorModel.scala:35)
    at com.johnsnowlabs.nlp.AnnotatorModel$$anonfun$dfAnnotate$1.apply(AnnotatorModel.scala:34)
    ... 21 more
Caused by: org.apache.spark.SparkException: Failed to get broadcast_2_piece62 of broadcast_2
    at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:179)
    at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:151)
    at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:151)
    at scala.collection.immutable.List.foreach(List.scala:392)
    at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:151)
    at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$apply$2.apply(TorrentBroadcast.scala:231)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:211)
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1326)
    ... 30 more

This issue happens for the BERT model as well; with it, even in IntelliJ I cannot load and transform with my model.

How can I fix these issues, and how much memory do these models need?

maziyarpanahi commented 4 years ago

I am sorry, but this is really a development question rather than a Spark NLP question. I cannot debug your application without having the entire project here; it's much more complicated, and a lot is involved in spark-submit: local paths, a fat JAR that contains another fat JAR, and so on.

If the library works in local mode using Python or Scala in spark-shell or IntelliJ, then the library works. Therefore, you need to debug your own application: how you are assembling a JAR for spark-submit, how you are creating the SparkSession, the master, etc.

Spark NLP is tested before each release in many different situations on many different clusters, so if it fails, it must be your config and how you are creating the final JAR.
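
For reference, a minimal spark-submit sketch along those lines (the master, memory values, and paths are illustrative, not a verified configuration):

$ spark-submit \
    --master local[*] \
    --driver-memory 8g \
    --jars /path/to/spark-nlp-assembly-2.4.5.jar \
    your_script.py
# or, instead of --jars, let Spark resolve the dependencies:
#   --packages com.johnsnowlabs.nlp:spark-nlp_2.11:2.4.5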

While I cannot debug your whole project (it's not related to Spark NLP), you need to read the errors more carefully and search for the cause:

Caused by: java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_2_piece62 of broadcast_2