Closed ntaherkhani closed 4 years ago
Please check the list of available models and pipelines here. If you don't want to download from S3 (which only happens once), you can manually download the model and load() it, as we discussed before: https://github.com/JohnSnowLabs/spark-nlp-models
PS: there is no such thing as universal-sentence-encoder-multilingual-large in our models; only the models/pipelines available in that list can be used.
Check the documentation for more info: https://nlp.johnsnowlabs.com/docs/en/models
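For reference, offline loading usually looks something like the sketch below. This is only an illustration: the model folder path is a placeholder, and it assumes the archive from the models repository has already been downloaded and extracted locally.

```python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import UniversalSentenceEncoder

# Start a Spark session that already has the Spark NLP jar on its classpath
spark = sparknlp.start()

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Load the extracted model folder from local disk instead of downloading from S3
use = UniversalSentenceEncoder.load("/path/to/tfhub_use_lg_en_2.4.0_2.4_1580583670712") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")
```

The key point is that `load()` takes a local (or HDFS) path to the already-extracted model folder, whereas `pretrained()` tries to resolve and download the model from the remote repository.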
I downloaded "tfhub_use_lg" (the UniversalSentenceEncoder model) and tried to load it with the following line:

```python
use = UniversalSentenceEncoder.load("/........./Downloads/tfhub_use_lg_en_2.4.0_2.4_1580583670712/")
```

but I am still getting an error:

```
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder.read.
: java.lang.NoClassDefFoundError: com/amazonaws/AmazonClientException
```

Did I miss something?
Please copy the entire error, version of Spark NLP and Apache Spark.
I moved this to another repo, it’s not about our examples.
Thanks for your reply. Spark NLP version: 2.4.5, PySpark version: 2.4.5, and I use spark-2.4.3-bin-hadoop2.7.
I just tried a very simple sample.
This is the entire simple code:

```python
from pyspark.context import SparkContext
from pyspark.conf import SparkConf
from pyspark.sql import SQLContext

sparkContext = SparkContext(conf=SparkConf()
    .set("spark.driver.extraClassPath", "/Users/....../Downloads/spark-nlp_2.11-2.4.5.jar")
    .set("spark.executor.extraClassPath", "/Users/....../Downloads/spark-nlp_2.11-2.4.5.jar"))
sqlContext = SQLContext(sparkContext)

from sparknlp.annotator import *

use = UniversalSentenceEncoder.load("/Users/..../Downloads/tfhub_use_lg_en_2.4.0_2.4_1580583670712/") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")
```
And this is the entire error:

```
Traceback (most recent call last):
  File "/Users/...../Library/Application Support/IntelliJIdea2019.1/python/helpers/pydev/pydevd.py", line 1741, in
```
Thanks for sharing the code and error. The code looks good, but how are you starting the SparkSession? More precisely, where are you executing this code: in a Python console, the pyspark shell, etc.? I don't see sparknlp.start() to start the session, so I assume you either start it yourself or it is already started, as it would be in the pyspark shell.
Thanks for the quick response. I added sparknlp.start() to the code, but it didn't help. I am using IntelliJ and debugging the sample code from the IDE.
I added aws-java-sdk-core.jar and aws-java-sdk-s3 to the $SPARK_HOME/jars folder, and the AmazonClientException NoClassDefFoundError is gone. However, another error comes up:

```
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder.read.
: java.lang.NoClassDefFoundError: com/typesafe/config/ConfigMergeable
```

So I think Spark NLP's requirements didn't install correctly. Do you have any clue for fixing this issue? Thanks in advance.
IntelliJ starts the SparkSession before the actual sparknlp.start(). The way to test this is:
```shell
$ java -version
# should be Java 8 (Oracle or OpenJDK)
$ conda create -n sparknlp python=3.6 -y
$ conda activate sparknlp
$ pip install spark-nlp==2.4.5 pyspark==2.4.4
```
Then go to the Python console and run the code with sparknlp.start():
```python
# Import Spark NLP
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.pretrained import PretrainedPipeline
import sparknlp

# Start Spark Session with Spark NLP
spark = sparknlp.start()

# the rest of the code
```
If this works, then the issue is how you are importing Spark NLP into your project. You are running Python in IntelliJ, so you have pip-installed spark-nlp, but you are missing the Fat JAR inside the SparkSession. You can get the Fat JAR from here: https://github.com/JohnSnowLabs/spark-nlp/releases/tag/2.4.5

The code you have works perfectly when the SparkSession has either the Spark NLP package or JAR, so you need to see how to add it to your application.
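When building the session yourself instead of using sparknlp.start(), attaching the Fat JAR might look like the sketch below. The JAR path and app name are placeholders; point the path at wherever you saved the downloaded Fat JAR.

```python
from pyspark.sql import SparkSession

# Build a SparkSession with the Spark NLP Fat JAR available to both driver
# and executors; spark.jars distributes the JAR to the whole application.
spark = SparkSession.builder \
    .appName("spark-nlp-test") \
    .master("local[*]") \
    .config("spark.jars", "/path/to/spark-nlp-assembly-2.4.5.jar") \
    .getOrCreate()
```

With the JAR on the session's classpath, the pip-installed Python package can find its JVM counterpart and `UniversalSentenceEncoder.load()` should resolve.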
Thanks for your response. After adding the Fat JAR, the USE model can be loaded and transformed in IntelliJ, but when I try to use spark-submit it does not work. I am working on a Mac with 16 GB of RAM, and in IntelliJ I set 'spark.driver.memory' to '15g' in the SparkContext's config. With spark-submit I played with driver-memory and executor-memory, but it didn't help. If I set small values for these two parameters, I get:

Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

If I set a big value for driver-memory:
```
[ WARN] Lost task 0.0 in stage 1.0 (TID 1, 192.168.0.3, executor 0): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$dfAnnotate$1: (array<array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>>) => array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_2_piece62 of broadcast_2
	at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1333)
	at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207)
	at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
	at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
	at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
	at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
	at com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder.getModelIfNotSet(UniversalSentenceEncoder.scala:45)
	at com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder.annotate(UniversalSentenceEncoder.scala:75)
	at com.johnsnowlabs.nlp.AnnotatorModel$$anonfun$dfAnnotate$1.apply(AnnotatorModel.scala:35)
	at com.johnsnowlabs.nlp.AnnotatorModel$$anonfun$dfAnnotate$1.apply(AnnotatorModel.scala:34)
	... 21 more
Caused by: org.apache.spark.SparkException: Failed to get broadcast_2_piece62 of broadcast_2
	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:179)
	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:151)
	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:151)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:151)
	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$apply$2.apply(TorrentBroadcast.scala:231)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:211)
	at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1326)
	... 30 more
```
This issue is happening with the BERT model as well; however, now I cannot load and transform the model in IntelliJ either. How can I fix these issues, and how much memory do these models need?
I am sorry, but this is really a development question rather than a Spark NLP question. I cannot debug your application without having the entire project here; a lot is involved in spark-submit: local paths, a Fat JAR that contains another Fat JAR, and so on.

If the library works in local mode using Python, or Scala in spark-shell or IntelliJ, then the library works. Therefore, you need to debug your own application: how you are assembling a JAR for spark-submit, how you are creating the SparkSession, the master, etc.

Spark NLP is tested before each release in many different situations on many different clusters, so if it fails, it must be your config and how you are creating the final JAR.
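Regarding memory, heavy pretrained models like USE and BERT generally need a generously sized driver and a larger Kryo buffer. A minimal sketch, where the exact values are assumptions to be tuned for your machine rather than official requirements:

```python
from pyspark.sql import SparkSession

# Memory-related settings commonly raised for large pretrained models.
# Values below are illustrative; tune them to your hardware.
spark = SparkSession.builder \
    .appName("spark-nlp-memory-test") \
    .master("local[*]") \
    .config("spark.driver.memory", "12g") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "1000M") \
    .getOrCreate()
```

These settings must be applied before the session is created; with spark-submit, the equivalent values go on the command line (e.g. --driver-memory) or in spark-defaults.conf, since driver memory set in code after launch has no effect.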
While I cannot debug your whole project (it's not related to Spark NLP), you need to read the errors more carefully and search for the cause:
Caused by: java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_2_piece62 of broadcast_2
I am trying to do embedding using a pre-trained Universal Sentence Encoder model (universal-sentence-encoder-multilingual-large). How can I load the locally saved model? When I set a locally saved folder as the "remote_loc" arg of UniversalSentenceEncoder.pretrained, I get the following error:

```
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize.
: java.lang.NoClassDefFoundError: com/amazonaws/AmazonClientException
```

I don't want to use AWS for loading the pretrained model. Could you please help me with this issue? Thanks.