JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0
3.87k stars · 712 forks

py4j.protocol.Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize. #8445

Closed · mohcinsarrar closed this issue 2 years ago

mohcinsarrar commented 2 years ago

Description

I am trying to download the pre-trained model finBERT, but I get this error: py4j.protocol.Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize.

code

sequenceClassifier = BertForSequenceClassification \
    .pretrained("bert_sequence_classifier_finbert", "en") \
    .setInputCols(['token', 'document']) \
    .setOutputCol('class') \
    .setCaseSensitive(True) \
    .setMaxSentenceLength(512)

I execute the script with:

spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.1,com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.4 --master spark://spark:7077 /home/twitterConsumer.py

I don't know what the cause could be. Can anyone help me? Thanks in advance for a quick response.

Error

py4j.protocol.Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize.
: java.lang.NoClassDefFoundError: org/json4s/package$MappingException
  at org.json4s.ext.EnumNameSerializer.deserialize(EnumSerializer.scala:53)
  at org.json4s.Formats$$anonfun$customDeserializer$1.applyOrElse(Formats.scala:66)
  at org.json4s.Formats$$anonfun$customDeserializer$1.applyOrElse(Formats.scala:66)
  at scala.collection.TraversableOnce.collectFirst(TraversableOnce.scala:180)
  at scala.collection.TraversableOnce.collectFirst$(TraversableOnce.scala:167)
  at scala.collection.AbstractTraversable.collectFirst(Traversable.scala:108)
  at org.json4s.Formats$.customDeserializer(Formats.scala:66)
  at org.json4s.Extraction$.customOrElse(Extraction.scala:775)
  at org.json4s.Extraction$.extract(Extraction.scala:454)
  at org.json4s.Extraction$.extract(Extraction.scala:56)
  at org.json4s.ExtractableJsonAstNode.extract(ExtractableJsonAstNode.scala:22)
  at com.johnsnowlabs.util.JsonParser$.parseObject(JsonParser.scala:28)
  at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$.parseJson(ResourceMetadata.scala:106)
  at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$$anonfun$readResources$1.applyOrElse(ResourceMetadata.scala:138)
  at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$$anonfun$readResources$1.applyOrElse(ResourceMetadata.scala:136)
  at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
  at scala.collection.Iterator$$anon$13.next(Iterator.scala:593)
  at scala.collection.Iterator.foreach(Iterator.scala:943)
  at scala.collection.Iterator.foreach$(Iterator.scala:943)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
  at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
  at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
  at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:184)
  at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:47)
  at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
  at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
  at scala.collection.AbstractIterator.to(Iterator.scala:1431)
  at scala.collection.TraversableOnce.toList(TraversableOnce.scala:350)
  at scala.collection.TraversableOnce.toList$(TraversableOnce.scala:350)
  at scala.collection.AbstractIterator.toList(Iterator.scala:1431)
  at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$.readResources(ResourceMetadata.scala:136)
  at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$.readResources(ResourceMetadata.scala:130)
  at com.johnsnowlabs.client.aws.AWSGateway.getMetadata(AWSGateway.scala:94)
  at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.downloadMetadataIfNeed(S3ResourceDownloader.scala:65)
  at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.resolveLink(S3ResourceDownloader.scala:70)
  at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.getDownloadSize(S3ResourceDownloader.scala:204)
  at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.getDownloadSize(ResourceDownloader.scala:544)
  at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.getDownloadSize(ResourceDownloader.scala:714)
  at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize(ResourceDownloader.scala)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
  at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
  at py4j.Gateway.invoke(Gateway.java:282)
  at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
  at py4j.commands.CallCommand.execute(CallCommand.java:79)
  at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
  at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
  at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.ClassNotFoundException: org.json4s.package$MappingException
  at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:351)

Your Environment

maziyarpanahi commented 2 years ago

Could you please make sure you are using com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.4 since you are using Apache Spark 3.2.x? https://github.com/JohnSnowLabs/spark-nlp#packages-cheatsheet
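Applying that suggestion to the spark-submit from the report would look something like the following sketch (the master URL and script path are kept from the original; only the Spark NLP artifact changes):

```
spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.1,com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.4 \
  --master spark://spark:7077 \
  /home/twitterConsumer.py
```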

mohcinsarrar commented 2 years ago

I tried the package com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.4 and the download starts, but I now get the following error: Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel.

maziyarpanahi commented 2 years ago

Could you please share what’s inside /twitterConsumer.py? There must be an issue with the SparkSession you are creating there.

mohcinsarrar commented 2 years ago

Hi, thanks for your reply. This is my code; it worked fine before I added "bert_sequence_classifier_finbert".

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, regexp_replace
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, BertForSequenceClassification

spark = SparkSession.builder\
                    .appName("twitter Consumer")\
                    .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")

dsraw = spark\
  .readStream\
  .format("kafka")\
  .option("kafka.bootstrap.servers", "kafka:9092")\
  .option("subscribe", "twitterAPI")\
  .load()\
  .selectExpr("CAST(value AS STRING)")

tweet_schema = StructType([StructField("date", StringType()), StructField("text", StringType())])

def parse_data_from_kafka_message(sdf, schema):

  assert sdf.isStreaming == True, "DataFrame doesn't receive streaming data"
  col = split(sdf['value'], ';') #split attributes to nested array in one Column
  #now expand col to multiple top-level columns
  for idx, field in enumerate(schema): 
      sdf = sdf.withColumn(field.name, col.getItem(idx).cast(field.dataType))
      sdf = sdf.withColumn(field.name, regexp_replace(field.name, '"', ''))
  return sdf.select([field.name for field in schema])

sdfTweet = parse_data_from_kafka_message(dsraw, tweet_schema)

document_assembler = DocumentAssembler()\
    .setInputCol('text')\
    .setOutputCol('document')

tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

sequenceClassifier = BertForSequenceClassification \
      .pretrained("bert_sequence_classifier_finbert", "en") \
      .setInputCols(['token', 'document']) \
      .setOutputCol('class') \
      .setCaseSensitive(True) \
      .setMaxSentenceLength(512)

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    sequenceClassifier
])
......
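As a sanity check outside Spark, the per-message transformation that parse_data_from_kafka_message applies (split the raw Kafka value on ';', then strip double quotes from each field) can be sketched in plain Python. The helper parse_message and the sample payload below are hypothetical illustrations, not part of the original script:

```python
def parse_message(value, field_names):
    """Mirror of the streaming logic: split the raw Kafka value on ';'
    and strip double quotes from each resulting field."""
    parts = value.split(";")
    return {name: parts[idx].replace('"', "")
            for idx, name in enumerate(field_names)}

# Example payload in the date;text shape defined by tweet_schema
msg = '"2022-05-01 10:00:00";"FinBERT test tweet"'
print(parse_message(msg, ["date", "text"]))
# → {'date': '2022-05-01 10:00:00', 'text': 'FinBERT test tweet'}
```

Note this sketch assumes the message text itself contains no ';' characters, which is also a limitation of the original split-based approach.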
maziyarpanahi commented 2 years ago

In addition to the change I mentioned for your spark-submit, could you please replace your SparkSession with the following:

spark = SparkSession.builder \
    .appName("Spark NLP")\
    .master("local[4]")\
    .config("spark.driver.memory","16G")\
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.kryoserializer.buffer.max", "2000M")\
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.4")\
    .getOrCreate()
mohcinsarrar commented 2 years ago

I have tried your code, and I get a new error:

py4j.protocol.Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel.
: java.lang.OutOfMemoryError: Physical memory usage is too high: physicalBytes (2195M) > maxPhysicalBytes (1821M)
maziyarpanahi commented 2 years ago

I think the error is clear: what you are doing requires more memory. Please make sure you set the correct memory config (driver and executors) in your spark-submit, otherwise it runs with 2G, which is far less than the requirements.
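For reference, driver and executor memory can be set directly on the spark-submit command line; the sizes below are illustrative examples, not prescribed values, and the rest of the command is kept from the report:

```
spark-submit \
  --driver-memory 8G \
  --executor-memory 8G \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.1,com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.4 \
  --master spark://spark:7077 \
  /home/twitterConsumer.py
```

Setting these at submit time matters because spark.driver.memory configured inside an already-running SparkSession has no effect on the driver JVM that was launched by spark-submit.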

mohcinsarrar commented 2 years ago

Hi, can you please tell me how to configure the memory? My computer has 16G of RAM, and I have two containers, one master and one worker (I have configured SPARK_WORKER_MEMORY=8G in the environment variables).

mohcinsarrar commented 2 years ago

Hi, I solved the error. The problem was that the SparkSession config was not taking effect, so I set the driver memory at runtime instead.