Closed mohcinsarrar closed 2 years ago
Could you please make sure you are using com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.4
since you are using Apache Spark 3.2.x?
https://github.com/JohnSnowLabs/spark-nlp#packages-cheatsheet
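The advice above amounts to matching the Spark NLP artifact to the Apache Spark minor version. A minimal sketch of that lookup follows; the helper name `spark_nlp_coordinate` is hypothetical, and the mapping is an assumption based on the 3.4.x packages cheatsheet linked above (only the 3.2.x entry is confirmed in this thread), so please check the cheatsheet for your release.

```python
def spark_nlp_coordinate(spark_version: str, spark_nlp_version: str = "3.4.4") -> str:
    """Return the Maven coordinate to pass to --packages or spark.jars.packages.

    Hypothetical helper: the artifact table below is an assumption taken from
    the Spark NLP 3.4.x cheatsheet, not an official API.
    """
    # Keep only "major.minor", e.g. "3.2.1" -> "3.2"
    major_minor = ".".join(spark_version.split(".")[:2])
    artifacts = {
        "3.0": "spark-nlp_2.12",
        "3.1": "spark-nlp_2.12",
        "3.2": "spark-nlp-spark32_2.12",  # the artifact recommended in this thread
    }
    if major_minor not in artifacts:
        raise ValueError(f"No known Spark NLP artifact for Spark {spark_version}")
    return f"com.johnsnowlabs.nlp:{artifacts[major_minor]}:{spark_nlp_version}"


print(spark_nlp_coordinate("3.2.1"))
```

With Spark 3.2.1 this yields the `spark-nlp-spark32_2.12` coordinate rather than the plain `spark-nlp_2.12` one.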
I tried the package com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.4. The download starts, but I get the following error: Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel.
Could you please share what’s inside /twitterConsumer.py? There must be an issue with SparkSession you are creating there.
Hi, thanks for your reply. This is my code; it worked fine before I added "bert_sequence_classifier_finbert".
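# Imports required by the snippet below
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace, split
from pyspark.sql.types import StringType, StructField, StructType
from sparknlp.annotator import BertForSequenceClassification, Tokenizer
from sparknlp.base import DocumentAssembler

spark = SparkSession.builder\
    .appName("twitter Consumer")\
    .getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

# Read the raw Kafka stream and keep the message value as a string
dsraw = spark\
    .readStream\
    .format("kafka")\
    .option("kafka.bootstrap.servers", "kafka:9092")\
    .option("subscribe", "twitterAPI")\
    .load()\
    .selectExpr("CAST(value AS STRING)")

tweet_schema = StructType([StructField("date", StringType()), StructField("text", StringType())])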
def parse_data_from_kafka_message(sdf, schema):
    assert sdf.isStreaming, "DataFrame doesn't receive streaming data"
    col = split(sdf['value'], ';')  # split attributes into a nested array in one column
    # now expand col into multiple top-level columns
    for idx, field in enumerate(schema):
        sdf = sdf.withColumn(field.name, col.getItem(idx).cast(field.dataType))
        # strip the double quotes left around each value
        sdf = sdf.withColumn(field.name, regexp_replace(field.name, '"', ''))
    return sdf.select([field.name for field in schema])

sdfTweet = parse_data_from_kafka_message(dsraw, tweet_schema)
document_assembler = DocumentAssembler()\
    .setInputCol('text')\
    .setOutputCol('document')

tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

sequenceClassifier = BertForSequenceClassification \
    .pretrained("bert_sequence_classifier_finbert", "en") \
    .setInputCols(['token', 'document']) \
    .setOutputCol('class') \
    .setCaseSensitive(True) \
    .setMaxSentenceLength(512)

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    sequenceClassifier
])
......
In addition to the change I mentioned for your spark-submit, could you please replace your SparkSession with the following:
spark = SparkSession.builder \
    .appName("Spark NLP")\
    .master("local[4]")\
    .config("spark.driver.memory","16G")\
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.kryoserializer.buffer.max", "2000M")\
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.4")\
    .getOrCreate()
I have tried your code, and now I get a new error:
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel.
: java.lang.OutOfMemoryError: Physical memory usage is too high: physicalBytes (2195M) > maxPhysicalBytes (1821M)
I think the error is clear: what you are doing requires more memory. Please make sure you set the correct memory config (driver and executors) in your spark-submit; otherwise it runs with 2G, which is far less than required.
Hi, can you please tell me how to configure the memory? My computer has 16G of RAM, and I have two containers, one master and one worker (I have set SPARK_WORKER_MEMORY=8G in the environment variables).
Hi, I solved the error. The problem was that the SparkSession config was not taking effect, so I defined the driver memory at runtime.
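For context, the Spark configuration documentation notes that in client mode spark.driver.memory must not be set through SparkConf inside the application, because the driver JVM has already started by then; it should be passed with --driver-memory (or the default properties file) instead. The sketch below assembles such a spark-submit command. It is only an illustration: the helper name `build_spark_submit` is hypothetical, the memory values are placeholders rather than settings confirmed in this thread, and the script path, master URL and packages are taken from the posts here.

```python
import shlex


def build_spark_submit(script, master, packages, driver_memory, executor_memory=None):
    """Assemble a spark-submit argument list with explicit memory flags.

    Hypothetical helper; --master, --packages, --driver-memory and
    --executor-memory are standard spark-submit options.
    """
    cmd = [
        "spark-submit",
        "--master", master,
        "--packages", ",".join(packages),
        "--driver-memory", driver_memory,  # set at launch, before the driver JVM starts
    ]
    if executor_memory:
        cmd += ["--executor-memory", executor_memory]
    cmd.append(script)
    return cmd


command = build_spark_submit(
    script="/home/twitterConsumer.py",
    master="spark://spark:7077",
    packages=[
        "org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.1",
        "com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.4",
    ],
    driver_memory="8g",    # placeholder value, adjust to the available RAM
    executor_memory="4g",  # placeholder value, must fit within SPARK_WORKER_MEMORY
)
print(shlex.join(command))
```

The printed line can be pasted into a shell, or the list can be handed to `subprocess.run` directly.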
Description
I am trying to download the pre-trained finBERT model, but I get this error: py4j.protocol.Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize.
code
sequenceClassifier = BertForSequenceClassification \
    .pretrained("bert_sequence_classifier_finbert", "en") \
    .setInputCols(['token', 'document']) \
    .setOutputCol('class') \
    .setCaseSensitive(True) \
    .setMaxSentenceLength(512)
I execute the script with: spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.1,com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.4 --master spark://spark:7077 /home/twitterConsumer.py
I don't know what the reason could be. Can anyone help me? I would appreciate a quick response.
Error
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize.
: java.lang.NoClassDefFoundError: org/json4s/package$MappingException
    at org.json4s.ext.EnumNameSerializer.deserialize(EnumSerializer.scala:53)
    at org.json4s.Formats$$anonfun$customDeserializer$1.applyOrElse(Formats.scala:66)
    at org.json4s.Formats$$anonfun$customDeserializer$1.applyOrElse(Formats.scala:66)
    at scala.collection.TraversableOnce.collectFirst(TraversableOnce.scala:180)
    at scala.collection.TraversableOnce.collectFirst$(TraversableOnce.scala:167)
    at scala.collection.AbstractTraversable.collectFirst(Traversable.scala:108)
    at org.json4s.Formats$.customDeserializer(Formats.scala:66)
    at org.json4s.Extraction$.customOrElse(Extraction.scala:775)
    at org.json4s.Extraction$.extract(Extraction.scala:454)
    at org.json4s.Extraction$.extract(Extraction.scala:56)
    at org.json4s.ExtractableJsonAstNode.extract(ExtractableJsonAstNode.scala:22)
    at com.johnsnowlabs.util.JsonParser$.parseObject(JsonParser.scala:28)
    at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$.parseJson(ResourceMetadata.scala:106)
    at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$$anonfun$readResources$1.applyOrElse(ResourceMetadata.scala:138)
    at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$$anonfun$readResources$1.applyOrElse(ResourceMetadata.scala:136)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
    at scala.collection.Iterator$$anon$13.next(Iterator.scala:593)
    at scala.collection.Iterator.foreach(Iterator.scala:943)
    at scala.collection.Iterator.foreach$(Iterator.scala:943)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
    at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
    at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
    at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:184)
    at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:47)
    at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
    at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
    at scala.collection.AbstractIterator.to(Iterator.scala:1431)
    at scala.collection.TraversableOnce.toList(TraversableOnce.scala:350)
    at scala.collection.TraversableOnce.toList$(TraversableOnce.scala:350)
    at scala.collection.AbstractIterator.toList(Iterator.scala:1431)
    at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$.readResources(ResourceMetadata.scala:136)
    at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$.readResources(ResourceMetadata.scala:130)
    at com.johnsnowlabs.client.aws.AWSGateway.getMetadata(AWSGateway.scala:94)
    at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.downloadMetadataIfNeed(S3ResourceDownloader.scala:65)
    at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.resolveLink(S3ResourceDownloader.scala:70)
    at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.getDownloadSize(S3ResourceDownloader.scala:204)
    at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.getDownloadSize(ResourceDownloader.scala:544)
    at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.getDownloadSize(ResourceDownloader.scala:714)
    at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize(ResourceDownloader.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.ClassNotFoundException: org.json4s.package$MappingException
    at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
Your Environment
Spark NLP version: 3.4.4
Apache Spark version: 3.2.1
Java version: openjdk version "1.8.0_332"