JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0

Json4s parse error on ResourceMetadata while running a few models in spark-nlp #14327

Open nimesh1601 opened 3 months ago

nimesh1601 commented 3 months ago

Is there an existing issue for this?

Who can help?

No response

What are you working on?

Trying out an example similar to https://sparknlp.org/api/com/johnsnowlabs/nlp/embeddings/BertSentenceEmbedding

Current Behavior

We are getting a json4s exception while spark-nlp tries to fetch resource metadata. Exception stack trace:

py4j.protocol.Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize.
: org.json4s.MappingException: Parsed JSON values do not match with class constructor
args=
arg types=
executable=Executable(Constructor(public com.johnsnowlabs.nlp.pretrained.ResourceMetadata(java.lang.String,scala.Option,scala.Option,scala.Option,boolean,java.sql.Timestamp,boolean,scala.Option,java.lang.String,scala.Option)))
cause=wrong number of arguments
types comparison result=MISSING(java.lang.String),MISSING(scala.Option),MISSING(scala.Option),MISSING(scala.Option),MISSING(boolean),MISSING(java.sql.Timestamp),MISSING(boolean),MISSING(scala.Option),MISSING(java.lang.String),MISSING(scala.Option)
    at org.json4s.reflect.package$.fail(package.scala:53)
    at org.json4s.Extraction$ClassInstanceBuilder.instantiate(Extraction.scala:724)
    at org.json4s.Extraction$ClassInstanceBuilder.result(Extraction.scala:767)
    at org.json4s.Extraction$.$anonfun$extract$10(Extraction.scala:462)
    at org.json4s.Extraction$.$anonfun$customOrElse$1(Extraction.scala:780)
    at scala.PartialFunction.applyOrElse(PartialFunction.scala:127)
    at scala.PartialFunction.applyOrElse$(PartialFunction.scala:126)
    at scala.PartialFunction$$anon$1.applyOrElse(PartialFunction.scala:257)
    at org.json4s.Extraction$.customOrElse(Extraction.scala:780)
    at org.json4s.Extraction$.extract(Extraction.scala:454)
    at org.json4s.Extraction$.extract(Extraction.scala:56)
    at org.json4s.ExtractableJsonAstNode.extract(ExtractableJsonAstNode.scala:22)
    at com.johnsnowlabs.util.JsonParser$.parseObject(JsonParser.scala:28)
    at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$.parseJson(ResourceMetadata.scala:104)
    at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$$anonfun$readResources$1.applyOrElse(ResourceMetadata.scala:136)
    at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$$anonfun$readResources$1.applyOrElse(ResourceMetadata.scala:134)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
    at scala.collection.Iterator$$anon$13.next(Iterator.scala:593)
    at scala.collection.Iterator.foreach(Iterator.scala:943)
    at scala.collection.Iterator.foreach$(Iterator.scala:943)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
    at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
    at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
    at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:184)
    at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:47)
    at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
    at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
    at scala.collection.AbstractIterator.to(Iterator.scala:1431)
    at scala.collection.TraversableOnce.toList(TraversableOnce.scala:350)
    at scala.collection.TraversableOnce.toList$(TraversableOnce.scala:350)
    at scala.collection.AbstractIterator.toList(Iterator.scala:1431)
    at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$.readResources(ResourceMetadata.scala:134)
    at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$.readResources(ResourceMetadata.scala:128)
    at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.downloadMetadataIfNeed(S3ResourceDownloader.scala:58)
    at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.resolveLink(S3ResourceDownloader.scala:69)
    at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.getDownloadSize(S3ResourceDownloader.scala:228)
    at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.getDownloadSize(ResourceDownloader.scala:562)
    at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.getDownloadSize(ResourceDownloader.scala:782)
    at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize(ResourceDownloader.scala)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.base/java.lang.Thread.run(Thread.java:829)

Expected Behavior

The model runs successfully.

Steps To Reproduce

Run https://sparknlp.org/api/com/johnsnowlabs/nlp/embeddings/BertSentenceEmbeddings example

Spark NLP version and Apache Spark

spark-nlp version: 5.3.3
Spark version: 3.3.2
Python version: 3.9
Scala version: 2.12

Type of Spark Application

Python Application

Java Version

jdk-11

Java Home Directory

No response

Setup and installation

No response

Operating System and Version

No response

Link to your project (if available)

No response

Additional Information

Other packages installed via pip

maziyarpanahi commented 3 months ago

Please provide your full code, preferably in Colab, so we can reproduce it.

olcayc commented 3 months ago

Hi Maziyar, here is a minimal script that recreates what we're doing. The error happens when spark-nlp tries to download a model. The same flow worked correctly for us under Spark 3.0, but it is failing under the Spark 3.3 environment.

import sparknlp
from sparknlp.base import EmbeddingsFinisher, DocumentAssembler
from sparknlp.common import AnnotatorType
from sparknlp.annotator import E5Embeddings
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession

# Initialize Spark Session
spark = SparkSession.builder \
    .appName("Spark NLP Example") \
    .getOrCreate()

spark.sparkContext.setCheckpointDir("/path/to/checkpoint/dir")

# input_df is a dataframe with column 'text' containing text to embed
input_df = ...

# Build pipeline
documentAssembler = (
    DocumentAssembler().setInputCol("text").setOutputCol("document")
)

embeddings = E5Embeddings.pretrained()

embeddingsFinisher = (
    EmbeddingsFinisher()
    .setInputCols(["sentence_embeddings"])
    .setOutputCols("unpooled_embeddings")
    .setOutputAsVector(True)
    .setCleanAnnotations(False)
)

embeddings = embeddings.setInputCols(["document"]).setOutputCol(
    "sentence_embeddings"
)
pipeline = Pipeline().setStages(
    [documentAssembler, embeddings, embeddingsFinisher]
)

input_df = input_df.repartition(400).checkpoint()

result_df = pipeline.fit(input_df).transform(input_df).checkpoint()
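Since the same pipeline worked on Spark 3.0 but fails on 3.3, one thing worth ruling out is a json4s version clash: Spark 3.0 and 3.3 bundle different json4s releases, and a mismatch with the version spark-nlp was compiled against can produce exactly this kind of MappingException. Below is a small standard-library helper (a sketch; the /opt/spark path is only a placeholder) to list the json4s jars in a Spark distribution:

```python
import os
import re

def find_json4s_jars(jars_dir):
    """Return the json4s jar filenames found under a Spark jars directory."""
    if not os.path.isdir(jars_dir):
        return []
    return sorted(f for f in os.listdir(jars_dir)
                  if re.search(r"json4s", f, re.IGNORECASE))

# Point this at your real install, e.g. os.path.join(os.environ["SPARK_HOME"], "jars")
print(find_json4s_jars("/opt/spark/jars"))
```

Comparing the versions listed there against the json4s version the spark-nlp jar expects would confirm or rule out the clash.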

olcayc commented 3 months ago

@maziyarpanahi As per the code snippet above, we are not doing anything particularly complex, just generating some embeddings. We get the same error with other pretrained models as well. The code worked under Spark 3.0, but now we are getting this JSON4s parsing error under Spark 3.3.

Is spark-nlp 5.3.3 tested under PySpark 3.3.2, JVM/JRE 11, Scala 2.12, and Python 3.9? What is the closest configuration that you have tested successfully on your side?
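For reference, here is a short standard-library snippet to capture the exact locally installed versions (the package names are the PyPI ones; adjust if you install differently):

```python
from importlib import metadata

# Report the installed versions of the packages relevant to this issue
for pkg in ("spark-nlp", "pyspark"):
    try:
        print(pkg, metadata.version(pkg))
    except metadata.PackageNotFoundError:
        print(pkg, "not installed")
```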

Siddharth-Latthe-07 commented 1 month ago

This exception typically occurs when the JSON data being parsed does not match the format expected by the ResourceMetadata class constructor. This could be due to missing or extra fields, incorrect data types, or changes in the JSON structure. Here are some steps that might help; let me know if they don't:

  1. Check the JSON response
  2. Verify the class constructor
  3. Update your Spark NLP version
  4. Add custom parsing logic
  5. Inspect the response; sample code:

from pyspark.sql import SparkSession
import json
import requests

# Initialize Spark session
spark = SparkSession.builder \
    .appName("SparkNLPExample") \
    .getOrCreate()

# Function to log a JSON response
def log_json_response(resource_url):
    response = requests.get(resource_url)
    if response.status_code == 200:
        print(json.dumps(response.json(), indent=4))
    else:
        print(f"Failed to fetch resource: {response.status_code}")

# Example resource URL (replace with the actual URL you are using)
resource_url = "https://sparknlp.org/api/com/johnsnowlabs/nlp/embeddings/BertSentenceEmbeddings"
log_json_response(resource_url)
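To make the failure mode concrete: json4s maps JSON fields to constructor arguments by name, so a metadata entry whose keys don't line up with the constructor raises instead of being silently ignored. A simplified Python analogue (a sketch, not Spark NLP's actual parser; the field names here are invented):

```python
import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class ResourceMetadata:
    # Simplified stand-in; the real Scala case class has more fields
    name: str
    language: Optional[str]
    readyToUse: bool

def parse_metadata_line(line):
    # Like json4s extract[ResourceMetadata]: constructor args come from JSON keys
    return ResourceMetadata(**json.loads(line))

ok = parse_metadata_line('{"name": "e5_small", "language": "en", "readyToUse": true}')
print(ok.name)  # e5_small

try:
    parse_metadata_line('{"name": "e5_small"}')  # missing fields
except TypeError as err:
    # Analogous to json4s' "wrong number of arguments" MappingException
    print("parse failed:", err)
```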


Hope this helps,
Thanks
maziyarpanahi commented 1 month ago

I am not sure what causes this, but please test the latest 5.4.1 release instead, just in case. This is a pretty simple setup with your minimal code; it works without worrying about any of those versions:

https://colab.research.google.com/drive/1qgD75n8KcSf5ehkZ7obpDTOls_17fiKJ?usp=sharing