Open nimesh1601 opened 5 months ago
Please provide your full code, preferably in Colab, so we can reproduce it.
Hi Maziyar, this is a minimal script to recreate what we're doing. The error happens when sparknlp tries to download a model. This same flow worked correctly for us under Spark 3.0, but somehow it is failing under the Spark 3.3 environment
import sparknlp
from sparknlp.base import EmbeddingsFinisher, DocumentAssembler
from sparknlp.common import AnnotatorType
from sparknlp.annotator import E5Embeddings
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder \
    .appName("Spark NLP Example") \
    .getOrCreate()
spark.sparkContext.setCheckpointDir("/path/to/checkpoint/dir")

# input_df is a DataFrame with a column 'text' containing the text to embed
input_df = ...

# Build the pipeline
documentAssembler = (
    DocumentAssembler().setInputCol("text").setOutputCol("document")
)
embeddings = E5Embeddings.pretrained()
embeddings = embeddings.setInputCols(["document"]).setOutputCol(
    "sentence_embeddings"
)
embeddingsFinisher = (
    EmbeddingsFinisher()
    .setInputCols(["sentence_embeddings"])
    .setOutputCols("unpooled_embeddings")
    .setOutputAsVector(True)
    .setCleanAnnotations(False)
)
pipeline = Pipeline().setStages(
    [documentAssembler, embeddings, embeddingsFinisher]
)

input_df = input_df.repartition(400).checkpoint()
result_df = pipeline.fit(input_df).transform(input_df).checkpoint()
@maziyarpanahi As per the code snippet above, we are not doing anything particularly complex, just generating some embeddings. We get the same error with other pretrained models as well. The code worked under Spark 3.0, but now we are getting this JSON4s parsing error under Spark 3.3.
Is spark-nlp 5.3.3 tested under PySpark 3.3.2, JVM/JRE 11, Scala 2.12, and Python 3.9? What's the closest configuration that you've tested successfully on your side?
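(Side note on comparing configurations: version strings should be compared numerically, not lexically, or "3.10" sorts before "3.9". A minimal sketch; the helper names are illustrative and not part of the Spark NLP API:)

```python
# Hypothetical helpers (illustrative only, not Spark NLP API): compare
# dotted version strings numerically to check a supported range.
def version_tuple(v):
    # "3.3.2" -> (3, 3, 2), so comparisons are numeric per component
    return tuple(int(part) for part in v.split("."))

def in_range(version, minimum, maximum):
    # Inclusive range check using tuple comparison
    return version_tuple(minimum) <= version_tuple(version) <= version_tuple(maximum)

# The Spark version from this thread, checked against the 3.x line:
print(in_range("3.3.2", "3.0.0", "3.5.0"))  # True
print(in_range("2.4.8", "3.0.0", "3.5.0"))  # False
```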
This exception typically occurs when the JSON data being parsed does not match the format expected by the ResourceMetadata class constructor. This could be due to missing or extra fields, incorrect data types, or changes in the JSON structure. Here are some steps that might help; let me know if they don't:
from pyspark.sql import SparkSession
import json

spark = SparkSession.builder \
    .appName("SparkNLPExample") \
    .getOrCreate()

def log_json_response(resource_url):
    import requests
    response = requests.get(resource_url)
    if response.status_code == 200:
        print(json.dumps(response.json(), indent=4))
    else:
        print(f"Failed to fetch resource: {response.status_code}")

resource_url = "https://sparknlp.org/api/com/johnsnowlabs/nlp/embeddings/BertSentenceEmbeddings"
log_json_response(resource_url)
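On the metadata point: a parse failure of this kind usually means required fields are absent or mistyped in the downloaded record. A minimal, self-contained illustration with Python's stdlib json (the field names here are made up for the sketch and are not the actual ResourceMetadata schema):

```python
import json

# Hypothetical required fields -- illustrative only, not the real
# ResourceMetadata schema used by Spark NLP.
REQUIRED_FIELDS = {"name", "language", "libVersion", "sparkVersion"}

def missing_fields(raw_json):
    """Parse a metadata record and report which required fields are absent."""
    record = json.loads(raw_json)
    return REQUIRED_FIELDS - record.keys()

complete = '{"name": "model_x", "language": "en", "libVersion": "5.3.3", "sparkVersion": "3.3"}'
partial = '{"name": "model_x", "language": "en"}'

print(missing_fields(complete))  # set() -> record parses cleanly
print(missing_fields(partial))   # reports the two absent fields
```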
Hope this helps,
Thanks
I am not sure what causes this, but please test the latest 5.4.1 release instead, just in case. This is a pretty simple setup; with your minimal code it works without worrying about any of those versions:
https://colab.research.google.com/drive/1qgD75n8KcSf5ehkZ7obpDTOls_17fiKJ?usp=sharing
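For anyone upgrading, pinning both packages in one step keeps the Python package and the Scala-side jar from drifting apart (a setup fragment; the version numbers are the ones mentioned in this thread, so adjust to the latest release as needed):

```shell
# Upgrade spark-nlp and pyspark together; the jar that sparknlp.start()
# pulls must match the Python package version, so avoid upgrading only one.
pip install --upgrade spark-nlp==5.4.1 pyspark==3.3.2
```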
Is there an existing issue for this?
Who can help?
No response
What are you working on?
Trying out an example similar to https://sparknlp.org/api/com/johnsnowlabs/nlp/embeddings/BertSentenceEmbeddings
Current Behavior
We are getting a json4s exception while spark-nlp is trying to fetch resource metadata. Exception stacktrace:
Expected Behavior
Model runs successfully
Steps To Reproduce
Run https://sparknlp.org/api/com/johnsnowlabs/nlp/embeddings/BertSentenceEmbeddings example
Spark NLP version and Apache Spark
spark-nlp version - 5.3.3
Spark version - 3.3.2
Python version - 3.9
Scala version - 2.12
Type of Spark Application
Python Application
Java Version
jdk-11
Java Home Directory
No response
Setup and installation
No response
Operating System and Version
No response
Link to your project (if available)
No response
Additional Information
Other packages installed via pip