yukah1 commented 4 months ago

Is there an existing issue for this?

[X] I have searched the existing issues and did not find a match.

Who can help?

No response

What are you working on?

Crash when trying to use a pre-trained model for embeddings in AWS EMR Studio Workspaces.

Current Behavior

py4j.protocol.Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel.
: java.lang.UnsatisfiedLinkError: /mnt2/yarn/usercache/livy/appcache/application_1709534930909_0001/container_1709534930909_0001_01_000001/tmp/onnxruntime-java7731626932550009587/libonnxruntime.so: /lib64/libm.so.6: version `GLIBC_2.27' not found (required by /mnt2/yarn/usercache/livy/appcache/application_1709534930909_0001/container_1709534930909_0001_01_000001/tmp/onnxruntime-java7731626932550009587/libonnxruntime.so)
    at java.lang.ClassLoader$NativeLibrary.load(Native Method)
    at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1934)
    at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1817)
    at java.lang.Runtime.load0(Runtime.java:782)
    at java.lang.System.load(System.java:1100)
    at ai.onnxruntime.OnnxRuntime.load(OnnxRuntime.java:365)
    at ai.onnxruntime.OnnxRuntime.init(OnnxRuntime.java:156)
    at ai.onnxruntime.OrtEnvironment.(OrtEnvironment.java:33)
    at com.johnsnowlabs.ml.onnx.OnnxSession.getSessionOptions(OnnxSession.scala:30)
    at com.johnsnowlabs.ml.onnx.OnnxWrapper$.read(OnnxWrapper.scala:124)
    at com.johnsnowlabs.ml.onnx.ReadOnnxModel.readOnnxModel(OnnxSerializeModel.scala:115)
    at com.johnsnowlabs.ml.onnx.ReadOnnxModel.readOnnxModel$(OnnxSerializeModel.scala:83)
    at com.johnsnowlabs.nlp.embeddings.E5Embeddings$.readOnnxModel(E5Embeddings.scala:477)
    at com.johnsnowlabs.nlp.embeddings.ReadE5DLModel.readModel(E5Embeddings.scala:416)
    at com.johnsnowlabs.nlp.embeddings.ReadE5DLModel.readModel$(E5Embeddings.scala:407)
    at com.johnsnowlabs.nlp.embeddings.E5Embeddings$.readModel(E5Embeddings.scala:477)
    at com.johnsnowlabs.nlp.embeddings.ReadE5DLModel.$anonfun$$init$$1(E5Embeddings.scala:425)
    at com.johnsnowlabs.nlp.embeddings.ReadE5DLModel.$anonfun$$init$$1$adapted(E5Embeddings.scala:425)
    at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$onRead$1(ParamsAndFeaturesReadable.scala:50)
    at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$onRead$1$adapted(ParamsAndFeaturesReadable.scala:49)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.onRead(ParamsAndFeaturesReadable.scala:49)
    at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$read$1(ParamsAndFeaturesReadable.scala:61)
    at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$read$1$adapted(ParamsAndFeaturesReadable.scala:61)
    at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:38)
    at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:24)
    at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:515)
    at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:507)
    at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.downloadModel(ResourceDownloader.scala:713)
    at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel(ResourceDownloader.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:750)

Expected Behavior

A Spark DataFrame containing the embedding vectors should be returned.

Steps To Reproduce

1. Create EMR cluster

Settings:
Amazon EMR version: emr-6.15.0
Installed applications: Hadoop 3.3.6, Hive 3.1.3, JupyterEnterpriseGateway 2.6.0, Livy 0.7.1, Spark 3.4.1
Instances: Primary: m5.xlarge, Core: m7i instances
Config & bootstrap actions: as described at https://github.com/JohnSnowLabs/spark-nlp#emr-cluster, plus additional settings such as the number of cores and driver memory

2. Attach the cluster to a notebook

Start the Notebook in EMR Studio and attach the cluster we started.

3. Run the code

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
embeddings = E5Embeddings.pretrained("e5_base", "en") \
            .setInputCols(["documents"]) \
            .setOutputCol("instructor")

reference: https://sparknlp.org/docs/en/transformers#e5embeddings

Note

The "explain_document_ml" sample and the Doc2Vec annotators seem to work correctly.

- explain_document_ml

from sparknlp.pretrained import PretrainedPipeline

explain_document_pipeline = PretrainedPipeline("explain_document_ml")
annotations = explain_document_pipeline.annotate("We are very happy about SparkNLP")
print(annotations)

- Doc2Vec

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddings = Doc2VecApproach() \
    .setInputCols(["token"]) \
    .setOutputCol("embeddings")

pipeline = Pipeline() \
    .setStages([
        documentAssembler,
        tokenizer,
        embeddings
    ])

data = spark.createDataFrame([["This is a sentence."]]).toDF("text")
result = pipeline.fit(data).transform(data)



### Spark NLP version and Apache Spark

sparknlp.version: 5.3.0
spark.version: 3.4.1-amzn-2

### Type of Spark Application

Python Application

### Java Version

_No response_

### Java Home Directory

_No response_

### Setup and installation

_No response_

### Operating System and Version

_No response_

### Link to your project (if available)

_No response_

### Additional Information

_No response_

danilojsl commented 3 months ago

Hi @yukah1

The error you're encountering stems from compatibility issues between ONNX's native libraries and GLIBC, as indicated by the error message. The default AMI instances in EMR 6.15.0 come with older versions of GLIBC. To address this issue, you have two options:

1. Create a custom AMI with GLIBC version 2.27 or newer and use it when configuring your EMR cluster.
2. Upgrade to the latest EMR cluster version, 7.0.0. I've personally tested this version, and it works flawlessly.
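
Before choosing between these options, it may help to confirm which GLIBC version a node actually has. The following is only a minimal sketch, not part of Spark NLP: the helper names `parse_glibc_version` and `detect_glibc_version` are hypothetical, and the `(2, 27)` threshold is taken from the `GLIBC_2.27` string in the error message. It assumes the cell runs in a Python process on the cluster node you want to inspect.

```python
import os
import platform


def parse_glibc_version(text):
    """Turn a string such as 'glibc 2.26' or '2.26' into a tuple like (2, 26)."""
    version = text.split()[-1]
    return tuple(int(part) for part in version.split(".")[:2])


def detect_glibc_version():
    """Return the glibc version of the current host as a tuple, or None if unknown."""
    try:
        # On glibc-based Linux this returns a string such as 'glibc 2.26'.
        return parse_glibc_version(os.confstr("CS_GNU_LIBC_VERSION"))
    except (AttributeError, ValueError, OSError):
        # Not a glibc-based system, or confstr is unavailable; try the fallback.
        pass
    lib, version = platform.libc_ver()
    if lib == "glibc" and version:
        return parse_glibc_version(version)
    return None


# Assumption: threshold taken from the "GLIBC_2.27 not found" error message.
REQUIRED = (2, 27)

found = detect_glibc_version()
if found is None:
    print("Could not determine the glibc version on this host")
elif found >= REQUIRED:
    print("glibc %d.%d should satisfy the GLIBC_2.27 requirement" % found)
else:
    print("glibc %d.%d is older than 2.27; libonnxruntime.so will likely fail to load" % found)
```

This only inspects the host where the Python process runs. Checking the executors as well would require running the same function inside a Spark task.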

Hope this helps!

yukah1 commented 3 months ago

I confirmed it works well. Thank you for your help!

JohnSnowLabs / spark-nlp

Show an error of 'GLIBC_2.27 not found' when pretrained model download in AWS EMR #14193