JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0

'GLIBC_2.27 not found' error when downloading a pretrained model in AWS EMR #14193

Closed yukah1 closed 3 months ago

yukah1 commented 4 months ago

Is there an existing issue for this?

Who can help?

No response

What are you working on?

A crash occurs when trying to use a pretrained embedding model in AWS EMR Studio Workspaces.

Current Behavior

py4j.protocol.Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel.
: java.lang.UnsatisfiedLinkError: /mnt2/yarn/usercache/livy/appcache/application_1709534930909_0001/container_1709534930909_0001_01_000001/tmp/onnxruntime-java7731626932550009587/libonnxruntime.so: /lib64/libm.so.6: version `GLIBC_2.27' not found (required by /mnt2/yarn/usercache/livy/appcache/application_1709534930909_0001/container_1709534930909_0001_01_000001/tmp/onnxruntime-java7731626932550009587/libonnxruntime.so)
    at java.lang.ClassLoader$NativeLibrary.load(Native Method)
    at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1934)
    at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1817)
    at java.lang.Runtime.load0(Runtime.java:782)
    at java.lang.System.load(System.java:1100)
    at ai.onnxruntime.OnnxRuntime.load(OnnxRuntime.java:365)
    at ai.onnxruntime.OnnxRuntime.init(OnnxRuntime.java:156)
    at ai.onnxruntime.OrtEnvironment.(OrtEnvironment.java:33)
    at com.johnsnowlabs.ml.onnx.OnnxSession.getSessionOptions(OnnxSession.scala:30)
    at com.johnsnowlabs.ml.onnx.OnnxWrapper$.read(OnnxWrapper.scala:124)
    at com.johnsnowlabs.ml.onnx.ReadOnnxModel.readOnnxModel(OnnxSerializeModel.scala:115)
    at com.johnsnowlabs.ml.onnx.ReadOnnxModel.readOnnxModel$(OnnxSerializeModel.scala:83)
    at com.johnsnowlabs.nlp.embeddings.E5Embeddings$.readOnnxModel(E5Embeddings.scala:477)
    at com.johnsnowlabs.nlp.embeddings.ReadE5DLModel.readModel(E5Embeddings.scala:416)
    at com.johnsnowlabs.nlp.embeddings.ReadE5DLModel.readModel$(E5Embeddings.scala:407)
    at com.johnsnowlabs.nlp.embeddings.E5Embeddings$.readModel(E5Embeddings.scala:477)
    at com.johnsnowlabs.nlp.embeddings.ReadE5DLModel.$anonfun$$init$$1(E5Embeddings.scala:425)
    at com.johnsnowlabs.nlp.embeddings.ReadE5DLModel.$anonfun$$init$$1$adapted(E5Embeddings.scala:425)
    at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$onRead$1(ParamsAndFeaturesReadable.scala:50)
    at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$onRead$1$adapted(ParamsAndFeaturesReadable.scala:49)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.onRead(ParamsAndFeaturesReadable.scala:49)
    at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$read$1(ParamsAndFeaturesReadable.scala:61)
    at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$read$1$adapted(ParamsAndFeaturesReadable.scala:61)
    at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:38)
    at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:24)
    at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:515)
    at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:507)
    at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.downloadModel(ResourceDownloader.scala:713)
    at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel(ResourceDownloader.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:750)
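
The key part of the trace is the `UnsatisfiedLinkError`: `libonnxruntime.so` asks for the `GLIBC_2.27` symbol version, and the node's `/lib64/libm.so.6` does not provide it. A minimal sketch for comparing the required version with the one installed is shown here. The helper names (`parse_version`, `required_glibc`, `system_glibc`) are hypothetical and only for illustration, and `platform.libc_ver()` reports the C library of the Python interpreter that runs the snippet, which is assumed here to be on the same node as the JVM that loads ONNX Runtime.

```python
import platform
import re


def parse_version(text):
    """Turn a dotted version such as '2.27' into a comparable tuple (2, 27)."""
    return tuple(int(part) for part in text.split("."))


def required_glibc(error_message):
    """Return the highest GLIBC_x.y version named in the message, or None."""
    found = re.findall(r"GLIBC_(\d+(?:\.\d+)+)", error_message)
    return max((parse_version(v) for v in found), default=None)


def system_glibc():
    """Return the glibc version of the running interpreter, or None if unknown."""
    lib, version = platform.libc_ver()
    return parse_version(version) if lib == "glibc" and version else None


# Excerpt of the error message from the trace above.
message = "/lib64/libm.so.6: version `GLIBC_2.27' not found"

needed = required_glibc(message)
available = system_glibc()
print("required:", needed, "available:", available)
if needed and available and available < needed:
    print("glibc on this node is older than the native library requires")
```

Running this in a notebook cell on the cluster gives a quick indication of whether the node's glibc is older than 2.27; `ldd --version` on the node is an alternative check.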

Expected Behavior

A Spark DataFrame containing the embedding vectors should be returned.

Steps To Reproduce

1. Create EMR cluster

2. Attach the cluster to a notebook

Start the Notebook in EMR Studio and attach the cluster we started.

3. Run the code

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
embeddings = E5Embeddings.pretrained("e5_base", "en") \
    .setInputCols(["documents"]) \
    .setOutputCol("instructor")

reference: https://sparknlp.org/docs/en/transformers#e5embeddings

Note

The "explain_document_ml" sample and the Doc2Vec annotators seem to work correctly.

from sparknlp.pretrained import PretrainedPipeline

explain_document_pipeline = PretrainedPipeline("explain_document_ml")
annotations = explain_document_pipeline.annotate("We are very happy about SparkNLP")
print(annotations)

- Doc2Vec

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddings = Doc2VecApproach() \
    .setInputCols(["token"]) \
    .setOutputCol("embeddings")

pipeline = Pipeline() \
    .setStages([
        documentAssembler,
        tokenizer,
        embeddings
    ])

data = spark.createDataFrame([["This is a sentence."]]).toDF("text")
result = pipeline.fit(data).transform(data)



### Spark NLP version and Apache Spark

sparknlp.version: 5.3.0
spark.version: 3.4.1-amzn-2

### Type of Spark Application

Python Application

### Java Version

_No response_

### Java Home Directory

_No response_

### Setup and installation

_No response_

### Operating System and Version

_No response_

### Link to your project (if available)

_No response_

### Additional Information

_No response_
danilojsl commented 3 months ago

Hi @yukah1

The error you're encountering stems from compatibility issues between ONNX's native libraries and GLIBC, as indicated by the error message. The default AMI instances in EMR 6.15.0 come with older versions of GLIBC. To address this issue, you have two options:

Hope this helps!

yukah1 commented 3 months ago

I confirmed it works well. Thank you for your help!