Closed yukah1 closed 3 months ago
Hi @yukah1
The error you're encountering stems from compatibility issues between ONNX's native libraries and GLIBC, as indicated by the error message. The default AMI instances in EMR 6.15.0 come with older versions of GLIBC. To address this issue, you have two options:
Hope this helps!
I confirmed it works well. Thank you for your help!
Is there an existing issue for this?
Who can help?
No response
What are you working on?
Crash when trying to use a pre-trained model for embeddings in AWS EMR Studio Workspaces.
Current Behavior
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel.
: java.lang.UnsatisfiedLinkError: /mnt2/yarn/usercache/livy/appcache/application_1709534930909_0001/container_1709534930909_0001_01_000001/tmp/onnxruntime-java7731626932550009587/libonnxruntime.so: /lib64/libm.so.6: version `GLIBC_2.27' not found (required by /mnt2/yarn/usercache/livy/appcache/application_1709534930909_0001/container_1709534930909_0001_01_000001/tmp/onnxruntime-java7731626932550009587/libonnxruntime.so)
at java.lang.ClassLoader$NativeLibrary.load(Native Method)
at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1934)
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1817)
at java.lang.Runtime.load0(Runtime.java:782)
at java.lang.System.load(System.java:1100)
at ai.onnxruntime.OnnxRuntime.load(OnnxRuntime.java:365)
at ai.onnxruntime.OnnxRuntime.init(OnnxRuntime.java:156)
at ai.onnxruntime.OrtEnvironment.(OrtEnvironment.java:33)
at com.johnsnowlabs.ml.onnx.OnnxSession.getSessionOptions(OnnxSession.scala:30)
at com.johnsnowlabs.ml.onnx.OnnxWrapper$.read(OnnxWrapper.scala:124)
at com.johnsnowlabs.ml.onnx.ReadOnnxModel.readOnnxModel(OnnxSerializeModel.scala:115)
at com.johnsnowlabs.ml.onnx.ReadOnnxModel.readOnnxModel$(OnnxSerializeModel.scala:83)
at com.johnsnowlabs.nlp.embeddings.E5Embeddings$.readOnnxModel(E5Embeddings.scala:477)
at com.johnsnowlabs.nlp.embeddings.ReadE5DLModel.readModel(E5Embeddings.scala:416)
at com.johnsnowlabs.nlp.embeddings.ReadE5DLModel.readModel$(E5Embeddings.scala:407)
at com.johnsnowlabs.nlp.embeddings.E5Embeddings$.readModel(E5Embeddings.scala:477)
at com.johnsnowlabs.nlp.embeddings.ReadE5DLModel.$anonfun$$init$$1(E5Embeddings.scala:425)
at com.johnsnowlabs.nlp.embeddings.ReadE5DLModel.$anonfun$$init$$1$adapted(E5Embeddings.scala:425)
at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$onRead$1(ParamsAndFeaturesReadable.scala:50)
at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$onRead$1$adapted(ParamsAndFeaturesReadable.scala:49)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.onRead(ParamsAndFeaturesReadable.scala:49)
at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$read$1(ParamsAndFeaturesReadable.scala:61)
at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$read$1$adapted(ParamsAndFeaturesReadable.scala:61)
at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:38)
at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:24)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:515)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:507)
at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.downloadModel(ResourceDownloader.scala:713)
at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel(ResourceDownloader.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:750)
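The `GLIBC_2.27' not found` message in the trace above means the bundled `libonnxruntime.so` was linked against a newer glibc than the node provides. As a rough diagnostic, the following sketch parses the required version out of such a message and compares it with the glibc seen by the current Python process. The helper names (`required_glibc`, `installed_glibc`, `glibc_is_sufficient`) are hypothetical and not part of Spark NLP or ONNX Runtime; only the standard-library calls are real. It also assumes the snippet runs on the same node (driver or executor) where the native library is loaded, which may not hold for every notebook setup.

```python
import platform
import re


def required_glibc(error_message):
    """Return the highest GLIBC (major, minor) named in a linker error, or None."""
    versions = [
        (int(major), int(minor))
        for major, minor in re.findall(r"GLIBC_(\d+)\.(\d+)", error_message)
    ]
    return max(versions) if versions else None


def installed_glibc():
    """Return the (major, minor) glibc version of this Python process, or None."""
    lib, version = platform.libc_ver()
    match = re.match(r"(\d+)\.(\d+)", version)
    if lib != "glibc" or match is None:
        return None  # not a glibc-based system, or version could not be detected
    return (int(match.group(1)), int(match.group(2)))


def glibc_is_sufficient(installed, required):
    """True if the installed glibc is at least the required version."""
    if required is None:
        return True  # nothing specific was requested
    return installed is not None and installed >= required


# Fragment of the error message from the stack trace above.
message = "/lib64/libm.so.6: version `GLIBC_2.27' not found"
needed = required_glibc(message)
found = installed_glibc()
print("required:", needed, "installed:", found, "ok:", glibc_is_sufficient(found, needed))
```

If the last field prints `ok: False` on the cluster node, that would be consistent with the `UnsatisfiedLinkError` shown above; it does not by itself say which remedy to apply.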
Expected Behavior
A Spark DataFrame containing the embedding vectors should be returned.
Steps To Reproduce
1. Create EMR cluster
Amazon EMR version: emr-6.15.0
Installed applications: Hadoop 3.3.6, Hive 3.1.3, JupyterEnterpriseGateway 2.6.0, Livy 0.7.1, Spark 3.4.1
Instances: Primary: m5.xlarge; Core: m7i instances
Config and bootstrap actions: as described at https://github.com/JohnSnowLabs/spark-nlp#emr-cluster, plus settings for the number of cores, driver memory, and others
2. Attach the cluster to a notebook
Start the Notebook in EMR Studio and attach the cluster we started.
3. Run the code
reference: https://sparknlp.org/docs/en/transformers#e5embeddings
Note
The "explain_document_ml" sample and the Doc2Vec annotators seem to work correctly.
from sparknlp.pretrained import PretrainedPipeline

explain_document_pipeline = PretrainedPipeline("explain_document_ml")
annotations = explain_document_pipeline.annotate("We are very happy about SparkNLP")
print(annotations)
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddings = Doc2VecApproach() \
    .setInputCols(["token"]) \
    .setOutputCol("embeddings")

pipeline = Pipeline() \
    .setStages([
        documentAssembler,
        tokenizer,
        embeddings
    ])

data = spark.createDataFrame([["This is a sentence."]]).toDF("text")
result = pipeline.fit(data).transform(data)