JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0

ONNX models crash when they are used in Colab's T4 GPU runtime #14109

Closed maziyarpanahi closed 2 months ago

maziyarpanahi commented 9 months ago

Is there an existing issue for this?

Who can help?

@danilojsl

What are you working on?

Downloading and loading ONNX models on GPU devices crashes (at least on a T4 in Colab).

Current Behavior

Crashes with:

An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel.
: ai.onnxruntime.OrtException: Error code - ORT_RUNTIME_EXCEPTION - message: /onnxruntime_src/onnxruntime/core/session/provider_bridge_ort.cc:1193 onnxruntime::Provider& onnxruntime::ProviderLibrary::Get() [ONNXRuntimeError] : 1 : FAIL : Failed to load library libonnxruntime_providers_cuda.so with error: libcublasLt.so.11: cannot open shared object file: No such file or directory

    at ai.onnxruntime.providers.OrtCUDAProviderOptions.add(Native Method)
    at ai.onnxruntime.providers.OrtCUDAProviderOptions.<init>(OrtCUDAProviderOptions.java:44)
    at com.johnsnowlabs.ml.onnx.OnnxWrapper$.mapToCUDASessionConfig(OnnxWrapper.scala:152)
    at com.johnsnowlabs.ml.onnx.OnnxWrapper$.mapToSessionOptionsObject(OnnxWrapper.scala:136)
    at com.johnsnowlabs.ml.onnx.OnnxWrapper$.com$johnsnowlabs$ml$onnx$OnnxWrapper$$withSafeOnnxModelLoader(OnnxWrapper.scala:90)
    at com.johnsnowlabs.ml.onnx.OnnxWrapper$.read(OnnxWrapper.scala:122)
    at com.johnsnowlabs.ml.onnx.ReadOnnxModel.readOnnxModel(OnnxSerializeModel.scala:98)
    at com.johnsnowlabs.ml.onnx.ReadOnnxModel.readOnnxModel$(OnnxSerializeModel.scala:75)
    at com.johnsnowlabs.nlp.embeddings.MPNetEmbeddings$.readOnnxModel(MPNetEmbeddings.scala:471)
    at com.johnsnowlabs.nlp.embeddings.ReadMPNetDLModel.readModel(MPNetEmbeddings.scala:416)
    at com.johnsnowlabs.nlp.embeddings.ReadMPNetDLModel.readModel$(MPNetEmbeddings.scala:407)
    at com.johnsnowlabs.nlp.embeddings.MPNetEmbeddings$.readModel(MPNetEmbeddings.scala:471)
    at com.johnsnowlabs.nlp.embeddings.ReadMPNetDLModel.$anonfun$$init$$1(MPNetEmbeddings.scala:424)
    at com.johnsnowlabs.nlp.embeddings.ReadMPNetDLModel.$anonfun$$init$$1$adapted(MPNetEmbeddings.scala:424)
    at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$onRead$1(ParamsAndFeaturesReadable.scala:50)
    at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$onRead$1$adapted(ParamsAndFeaturesReadable.scala:49)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.onRead(ParamsAndFeaturesReadable.scala:49)
    at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$read$1(ParamsAndFeaturesReadable.scala:61)
    at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$read$1$adapted(ParamsAndFeaturesReadable.scala:61)
    at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:38)
    at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:24)
    at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:513)
    at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:505)
    at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.downloadModel(ResourceDownloader.scala:705)
    at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel(ResourceDownloader.scala)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.base/java.lang.Thread.run(Thread.java:829)
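The key part of the trace is the loader message: the CUDA provider for ONNX Runtime could not open libcublasLt.so.11. As a rough diagnostic sketch (not part of Spark NLP; the helper names `missing_shared_library` and `can_load` are made up for illustration), the missing library name can be pulled out of the message and checked against the dynamic loader of the current runtime:

```python
import ctypes
import re
from typing import Optional

# Tail of the error text reported above.
ORT_MESSAGE = (
    "Failed to load library libonnxruntime_providers_cuda.so with error: "
    "libcublasLt.so.11: cannot open shared object file: No such file or directory"
)


def missing_shared_library(message: str) -> Optional[str]:
    """Return the library named in a 'cannot open shared object file' error, if any."""
    match = re.search(r"(\S+\.so(?:\.\d+)*): cannot open shared object file", message)
    return match.group(1) if match else None


def can_load(library_name: str) -> bool:
    """Ask the dynamic loader to open the library; False if it cannot be opened."""
    try:
        ctypes.CDLL(library_name)
        return True
    except OSError:
        return False


if __name__ == "__main__":
    name = missing_shared_library(ORT_MESSAGE)
    if name is not None:
        print(name, "loadable:", can_load(name))
```

If `can_load` reports False for the extracted name on the affected runtime, that would be consistent with the library simply not being on the loader's search path there.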

Expected Behavior

Loading the model should work as it did before upgrading to a newer version of Spark NLP.

Steps To Reproduce

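!pip install spark-nlp pyspark

import sparknlp
from sparknlp.annotator import MPNetEmbeddings

# The crash only appears when the session is started with the GPU build
spark = sparknlp.start(gpu=True)

embeddings = MPNetEmbeddings.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("embeddings")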

Spark NLP version and Apache Spark

Spark NLP version: 5.2.0
Apache Spark version: 3.5.0

Type of Spark Application

Python Application

Java Version

11

Java Home Directory

No response

Setup and installation

No response

Operating System and Version

No response

Link to your project (if available)

No response

Additional Information

No response

danilojsl commented 9 months ago

Hi @maziyarpanahi

I haven't been able to replicate the error. I tried Google Colab with a T4 and it works with spark-nlp 5.2.0. Can you take a look at this notebook, reproduce the error, and let me know? MPNet notebook

maziyarpanahi commented 9 months ago

Hi @danilojsl

You forgot to load the ONNX GPU build in the start function: spark = sparknlp.start(gpu=True). Once the session is started with the GPU builds of ONNX and TF, the ONNX models fail with that error.

maziyarpanahi commented 9 months ago

Some extra information: I can use A100 GPUs without any issue, so this must be something with Colab itself. It is either missing a library or has it in a different version (usually an older one; for GPU we usually do something in the Colab setup script to fix that).
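One way to tell "missing" from "present in a different version" is to list the libcublasLt sonames the dynamic linker knows about. The following is only a sketch under assumptions: it relies on `ldconfig -p` being available on the runtime, it only sees libraries in the linker cache (not ones reachable solely through LD_LIBRARY_PATH), and the helper names are invented for illustration.

```python
import subprocess
from typing import List


def sonames_with_prefix(ldconfig_output: str, prefix: str) -> List[str]:
    """Collect library names from `ldconfig -p` output that start with `prefix`."""
    names = set()
    for line in ldconfig_output.splitlines():
        parts = line.split()
        # Each library line starts with its soname; the header line does not match.
        if parts and parts[0].startswith(prefix):
            names.add(parts[0])
    return sorted(names)


def installed_cublaslt() -> List[str]:
    """Run `ldconfig -p` and return the cached libcublasLt sonames (may be empty)."""
    output = subprocess.run(
        ["ldconfig", "-p"], capture_output=True, text=True, check=True
    ).stdout
    return sonames_with_prefix(output, "libcublasLt.so")
```

Calling `installed_cublaslt()` on the affected runtime and on a working one (such as the A100 machine) and comparing the two lists would show whether libcublasLt.so.11 is absent or only a different major version is installed.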

@danilojsl Let's find out what's missing and how to fix it, then we can modify the GPU installation for Colab accordingly.

danilojsl commented 2 months ago

Hi @maziyarpanahi

This issue no longer occurs with the latest ONNX update in spark-nlp 5.4.0.
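For anyone checking whether their environment already includes that update, a small version comparison can help. This is a sketch only: the 5.4.0 threshold is taken from the comment above, the thread does not say whether any release between 5.2.0 and 5.4.0 is also unaffected, and the helper names are hypothetical.

```python
import re
from typing import Tuple

# Version named in this thread as no longer showing the crash.
FIXED_IN = (5, 4, 0)


def parse_version(version: str) -> Tuple[int, ...]:
    """Parse the leading 'major.minor.patch' of a version string into integers."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def includes_onnx_update(version: str) -> bool:
    """True if the given spark-nlp version is at least FIXED_IN."""
    return parse_version(version) >= FIXED_IN
```

In a notebook, the installed version string can be obtained with sparknlp.version() and passed to `includes_onnx_update`.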

maziyarpanahi commented 2 months ago

Thanks @danilojsl