JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0

Mac M1: `jnitensorflow` error with `BertEmbeddings.pretrained` #13079

Closed rwoodard-prog closed 1 year ago

rwoodard-prog commented 1 year ago

(BTW, JSL team does great work--thank you!)

Description

On Mac M1, BertEmbeddings.pretrained() crashes with error:

no jnitensorflow in java.library.path

I recognize that Tensorflow and SparkNLP on Mac M1 is a long, ongoing discussion and I have read many, many online posts, issues, PRs, etc. I am posting this issue because JSL installation instructions imply that all should work on a Mac M1. I am hoping to consolidate and clarify discussions in this issue.

Is it truly a bug or just one slightly wrong java/scala/spark/JSL env var for me?

I cross posted this issue w/ a known working JSL demo project at https://github.com/maziyarpanahi/spark-nlp-starter/issues/1.

Thank you for any help with this.

Expected Behavior

It should not crash.

Current Behavior

Startup:

$ spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.3
...
com.johnsnowlabs.nlp#spark-nlp-m1_2.12 added as a dependency
...
    found com.johnsnowlabs.nlp#tensorflow-m1_2.12;0.4.3 in central
:: resolution report :: resolve 995ms :: artifacts dl 43ms
    :: modules in use:
    com.amazonaws#aws-java-sdk-bundle;1.11.828 from central in [default]
...
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.2.2
      /_/

Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 1.8.0_292)
Type in expressions to have them evaluated.
Type :help for more information.

Code:

scala> import com.johnsnowlabs.nlp.SparkNLP

scala> val spark = SparkNLP.start(m1 = true)

scala> import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

scala> val explainDocumentPipeline = PretrainedPipeline("explain_document_ml")
explain_document_ml download started this may take some time.
Approximate size to download 9.2 MB
Download done! Loading the resource.
explainDocumentPipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline =
   PretrainedPipeline(explain_document_ml,en,public/models,false,None)

scala> val annotations = explainDocumentPipeline.annotate(
    "We are very happy about SparkNLP")
annotations: Map[String,Seq[String]] = Map(
    document -> List(We are very happy about SparkNLP), 
    spell -> ArraySeq(We, are, very, happy, about, SparkNLP), 
    pos -> ArrayBuffer(PRP, VBP, RB, JJ, IN, NNP), 
    lemmas -> ArraySeq(We, be, very, happy, about, SparkNLP), 
    token -> ArraySeq(We, are, very, happy, about, SparkNLP), 
    stems -> ArraySeq(we, ar, veri, happi, about, sparknlp), 
    sentence -> ArraySeq(We are very happy about SparkNLP))

scala> import com.johnsnowlabs.nlp.annotator.BertEmbeddings
import com.johnsnowlabs.nlp.annotator.BertEmbeddings

scala> val electra = BertEmbeddings.pretrained("electra_base_uncased", "en")
electra_base_uncased download started this may take some time.
Approximate size to download 389.1 MB
Download done! Loading the resource.
java.lang.UnsatisfiedLinkError: no jnitensorflow in java.library.path
  at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1860)
...
Caused by: java.lang.UnsatisfiedLinkError: Could not find jnitensorflow in class, module, and library paths.
  at org.bytedeco.javacpp.Loader.loadLibrary(Loader.java:1705)
  ... 97 more

Possible Solution

Steps to Reproduce

  1. See above.

Context

I want to develop and test w/ an IDE on local Mac M1 then deploy to Databricks.

Your Environment

Hardware:

  Model Name:   MacBook Pro
  Model Identifier: MacBookPro17,1
  Chip: Apple M1
  Total Number of Cores:    8 (4 performance and 4 efficiency)
  Memory:   16 GB

Software:

  System Version:   macOS 12.6.1 (21G217)
  Kernel Version:   Darwin 21.6.0
  Boot Volume:  Macintosh HD

java -version
openjdk version "1.8.0_292"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_292-b10)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.292-b10, mixed mode)

scala> SparkNLP.version()
res1: String = 4.2.3

scala> spark.version
res2: String = 3.2.2

DevinTDHa commented 1 year ago

Hi @rwoodard-prog,

Thanks for reporting this, M1 support is still experimental and it is always good to iron out these issues.

I tried to reproduce it on my 2020 M1 MacBook Air, but neither spark-shell.sh nor the spark-submit example from the spark-nlp-starter issue produced the behaviour you described.

Could you please provide me with the following information:

rwoodard-prog commented 1 year ago

Thank you for the quick reply and for helping with this. I would not be surprised if it is a classpath thing--that world is always a bit murky for me. Here is more info:

$ arch
arm64

$ echo $JAVA_HOME
/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home

$ java -version
openjdk version "1.8.0_292"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_292-b10)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.292-b10, mixed mode)

# Since I have multiple versions of Java on my machine,
# could the shims be an issue?
$ which java
/Users/me/.jenv/shims/java

$ echo $SPARK_HOME
/Users/me/local/tarballs/spark/3.2.2/spark-3.2.2-bin-hadoop3.2

# I downloaded a fresh tarball and checked sigs from
# https://spark.apache.org/downloads.html
$ ls -l /Users/me/local/tarballs/spark/3.2.2/
total 588336
-rw-r--r--@  1     104403 Nov 11 10:17 KEYS.txt
drwxr-xr-x@ 17        544 Jul 11 10:01 spark-3.2.2-bin-hadoop3.2
-rw-r--r--@  1  301112604 Nov 11 10:10 spark-3.2.2-bin-hadoop3.2.tgz
-rw-r--r--@  1        862 Nov 11 10:13 spark-3.2.2-bin-hadoop3.2.tgz.asc
-rw-r--r--@  1        160 Nov 11 10:31 spark-3.2.2-bin-hadoop3.2.tgz.sha512.txt

$ spark-shell \
    --driver-java-options \
    "-Dorg.tensorflow.NativeLibrary.DEBUG=1 -Dorg.bytedeco.javacpp.logger.debug=true" \
    --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.3

...

scala> val electra = BertEmbeddings.pretrained("electra_base_uncased", "en")
electra_base_uncased download started this may take some time.
Approximate size to download 389.1 MB
Download done! Loading the resource.
Debug: Loading class org.bytedeco.javacpp.presets.javacpp
Debug: Loading class org.bytedeco.javacpp.Loader
Debug: Loading library jnijavacpp
Debug: Failed to load for jnijavacpp: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path
Debug: Could not load Loader: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path
Debug: Loading class org.tensorflow.internal.c_api.global.tensorflow
Debug: Loading class org.tensorflow.internal.c_api.global.tensorflow
Debug: Loading library iomp5
Debug: Failed to load for iomp5: java.lang.UnsatisfiedLinkError: no iomp5 in java.library.path
Debug: Loading library mklml
Debug: Failed to load for mklml: java.lang.UnsatisfiedLinkError: no mklml in java.library.path
Debug: Loading library mklml_intel
Debug: Failed to load for mklml_intel: java.lang.UnsatisfiedLinkError: no mklml_intel in java.library.path
Debug: Loading library tensorflow_framework
Debug: Failed to load for tensorflow_framework@.2: java.lang.UnsatisfiedLinkError: no tensorflow_framework in java.library.path
Debug: Loading library tensorflow_cc
Debug: Failed to load for tensorflow_cc@.2: java.lang.UnsatisfiedLinkError: no tensorflow_cc in java.library.path
Debug: Loading library jnitensorflow
Debug: Failed to load for jnitensorflow: java.lang.UnsatisfiedLinkError: no jnitensorflow in java.library.path
java.lang.UnsatisfiedLinkError: no jnitensorflow in java.library.path

Out of personal curiosity, I poked into the "Failed to load" lines above:

$ jar tf /Users/ryanwoodard/.ivy2/jars/com.johnsnowlabs.nlp_tensorflow-m1_2.12-0.4.3.jar | grep jnijavacpp
META-INF/native-image/macosx-arm64/jnijavacpp/
META-INF/native-image/macosx-arm64/jnijavacpp/jni-config.json
META-INF/native-image/macosx-arm64/jnijavacpp/reflect-config.json
META-INF/native-image/macosx-arm64/jnijavacpp/resource-config.json
org/bytedeco/javacpp/macosx-arm64/libjnijavacpp.dylib

$ jar tf /Users/ryanwoodard/.ivy2/jars/com.johnsnowlabs.nlp_tensorflow-m1_2.12-0.4.3.jar | grep iomp5

$ jar tf /Users/ryanwoodard/.ivy2/jars/com.johnsnowlabs.nlp_tensorflow-m1_2.12-0.4.3.jar | grep mklml

$ jar tf /Users/ryanwoodard/.ivy2/jars/com.johnsnowlabs.nlp_tensorflow-m1_2.12-0.4.3.jar | grep tensorflow_framework
org/tensorflow/internal/c_api/macosx-arm64/libtensorflow_framework.2.dylib

$ jar tf /Users/ryanwoodard/.ivy2/jars/com.johnsnowlabs.nlp_tensorflow-m1_2.12-0.4.3.jar | grep tensorflow_cc
org/tensorflow/internal/c_api/macosx-arm64/libtensorflow_cc.2.dylib

$ jar tf /Users/ryanwoodard/.ivy2/jars/com.johnsnowlabs.nlp_tensorflow-m1_2.12-0.4.3.jar | grep jnitensorflow
org/tensorflow/internal/c_api/macosx-arm64/libjnitensorflow.dylib
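
The six separate greps above can be folded into one pass. The following is a minimal sketch, not part of the original report: `check_natives` is a hypothetical helper name, and the `lib<name>[.N].dylib` naming pattern is an assumption based on the jar listing shown here.

```shell
# Report, for each native library name given as an argument, whether a
# jar listing read from stdin contains a matching .dylib entry.
# Hypothetical helper; the lib<name>[.N].dylib pattern is an assumption.
check_natives() {
    listing=$(cat)    # read the whole jar listing once
    for lib in "$@"; do
        if printf '%s\n' "$listing" | grep -Eq "lib${lib}(\.[0-9]+)*\.dylib$"; then
            echo "$lib: present"
        else
            echo "$lib: missing"
        fi
    done
}
```

Piping `jar tf <path-to-jar>` into `check_natives jnijavacpp iomp5 mklml tensorflow_framework tensorflow_cc jnitensorflow` should, for the listing in this comment, report iomp5 and mklml as missing and the other four as present.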

I do not think the following matters, but because of some network settings on my local machine and some protobuf issues, the actual command I use to run Spark is:

PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python \
  SPARK_LOCAL_IP=127.0.0.1 \
  SPARK_MASTER_HOST=127.0.0.1 \
  spark-shell \
  --driver-java-options \
  "-Dorg.tensorflow.NativeLibrary.DEBUG=1 -Dorg.bytedeco.javacpp.logger.debug=true" \
  -c spark.driver.bindAddress=127.0.0.1 \
  --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.3

rwoodard-prog commented 1 year ago

Huge help from @DevinTDHa over Slack. Thank you!

The problem was my java env and he helped me figure it out. I will put all the steps here for reference but he was guiding my typing.

Above, I showed that my java version seemed correct:

$ java -version
openjdk version "1.8.0_292"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_292-b10)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.292-b10, mixed mode)

But since I have jenv installed controlling all my java envs, I found out that the correct version was not globally set:

$ jenv versions
  system
* 1.8 (set by /Users/me/.jenv/version)
  1.8.0.292
  openjdk64-1.8.0.292

So I corrected that:

jenv global openjdk64-1.8.0.292

That did not solve the problem, though.
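
One way to see which architecture the resolved JVM reports is sketched below. This is a hedged aside rather than the diagnosis from the Slack session: `jvm_os_arch` is a hypothetical helper, and the idea that a JVM reporting x86_64 would fail to find the `macosx-arm64` natives listed earlier is an assumption, not something confirmed in this thread.

```shell
# Extract the os.arch value from the output of
# `java -XshowSettings:properties -version`, read on stdin.
# Hypothetical helper for illustration.
jvm_os_arch() {
    sed -n 's/^[[:space:]]*os\.arch = //p' | head -n 1
}

# Usage on a real machine (the settings are printed on stderr):
#   java -XshowSettings:properties -version 2>&1 | jvm_os_arch
#   uname -m    # machine architecture as seen by the current shell
```

If the two values disagree, the JVM is probably not a native build for the machine, which would be worth ruling out before changing anything else.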

@DevinTDHa suggested I install sdkman. I did that and then installed the java version recommended on the JSL docs page:

$ sdk install java 11.0.17-zulu
Downloading: java 11.0.17-zulu
In progress...
...
Done installing!
Setting java 11.0.17-zulu as default.

$ which java
/Users/me/.sdkman/candidates/java/current/bin/java

$ java -version
openjdk version "11.0.17" 2022-10-18 LTS
OpenJDK Runtime Environment Zulu11.60+19-CA (build 11.0.17+8-LTS)
OpenJDK 64-Bit Server VM Zulu11.60+19-CA (build 11.0.17+8-LTS, mixed mode)
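
The two `java -version` outputs in this comment use different version schemes (`1.8.0_292` versus `11.0.17`). A small sketch for extracting the major version from either form follows; `java_major` is a hypothetical helper. The thread does not say whether the Java version or the particular build was the deciding factor, so treat this only as a sanity check before launching spark-shell.

```shell
# Print the Java major version from `java -version` output read on stdin.
# Hypothetical helper for illustration.
java_major() {
    # Take the quoted version string from the first line,
    # e.g. "1.8.0_292" or "11.0.17".
    ver=$(sed -n '1s/.*version "\([^"]*\)".*/\1/p')
    case "$ver" in
        1.*) echo "$ver" | cut -d. -f2 ;;   # legacy scheme: 1.8.0_292 -> 8
        *)   echo "$ver" | cut -d. -f1 ;;   # modern scheme: 11.0.17 -> 11
    esac
}

# Usage on a real machine (the version banner is printed on stderr):
#   java -version 2>&1 | java_major
```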

Since I want to use large embeddings models, I need to allocate the driver memory when I start spark-shell, as described in the JSL README:

$ spark-shell \
  --driver-memory 16g \
  --conf spark.kryoserializer.buffer.max=2000M \
  -c spark.driver.bindAddress=127.0.0.1 \
  --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.3

Success:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.2.2
      /_/

Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 11.0.17)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import com.johnsnowlabs.nlp.SparkNLP
import com.johnsnowlabs.nlp.SparkNLP

scala> val spark = SparkNLP.start(m1 = true)
22/11/16 10:26:07 WARN SparkSession$Builder: Using an existing SparkSession; some spark core configurations may not take effect.
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@5379839c

scala> import com.johnsnowlabs.nlp.annotator.BertEmbeddings
import com.johnsnowlabs.nlp.annotator.BertEmbeddings

scala> val electra = BertEmbeddings.pretrained("electra_base_uncased", "en")
electra_base_uncased download started this may take some time.
Approximate size to download 389.1 MB
Download done! Loading the resource.
2022-11-16 10:26:25.269103: W external/org_tensorflow/tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz
electra: com.johnsnowlabs.nlp.embeddings.BertEmbeddings = BERT_EMBEDDINGS_eb74c608d962

And there was much rejoicing!

Thank you, @DevinTDHa, and JSL.

maziyarpanahi commented 1 year ago

Thanks for providing this information here, @rwoodard-prog. I appreciate it.

@DevinTDHa Could you please add these to our docs here if something is missing? https://nlp.johnsnowlabs.com/docs/en/install#installation-for-m1-macs

Many thanks

datarefactorynexus commented 9 months ago

I'm using M3. I added the following dependency to my POM

<dependency>
  <groupId>com.johnsnowlabs.nlp</groupId>
  <artifactId>tensorflow-m1_2.12</artifactId>
  <version>0.4.4</version>
</dependency>

And it works like a charm.

maziyarpanahi commented 9 months ago

Thanks for the update! It seems the base M1, M2, and M3 chips work fine with the correct configuration and Java version, but variants such as Pro, Max, and Ultra are not the same build, and they fail.