bytedeco / javacpp

The missing bridge between Java and native C++

Race condition in org.bytedeco.javacpp.Loader with multiple JVMs on the same machine #608

Open fyang996 opened 1 year ago

fyang996 commented 1 year ago

Hi,

We recently hit a race condition in the javacpp.Loader class. The issue happens when multiple JVMs are running on the same machine, as in Spark.

It manifests as errors like the ones below when trying to load models using the library.

Caused by: java.lang.UnsatisfiedLinkError: no jnitensorflow in java.library.path
    at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1875)
    at java.lang.Runtime.loadLibrary0(Runtime.java:872)
    at java.lang.System.loadLibrary(System.java:1124)
    at org.bytedeco.javacpp.Loader.loadLibrary(Loader.java:1738)
    at org.bytedeco.javacpp.Loader.load(Loader.java:1345)
    at org.bytedeco.javacpp.Loader.load(Loader.java:1157)
    at org.bytedeco.javacpp.Loader.load(Loader.java:1133)
    at org.tensorflow.internal.c_api.global.tensorflow.<clinit>(tensorflow.java:12)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.bytedeco.javacpp.Loader.load(Loader.java:1212)
    at org.bytedeco.javacpp.Loader.load(Loader.java:1157)
    at org.bytedeco.javacpp.Loader.load(Loader.java:1149)
    at org.tensorflow.NativeLibrary.load(NativeLibrary.java:48)
    at org.tensorflow.TensorFlow.<clinit>(TensorFlow.java:140)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:264)
    at org.tensorflow.Graph.<clinit>(Graph.java:1341)
        ....
Caused by: java.lang.UnsatisfiedLinkError: /user/home/.javacpp/cache/jarname/org/tensorflow/internal/c_api/linux-x86_64/libjnitensorflow.so: libtensorflow_cc.so.2: cannot open shared object file: No such file or directory
    at java.lang.ClassLoader$NativeLibrary.load(Native Method)
    at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1950)
    at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1832)
    at java.lang.Runtime.load0(Runtime.java:811)
    at java.lang.System.load(System.java:1088)
    at org.bytedeco.javacpp.Loader.loadLibrary(Loader.java:1685)
    ... 104 more

Looking at the Loader code, the file lock is only held inside the cacheResource method: https://github.com/bytedeco/javacpp/blob/master/src/main/java/org/bytedeco/javacpp/Loader.java#L571-L697 And there is logic inside cacheResource that deletes the file (although I still don't understand why it enters that code block at all if another process has already cached the file):

                        file.delete();
                        extractResource(resourceURL, file, null, null, true);
                        file.setLastModified(timestamp);

If another JVM on the same machine uses the same cacheDir (by default /user/home/.javacpp/cache) and tries to loadLibrary at that moment, it may find that the file has been deleted by the other JVM, since there is no file lock protecting this step.
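For illustration only (this is not JavaCPP's actual code), here is a minimal sketch of the pattern under discussion: a cross-process java.nio.channels.FileLock that has to cover every step that can make the cached file temporarily disappear, including the delete and re-extract shown above. The lock file name and the extractTo helper are placeholders.

    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.channels.FileChannel;
    import java.nio.channels.FileLock;

    class CacheRefreshSketch {
        // Re-extract a cached native library while holding a cross-process lock.
        static void refreshCachedFile(File cacheDir, File file, long timestamp) throws IOException {
            File lockFile = new File(cacheDir, ".lock"); // placeholder lock file
            try (RandomAccessFile raf = new RandomAccessFile(lockFile, "rw");
                 FileChannel channel = raf.getChannel();
                 FileLock lock = channel.lock()) {       // blocks until no other process holds it
                // Everything that mutates the cached file must happen while the lock is held;
                // otherwise another JVM can observe the window where the file is missing.
                file.delete();
                extractTo(file);                         // stand-in for the real extraction
                file.setLastModified(timestamp);
            }                                            // lock released when the try block exits
        }

        static void extractTo(File file) throws IOException {
            // placeholder for Loader.extractResource(resourceURL, file, null, null, true)
            if (!file.createNewFile()) {
                throw new IOException("could not recreate " + file);
            }
        }
    }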

saudet commented 1 year ago

It may try to re-extract it if the files are not the same, for example from different versions of TensorFlow. If that is something that can happen in your environment, we need to make sure "jarname" differs as well...

fyang996 commented 1 year ago

It is always the same jar; Spark just distributes the compute work to different executors, and many executors may run on the same machine.

saudet commented 1 year ago

Then it sounds like more than 1 JVM is entering the critical region at the same time, and so the question is, why is the FileLock not working as expected?

saudet commented 1 year ago

Maybe it's because this line should be in the critical region as well: https://github.com/bytedeco/javacpp/blob/master/src/main/java/org/bytedeco/javacpp/Loader.java#L666 Can you try that and see if it fixes your issue?

fyang996 commented 1 year ago

Indeed, that might be the reason!

We actually already worked around the issue in another way: we set this property: https://github.com/bytedeco/javacpp/blob/master/src/main/java/org/bytedeco/javacpp/Loader.java#L999 to a random folder name before Loader gets initialized in each JVM, so each JVM ends up using a different cache dir even when they run on the same machine. This fixed the issue.
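For anyone else hitting this, a minimal sketch of that workaround, assuming the property referenced at that line is org.bytedeco.javacpp.cachedir (the temp-directory naming is arbitrary):

    import java.nio.file.Files;
    import java.nio.file.Path;

    class UniqueCachePerJvm {
        public static void main(String[] args) throws Exception {
            // Must run before anything triggers org.bytedeco.javacpp.Loader initialization.
            Path cacheDir = Files.createTempDirectory("javacpp-cache-");
            System.setProperty("org.bytedeco.javacpp.cachedir", cacheDir.toString());

            // ... now load TensorFlow / other JavaCPP-backed classes as usual ...
        }
    }

The same property can also be passed per executor on the command line, e.g. -Dorg.bytedeco.javacpp.cachedir=/some/local/dir (placeholder path).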

I am not sure how we can change that code to test it; we import tensorflow/java, which brings in javacpp.

saudet commented 1 year ago

You just need to override the version of JavaCPP to 1.5.8-SNAPSHOT and it will pick that version instead of the one TF Java sets.

fyang996 commented 1 year ago

To confirm: have you already made that change in version 1.5.8-SNAPSHOT?

saudet commented 1 year ago

No, I didn't make any changes, please try to do it locally.

saudet commented 1 year ago

Is the cache getting created on a home directory mounted over NFS or something? That is known to be tricky to get working with file locks. If you set the cache to something like the local /tmp directory, does that fix the issue?

fyang996 commented 1 year ago

Yeah, that is very likely. This only fails on our Spark cluster, where I think the home dir is NFS mounted (I still need to confirm this). And as mentioned above, we worked around the issue by setting the cache dir property to a unique (and local) dir per JVM, so they never share the same dir. But I didn't try setting it to the same local directory.
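As a quick way to check that hypothesis, a minimal probe like the following (paths and timing are arbitrary) can show whether FileLock actually excludes a second process on a given filesystem; run two copies concurrently against the same path, once on the NFS home and once on local /tmp:

    import java.io.File;
    import java.io.RandomAccessFile;
    import java.nio.channels.FileChannel;
    import java.nio.channels.FileLock;

    class FileLockProbe {
        // If two concurrent runs against the same path both print "acquired" at the
        // same time, file locking is not effective on that filesystem.
        public static void main(String[] args) throws Exception {
            File target = new File(args.length > 0 ? args[0] : "/tmp/javacpp-lock-probe");
            try (RandomAccessFile raf = new RandomAccessFile(target, "rw");
                 FileChannel channel = raf.getChannel();
                 FileLock lock = channel.lock()) {
                System.out.println("acquired lock on " + target + ", holding for 10 seconds");
                Thread.sleep(10_000);
            }
            System.out.println("released");
        }
    }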