Open fyang996 opened 1 year ago
It may try to re-extract it if the files are not the same, for example when they come from different versions of TensorFlow. If that can happen in your environment, we need to make sure "jarname" differs as well...
It is always the same jar; it just distributes the compute work to different executors in Spark, and many executors may run on the same machine.
Then it sounds like more than 1 JVM is entering the critical region at the same time, and so the question is, why is the FileLock not working as expected?
Maybe it's because this line should be in the critical region as well: https://github.com/bytedeco/javacpp/blob/master/src/main/java/org/bytedeco/javacpp/Loader.java#L666 Can you try that and see if it fixes your issue?
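For anyone following along, here is a minimal sketch of the pattern being discussed, not the actual Loader.java code: the idea is that the staleness check, the delete, and the re-extraction all happen while an inter-process FileLock is held, so another JVM can never observe a half-deleted cache file. `upToDate` and `extractResource` are hypothetical placeholders.

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;

public class CacheLockSketch {
    static File extractWithLock(File cacheDir, String resourceName) throws Exception {
        cacheDir.mkdirs();
        File lockFile = new File(cacheDir, ".lock");
        try (FileChannel channel = new RandomAccessFile(lockFile, "rw").getChannel();
             FileLock lock = channel.lock()) {            // blocks until no other JVM holds the lock
            File target = new File(cacheDir, resourceName);
            if (!upToDate(target)) {                       // hypothetical staleness check
                target.delete();                           // safe: no other JVM can be extracting or loading right now
                extractResource(resourceName, target);     // hypothetical extraction of the resource
            }
            return target;
        }                                                  // lock released when the try block exits
    }

    static boolean upToDate(File f) { return f.exists(); }              // placeholder
    static void extractResource(String name, File target) { /* ... */ } // placeholder
}
```

If the delete or extraction happens outside that locked region, a second JVM can delete or rewrite the file while the first one is still loading it, which matches the symptoms described here.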
Indeed, that might be the reason!
We actually worked around the issue already with another method: we set this property, https://github.com/bytedeco/javacpp/blob/master/src/main/java/org/bytedeco/javacpp/Loader.java#L999, to a random folder name before Loader gets initialized in each JVM. Each JVM then ends up using a different cache dir even when they are on the same machine, and this fixed the issue.
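A minimal sketch of that workaround, assuming the cache-directory system property linked above (`org.bytedeco.javacpp.cachedir`); the path scheme is just an example, the only requirements are the property name and setting it before anything touches Loader:

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class PerJvmCacheDir {
    public static void main(String[] args) throws Exception {
        // Unique, local cache dir for this JVM, created before Loader initializes
        Path cacheDir = Files.createTempDirectory("javacpp-cache-");
        System.setProperty("org.bytedeco.javacpp.cachedir", cacheDir.toString());
        // ... only now run code that initializes org.bytedeco.javacpp.Loader,
        // e.g. loading a TensorFlow Java model on a Spark executor.
    }
}
```

The same property can also be passed as a JVM flag (`-Dorg.bytedeco.javacpp.cachedir=...`) if setting it in code is not convenient.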
I am not sure how we can change that code to test it; we import tensorflow/java, which brings in javacpp.
You just need to override the version of JavaCPP to 1.5.8-SNAPSHOT and it will pick that version instead of the one TF Java sets.
To confirm, so you already made the change in version 1.5.8-SNAPSHOT?
No, I didn't make any changes, please try to do it locally.
Is the cache getting created on a home mounted with NFS or something? That's known to be tricky to get working with locks. If you set the cache to be something like the local /tmp directory, does this fix the issue?
Yeah, that is very likely. This only fails on our Spark cluster, where I think the home dir is NFS mounted (I need to confirm this though). And as mentioned above, we worked around this issue by setting the cache dir property to a unique (and local) dir per JVM, so they never share the same dir. But I didn't try setting it to the same local directory.
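For reference, testing that hypothesis would look something like the sketch below: keep one cache dir shared by all JVMs, but place it on local disk rather than the NFS-mounted home. The `/tmp/javacpp-cache` path is illustrative only, and the property must be set before Loader initializes.

```java
public class LocalSharedCacheDir {
    public static void main(String[] args) {
        // Shared across JVMs on this machine, but on local disk instead of NFS
        System.setProperty("org.bytedeco.javacpp.cachedir", "/tmp/javacpp-cache");
        // ... then load the TF Java / javacpp-backed code as usual and check
        // whether the FileLock now serializes the extraction correctly.
    }
}
```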
Hi,
We recently hit this race condition issue with the javacpp.Loader class. The issue happens when you have multiple JVMs running on the same machine, like in Spark. It manifests as errors like the ones below when trying to load models using the library.
Looking at the Loader code, the file lock only locks within the cacheResource method: https://github.com/bytedeco/javacpp/blob/master/src/main/java/org/bytedeco/javacpp/Loader.java#L571-L697 And there is logic that deletes the file within the cacheResource method (although I still don't understand why it enters that code block at all if another JVM has already cached the file). If another JVM on the same machine using the same cacheDir (by default /user/home/.javacpp/cache) tries to loadLibrary, it may find the file got deleted by the other JVM, since there is no file lock at that point.