bytedeco / javacpp-presets

The missing Java distribution of native C++ libraries

Lib load fails under SBT, succeeds with plain java #1203

Open jxtps opened 2 years ago

jxtps commented 2 years ago

If I create a small sample project with the following class:

package misc;

import org.bytedeco.pytorch.Device;
import org.bytedeco.pytorch.global.torch;

public class MyTest {

    public static void debug(String prop) {
        System.out.println(prop + "=" + System.getProperty(prop));
        System.setProperty(prop, "true");
    }

    public static void main(String[] args) {
        debug("org.bytedeco.javacpp.pathsFirst");
        debug("org.bytedeco.javacpp.logger.debug");
        Device cpu = new Device(torch.DeviceType.CPU);
    }
}

Then I can run that main function "standalone" in IntelliJ without problems, and the loading of a large number of libraries flashes by quickly.

However, if I instead run the Play project (by creating a Play 2 run profile in IntelliJ), then when the controller calls MyTest.main(), some of the libraries flash by, but the library loading then hangs on:

Debug: Loading C:\Users\jxtps\.javacpp\cache\bin\libopenblas_nolapack.dll

If I then stop the server, the loading suddenly continues with the same javacpp debug printouts as when I ran MyTest.main() separately, but shortly thereafter the whole process exits.

This is really strange and I have no idea why it's happening. Right now I'm basically stuck with JavaCPP working great in isolation, but as soon as I try to use it within my sbt/play project it just freezes everything.

This is using Oracle Corporation Java 1.8.0_241 on Windows, sbt 1.7.1, Play 2.8.16, Scala 2.13.8 and "org.bytedeco" % "pytorch-platform" % s"1.10.2-1.5.7". I have the relevant libtorch DLLs all on the path (hence the org.bytedeco.javacpp.pathsFirst=true).

???

saudet commented 2 years ago

There might be a deadlock happening somewhere. Is it trying to load JavaCPP from multiple threads at the same time?

saudet commented 2 years ago

You may also want to try with "org.bytedeco.javacpp.cachelibraries" and "org.bytedeco.javacpp.findlibraries" set to "false" with the latest snapshots: http://bytedeco.org/builds/

jxtps commented 2 years ago

The snapshot with those additional settings gets stuck on:

Debug: Loading library libopenblas

Running MyTest.main() standalone does not get stuck (it errors out since I don't have the correct 1.12 libs in my path, but that's arguably separate).

Deadlock: There are quite a few threads running, but from what I can tell only a single one is accessing the library at the time of the freeze.

saudet commented 2 years ago

It would help to see the stack trace of that thread, to see on which line it gets stuck.
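
One way to capture such a trace without a debugger, sketched under the assumption that the code can be called from a separate watchdog thread inside the same JVM while the load is stuck. The class and method names are made up for illustration; running the JDK's jstack tool against the process ID gives comparable output from outside the process.

```java
import java.util.Map;

// Hypothetical helper, not part of JavaCPP: formats the stack of every live thread.
final class ThreadDump {
    static String dumpAllThreads() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            // Header line: thread name and its current state (RUNNABLE, BLOCKED, ...).
            sb.append('"').append(thread.getName()).append("\" state=")
              .append(thread.getState()).append('\n');
            // One line per frame, innermost frame first.
            for (StackTraceElement frame : entry.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }
}
```

The thread that is stuck in the library load should show a frame in the JavaCPP loader near the top of its stack, which is the line being asked about here.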

jxtps commented 2 years ago

Ok, this is interesting: while digging up the stack traces it all of a sudden started working! I had to revert to sbt 1.3.13 due to issues with incremental compilation loops in 1.7.1 on Windows, and that fixed it.

When using sbt 1.7.1 it hangs here:

[Two screenshots of the stack trace; images not preserved.]

I figured the line in bytedeco would be the most relevant.

The actual freeze happens in native void load(String name, boolean isBuiltin); which is on line 1719 in java.lang.ClassLoader in my version of java (1.8.0_241).

saudet commented 2 years ago

That means something is interfering with OpenBLAS itself, that's not related to JavaCPP per se. Since MKL is usually faster than OpenBLAS anyway, try to load that instead by setting the "org.bytedeco.openblas.load" system property to "mkl_rt": https://github.com/bytedeco/javacpp-presets/tree/master/openblas#documentation
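
A minimal sketch of setting that property from code, assuming it runs before any class that triggers the OpenBLAS load is touched. The property name and value are the ones quoted above; the class and method names are hypothetical.

```java
// Hypothetical helper: asks the OpenBLAS preset to try MKL first.
final class BlasSelection {
    /** Sets the property and returns its previous value (null if it was unset). */
    static String preferMkl() {
        // Must happen before the first use of a class that loads the BLAS library;
        // setting it afterwards has no effect on an already loaded library.
        return System.setProperty("org.bytedeco.openblas.load", "mkl_rt");
    }
}
```

The same value can be passed on the JVM command line as -Dorg.bytedeco.openblas.load=mkl_rt. Either way, the MKL binaries themselves still have to be available to the loader.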

jxtps commented 2 years ago

Well, when I do that it does try mkl_rt first, but since I don't actually have those libs it falls back on openblas, and then freezes...

What I find really confusing is how the sbt version could have an impact here at all. I guess one of its zillion dependencies does something funky.

saudet commented 2 years ago

I've upgraded the presets for OpenBLAS 0.3.21 (and PyTorch 1.12.1 for that matter), which might contain some fixes in there. Please give it a try with the snapshots: http://bytedeco.org/builds/

saudet commented 1 year ago

It would help to see the stack trace of that thread to see on which line it gets stuck.

jxtps commented 8 months ago

Ok, so I'm back to upgrading our stack, and (most unfortunately) back on this issue. I have created yet another minimally reproducing project, but things have changed:

Java: Amazon.com Inc. Java 17.0.9 (Corretto)
Scala: 2.13.12
Sbt: 1.9.8
Play: 2.9.1
"org.bytedeco" % "pytorch-platform" % "2.0.1-1.5.9"

It still works in Java, but now it works under SBT when org.bytedeco.javacpp.logger.debug=true, but fails when org.bytedeco.javacpp.logger.debug=false.

C:\Users\admin\.javacpp\cache\openblas-0.3.23-1.5.9-windows-x86_64.jar\org\bytedeco\openblas\windows-x86_64\libopenblas_nolapack.dll still appears to be the culprit

The line numbers have changed, but the stack trace is very similar to before:

[Two screenshots of the stack trace; images not preserved.]

sbt_javacpp2.zip

jxtps commented 8 months ago

Hmm... debug=true doesn't reliably allow it to run, but it sometimes works. I haven't pinned down when/why.

saudet commented 8 months ago

Well, like I said before, if it works fine with MKL, just do that

jxtps commented 8 months ago

Ok, so this appears to be an issue where the initialization code isn't thread safe. I'm not sure why it's getting hit from multiple fronts, but it seems to be. When I add synchronized to a key piece of my external initialization code (MyTest.main in the sample project) it works regardless of whether debug is true or false.

(technically I haven't tested the minimally reproducing example, this is in my actual application)

Maybe we should add synchronized to some key function in the load chain hierarchy?
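
For reference, the application-side workaround could look roughly like this. It is only a sketch: NativeInit and ensureInitialized are hypothetical names, and the Runnable stands in for whatever first touches the native library (in the sample project, the body of MyTest.main).

```java
// Hypothetical guard: lets exactly one thread run the native initialization.
final class NativeInit {
    private static final Object LOCK = new Object();
    private static boolean initialized = false;

    static void ensureInitialized(Runnable loadNatives) {
        synchronized (LOCK) {
            // Later callers block until the first one finishes, then skip the load.
            if (!initialized) {
                loadNatives.run();
                initialized = true;
            }
        }
    }
}
```

The lock is a static field, so it only protects callers that see the same copy of this class. If the class is loaded by more than one class loader, each copy gets its own lock.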

saudet commented 8 months ago

You can try, but I don't think it's going to help, because that doesn't synchronize between multiple instances of the same class. We would need to use file synchronization instead, and I've heard of issues with that not working well on Windows. Please do feel free to debug that, though.
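
A rough sketch of what file-based synchronization could look like with java.nio file locks. This is not JavaCPP's code; the class name, method name, and lock file are assumptions for illustration. Within a single JVM, FileChannel.lock() throws OverlappingFileLockException instead of blocking when the file is already locked, so the sketch retries in that case.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: runs an action while holding an exclusive lock on a file.
final class LoadFileLock {
    static void withFileLock(Path lockFile, Runnable action) {
        try (FileChannel channel = FileChannel.open(lockFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            FileLock lock = null;
            while (lock == null) {
                try {
                    lock = channel.lock(); // blocks while another process holds the lock
                } catch (OverlappingFileLockException e) {
                    Thread.sleep(10);      // this JVM already holds it; wait and retry
                }
            }
            try {
                action.run();
            } finally {
                lock.release();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for " + lockFile, e);
        }
    }
}
```

Because the lock lives in the file system rather than in a static field, it does not depend on which class loader loaded the calling class. Whether it behaves reliably on Windows is exactly the open question raised here.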

saudet commented 8 months ago

See issue https://github.com/bytedeco/javacpp/issues/197, for example

jxtps commented 8 months ago

This is a weird one. Last night in my dev branch it was seemingly working fine. Then I backported some of the changes to be able to do a hotfix release, and now it's not working anymore. Very confusing.

Switching to using MKL appears to have fixed it.