apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0
14.23k stars 3.47k forks

[Java][C++] Separate JNI compilation & linking from main arrow CMakeLists #32306

Open asfimport opened 2 years ago

asfimport commented 2 years ago

We need to separate the JNI elements from the main Arrow CMakeLists, likely with related modifications to the CI build scripts. Separating the JNI portion serves two related purposes:

  1. Simplify building JNI code against precompiled libarrow C++ code
  2. Enable control of the JNI build through Maven, rather than requiring Java devs to work with CMake directly

    @davisusanibar

    @kou  

Reporter: Larry White / @lwhite1

Subtasks:

Note: This issue was originally created as ARROW-16992. Please see the migration documentation for further details.

asfimport commented 2 years ago

Kouhei Sutou / @kou: I've considered this, and here are my opinions:

asfimport commented 2 years ago

Larry White / @lwhite1: Thank you @kou. This is super helpful. I have a few thoughts/comments/questions. 

> It seems that we need to use Native Maven Plugin: https://www.mojohaus.org/maven-native/native-maven-plugin/

I think this is correct. There is a decent post on how this works here: https://medium.com/geekculture/a-simple-guide-to-java-native-interface-jni-using-native-maven-plugin-e01f4077a8a5

> Native Maven Plugin doesn't have a feature to choose shared object type for the current environment (e.g. .so for Linux)

The post mentions another limitation: there is no Environment Factory for a recent MSVC compiler (the author created one for MSVC2017). The project does seem reasonably active, though: https://github.com/mojohaus/maven-native

> -> Normal Java developers don't need to think about JNI and CMake. Java developers who also work on JNI need to know CMake, but it will not be too hard because they know C++.

I think this is a key point, but I would twist it around to another perspective: if we follow a strategy of wrapping as much native code as possible (to avoid rewriting it in Java), then most contributors will need to work on JNI. As you mentioned, they will need to be able to at least read C++ code, and so will probably have some familiarity with CMake. They will also incur the other overhead of cross-platform development (e.g. having the correct compiler installed, configuring an IDE properly, long build times), even if we do the build using Maven. Unfortunately, JNI expertise is not widespread in the Java community. Overall, I don't think wrapping JNI compilation with Maven will move the needle much in terms of opening up Arrow Java to POJD (plain old Java developers), even if we are able to address, for example, the limitations in the Native Maven Plugin.

> We should use CMake for building shared libraries (.so, .dylib and .dll) for JNI
>
> We should remove all JNI-related code from cpp/
>
> We should add java/CMakeLists.txt to build all shared libraries for JNI

I agree with all these points. @davisusanibar, what do you think?

> I can provide a PoC implementation of this idea if you need.

That would be awesome. Thank you.

 

asfimport commented 2 years ago

David Dali Susanibar Arce / @davisusanibar: I agree with all of these points.

A PoC would help us a lot to see how the JNI Java modules can be built in isolation; we could then try invoking that build from the Maven side.

asfimport commented 2 years ago

Kouhei Sutou / @kou: OK. I'll create a top-level CMakeLists.txt in ARROW-17080 and move the dataset build configuration from cpp/ to java/ in ARROW-17081. Other components such as Gandiva and Plasma will be able to follow the same approach, using these changes as an example.
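
For readers following along, the layout agreed on in this thread might look roughly like the java/CMakeLists.txt below. This is only a sketch under stated assumptions, not the contents of ARROW-17080 or ARROW-17081: the project name, the `arrow_dataset_jni` target, the source path, and the imported Arrow target name are all hypothetical, and it presumes a precompiled Arrow C++ installation whose CMake package files can be found through `CMAKE_PREFIX_PATH`.

```cmake
# Hypothetical java/CMakeLists.txt sketch. All names below are assumptions.
cmake_minimum_required(VERSION 3.16)
project(arrow-java-jni LANGUAGES CXX)

# Locate the JDK's JNI headers (jni.h); sets JNI_INCLUDE_DIRS.
find_package(JNI REQUIRED)

# Use a precompiled Arrow C++ installation instead of building it here.
# Assumes the Arrow and ArrowDataset package files are discoverable,
# for example via -DCMAKE_PREFIX_PATH=<arrow install prefix>.
find_package(Arrow REQUIRED)
find_package(ArrowDataset REQUIRED)

# SHARED lets CMake pick the platform's extension (.so, .dylib or .dll).
add_library(arrow_dataset_jni SHARED
  dataset/src/main/cpp/jni_wrapper.cc)  # hypothetical source path

target_include_directories(arrow_dataset_jni PRIVATE ${JNI_INCLUDE_DIRS})

# The imported target name is an assumption; depending on the Arrow
# version it may be namespaced (e.g. ArrowDataset::arrow_dataset_shared).
target_link_libraries(arrow_dataset_jni PRIVATE arrow_dataset_shared)
```

Because the library is declared `SHARED`, CMake itself selects the platform's file extension, which is the feature the Native Maven Plugin was said to lack. Under this arrangement Maven would only need to invoke CMake and package the resulting library.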