apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Ruby] JRuby support atop Java Arrow bindings #35589

Open headius opened 1 year ago

headius commented 1 year ago

Describe the enhancement requested

JRuby users would benefit from support in red-arrow, but currently the only Ruby bindings available use a native extension. Since JRuby does not support native extensions, we would want to add support by leveraging the Java bindings for Arrow.

This could be done in pure Ruby using JRuby's Java integration layer, and as needed for performance we could move some of that code into Java later.

I would be willing to help with this but I am unfamiliar with Arrow and the Ruby API that wraps it. JRuby's Java integration is very easy to use, however, and mimicking the C extension using JRuby + Ruby + Java integration should go pretty quickly.

Component(s)

Ruby

kou commented 1 year ago

Great!

The Java implementation is available at https://repo.maven.apache.org/maven2/org/apache/arrow/ . Does JRuby have a standard way to install Java packages that are available in a Maven repository?

I'm not familiar with the Java implementation API yet but we'll be able to wrap the API step by step. I think that we should wrap ValueVector https://arrow.apache.org/docs/java/vector.html as Arrow::Array as the first step.
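As a rough illustration of that first step, here is a minimal sketch of an Array-like wrapper over a vector. Everything in it is hypothetical: the module and class names are made up for illustration and are not red-arrow's API. It assumes the wrapped object responds to get_value_count, is_null(i) and get_object(i), which should be the snake_case forms JRuby exposes for ValueVector#getValueCount, #isNull and #getObject. A plain-Ruby stand-in is used so the sketch runs without the Arrow jars.

```ruby
# Hypothetical sketch: an Enumerable, Array-like wrapper over a Java ValueVector.
# None of these names are part of red-arrow.
module ArrowJRubySketch
  class VectorArray
    include Enumerable

    # vector: any object responding to get_value_count, is_null(i) and
    # get_object(i) (assumed to match JRuby's view of a ValueVector).
    def initialize(vector)
      @vector = vector
    end

    def length
      @vector.get_value_count
    end
    alias_method :size, :length

    def null?(index)
      @vector.is_null(index)
    end

    # Returns nil for null slots and for out-of-range indexes.
    def [](index)
      index += length if index.negative?
      return nil unless (0...length).cover?(index)

      null?(index) ? nil : @vector.get_object(index)
    end

    def each
      return to_enum(:each) { length } unless block_given?

      length.times { |i| yield self[i] }
      self
    end
  end
end

# Stand-in for a Java vector so the sketch runs on any Ruby implementation.
StandInVector = Struct.new(:values) do
  def get_value_count
    values.size
  end

  def is_null(index)
    values[index].nil?
  end

  def get_object(index)
    values[index]
  end
end

array = ArrowJRubySketch::VectorArray.new(StandInVector.new([1, nil, 2]))
p array.to_a # prints [1, nil, 2]
```

Under JRuby the stand-in would be replaced by a real vector such as an IntVector. Including Enumerable keeps the wrapper small, since map, select and to_a all come from each.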

eregon commented 1 year ago

Would using FFI work for this? If so it would be a single implementation/binding instead of multiple.

kou commented 1 year ago

FFI doesn't work...

If we use https://github.com/mvz/gir_ffi instead of https://github.com/ruby-gnome/ruby-gnome/tree/master/gobject-introspection only for non-CRuby implementations, we may be able to use Apache Arrow C++ as the bindings target.

headius commented 1 year ago

I am working on an example using the Java implementation! I'll have something to show you shortly.

headius commented 1 year ago

FYI I don't know whether to file this but the documentation on the Java impl is out of date; it shows installing 9.0.0 and then uses classes like RootAllocator that do not appear to exist in that version. The javadocs seem to point at 12.0.0 so I'm trying that.

headius commented 1 year ago

Oh, actually the docs just show arrow-memory-netty, but RootAllocator and the other classes are in arrow-memory (which probably gets installed as a dependency?). I just need to know the actual jars to load in JRuby. Getting close.

kou commented 1 year ago

We should update outdated documents. Could you open a separate issue for it with the outdated document's URL?

BTW, https://arrow.apache.org/cookbook/java/ may help you.

headius commented 1 year ago

Ok here's the JRuby version of the simple vector example. I'm trying to work out the best way to pull the jars and make it easily runnable for you:

java_import org.apache.arrow.memory.RootAllocator
java_import org.apache.arrow.vector.IntVector

begin
  allocator = RootAllocator.new
  int_vector = IntVector.new("fixed-size-primitive-layout", allocator)

  int_vector.allocate_new(3)
  int_vector.set(0, 1)
  int_vector.set_null(1)
  int_vector.set(2, 2)
  int_vector.set_value_count(3)
  puts "Vector created in memory: #{int_vector}"
ensure
  int_vector.close rescue nil
  allocator.close rescue nil
end

When all necessary dependency jars are loaded into JRuby (via the CLASSPATH environment variable or by requiring each jar), this should work.
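The begin/ensure cleanup in the script could also be expressed as a small block helper. This is only a sketch: closing is a hypothetical name, not something provided by JRuby or Arrow, and it assumes nothing about the resource beyond a close method. The Arrow usage is left in comments because it needs JRuby and the jars.

```ruby
# Hypothetical helper: yields the resource, then always calls close on it,
# whether the block returns normally or raises.
def closing(resource)
  yield resource
ensure
  resource.close
end

# Usage under JRuby with the Arrow jars loaded (not executed here):
#
#   closing(RootAllocator.new) do |allocator|
#     closing(IntVector.new("fixed-size-primitive-layout", allocator)) do |int_vector|
#       int_vector.allocate_new(3)
#       int_vector.set(0, 1)
#       int_vector.set_value_count(1)
#       puts "Vector created in memory: #{int_vector}"
#     end
#   end
```

Nesting the calls closes the vector before the allocator. Unlike the "rescue nil" form, this sketch does not swallow errors raised by close, which is a deliberate choice so that cleanup problems stay visible.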

headius commented 1 year ago

Success! Though I think it would be better to set up the proper jar-dependencies logic instead of hand-requiring these jars.

After installing arrow-vector and arrow-memory-netty like this:

$ mvn dependency:get -DgroupId=org.apache.arrow -DartifactId=arrow-vector -Dversion=12.0.0
...
$ mvn dependency:get -DgroupId=org.apache.arrow -DartifactId=arrow-memory-netty -Dversion=12.0.0
...

I was able to run the following script (the slf4j errors are likely because I just don't have the right jars for it loaded):

require '~/.m2/repository/org/apache/arrow/arrow-vector/12.0.0/arrow-vector-12.0.0.jar'
require '~/.m2/repository/org/apache/arrow/arrow-memory-core/12.0.0/arrow-memory-core-12.0.0.jar'
require '~/.m2/repository/org/apache/arrow/arrow-memory-netty/12.0.0/arrow-memory-netty-12.0.0.jar'
require '~/.m2/repository/org/apache/arrow/arrow-format/12.0.0/arrow-format-12.0.0.jar'
require '~/.m2/repository/io/netty/netty-buffer/4.1.90.Final/netty-buffer-4.1.90.Final.jar'
require '~/.m2/repository/io/netty/netty-common/4.1.90.Final/netty-common-4.1.90.Final.jar'
require '~/.m2/repository/com/google/flatbuffers/flatbuffers-java/1.12.0/flatbuffers-java-1.12.0.jar'
require '~/.m2/repository/org/slf4j/slf4j-api/1.7.36/slf4j-api-1.7.36.jar'

java_import org.apache.arrow.memory.RootAllocator
java_import org.apache.arrow.vector.IntVector

begin
  allocator = RootAllocator.new
  int_vector = IntVector.new("fixed-size-primitive-layout", allocator)

  int_vector.allocate_new(3)
  int_vector.set(0, 1)
  int_vector.set_null(1)
  int_vector.set(2, 2)
  int_vector.set_value_count(3)
  puts "Vector created in memory: #{int_vector}"
ensure
  int_vector.close rescue nil
  allocator.close rescue nil
end

Running it with the requisite --add-opens flag that the arrow Java bindings need:

$ jruby -J--add-opens -Jjava.base/java.nio=ALL-UNNAMED arrow-vector.rb
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Vector created in memory: [1, null, 2]

So that's a basic start!
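For anyone who wants to launch the script from Ruby (a Rakefile or test runner, say), the command line above can be assembled programmatically. This is a sketch only: jruby_command is a hypothetical helper name, and the only flags it produces are the ones already shown in the command above.

```ruby
# Hypothetical helper: builds the argv for running a script under JRuby with
# one "-J--add-opens -J<target>" pair per opened package.
def jruby_command(script, opens: ["java.base/java.nio=ALL-UNNAMED"])
  jvm_flags = opens.flat_map { |target| ["-J--add-opens", "-J#{target}"] }
  ["jruby", *jvm_flags, script]
end

puts jruby_command("arrow-vector.rb").join(" ")
# To actually run it (requires jruby on PATH):
#   system(*jruby_command("arrow-vector.rb"))
```

Returning an array rather than a single string lets system run the command without going through a shell, so no quoting is needed.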

A few things to improve before moving forward:

I'm glad this was reasonably easy to get working. How do you want to proceed?

Edit: fixed typos and removed slf4j jars that weren't helping the warning.

headius commented 1 year ago

The only thing that could be updated in the documentation is the version number; 12.0.0 is the latest, but this page shows how to install 9.0.0:

https://arrow.apache.org/docs/java/install.html

The rest of my issues were just because I was trying not to use jar-dependencies and manually requiring all the jars.
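Until jar-dependencies is wired up, the hand-written require lines can at least be generated, since they all follow the standard local Maven repository layout (group ID as directories, then artifact ID, version, and artifact-version.jar). The sketch below assumes that layout and the default ~/.m2/repository location used in the script above; maven_jar_path and ARROW_JAR_COORDINATES are hypothetical names introduced here for illustration.

```ruby
# Hypothetical helper (not part of JRuby or jar-dependencies): builds the path
# of a jar under a local Maven repository from its coordinates.
def maven_jar_path(group_id, artifact_id, version, repository: "~/.m2/repository")
  File.join(repository, *group_id.split("."), artifact_id, version,
            "#{artifact_id}-#{version}.jar")
end

# Coordinates taken from the require lines in the script above.
ARROW_JAR_COORDINATES = [
  ["org.apache.arrow", "arrow-vector", "12.0.0"],
  ["org.apache.arrow", "arrow-memory-core", "12.0.0"],
  ["org.apache.arrow", "arrow-memory-netty", "12.0.0"],
  ["org.apache.arrow", "arrow-format", "12.0.0"],
  ["io.netty", "netty-buffer", "4.1.90.Final"],
  ["io.netty", "netty-common", "4.1.90.Final"],
  ["com.google.flatbuffers", "flatbuffers-java", "1.12.0"],
  ["org.slf4j", "slf4j-api", "1.7.36"],
].freeze

jar_paths = ARROW_JAR_COORDINATES.map { |coordinates| maven_jar_path(*coordinates) }

# Only JRuby can load jars with require, so other engines just print the list.
if RUBY_ENGINE == "jruby"
  jar_paths.each { |path| require path }
else
  puts jar_paths
end
```

This still assumes the jars were already fetched with mvn dependency:get, and it does not resolve transitive dependencies, which is the part jar-dependencies is meant to handle.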

kou commented 1 year ago

OK. I've opened a new issue for the documentation: #35602

kou commented 1 year ago

I confirmed that the command lines and script you provided work in my environment too!

Could you open a pull request that includes the following?

Then I'll push some commits to the pull request to integrate the current Ruby implementation and CI configurations.

We can work on "Something using JRuby's FFI or Project Panama's native memory access might make sense." to avoid --add-opens after we merge the first pull request.

kou commented 1 year ago

@eregon If you're interested in Red Arrow for TruffleRuby, please open a new issue for it. I have an idea for it. The current gobject-introspection gem generates bindings at run-time. I think that we can improve the gem to generate Ruby scripts that use Fiddle to call functions defined in C. It will work with TruffleRuby.

headius commented 1 year ago

A Fiddle/FFI version would also work for JRuby, but the shortest path is probably to simply use or wrap the Java API. I will continue along that path for now.

kou commented 1 year ago

I think that JRuby should use the Java API for ease of installation. If JRuby uses a Fiddle-based approach, JRuby users need to install the C++ and C libraries (*.so/*.dylib/*.dll) instead of *.jar files.
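The split discussed in this thread could be expressed as a per-engine backend choice. The following is purely a hypothetical sketch: arrow_backend_for and the backend symbols are invented for illustration, and the TruffleRuby branch reflects only the Fiddle idea floated above, not anything that exists today.

```ruby
# Hypothetical sketch: pick a bindings backend from the running Ruby engine.
def arrow_backend_for(engine = RUBY_ENGINE)
  case engine
  when "jruby"
    :java                  # wrap the Arrow Java jars
  when "truffleruby"
    :fiddle                # generated Fiddle bindings (an idea, not implemented)
  else
    :gobject_introspection # the existing native-extension path
  end
end

puts "Backend for #{RUBY_ENGINE}: #{arrow_backend_for}"
```

Keeping the decision in one place would let the rest of the Ruby API stay the same regardless of which implementation sits underneath.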