apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/

[Ruby] Improve Ruby's GC integration #40881

Open kou opened 3 months ago

kou commented 3 months ago

Describe the enhancement requested

Ruby doesn't know how much memory is used by Apache Arrow C++/GLib, so it can't detect a good time to run its GC.

We can give Ruby a hint about the current memory usage by calling rb_gc_adjust_memory_usage(). With this, Ruby will run its GC at the right time.
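For example, here is a minimal sketch showing the symptom from the Ruby side. It assumes the red-arrow gem and a large Arrow IPC file at a hypothetical path `data.arrow`: Ruby's own memory counters barely move even though Arrow is holding large native buffers, so the GC sees no reason to run.

```ruby
require "arrow"

before = GC.stat(:malloc_increase_bytes)
table = Arrow::Table.load("data.arrow")   # hypothetical large file; buffers live in Arrow C++
after = GC.stat(:malloc_increase_bytes)

puts "rows loaded: #{table.n_rows}"
puts "memory growth visible to Ruby's GC: #{after - before} bytes"
# The reported growth stays tiny compared to the real buffer size because
# Arrow's allocations bypass Ruby's accounting, so Ruby has no hint that it
# should run GC; rb_gc_adjust_memory_usage() would provide that hint from
# the bindings.
```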

Component(s)

Ruby

kou commented 3 months ago

Use case: #40798

datbth commented 2 months ago

Question

Hi, may I ask if there is any planning/estimation for this yet? What does the effort look like? Would you need any help?

Background

I'm facing this while trying to fetch data from an Arrow Flight server into Ruby. The testing script looks like this:

```ruby
client = ArrowFlight::Client.new(location)
reader = client.do_get(payload) # server arrow has 9 cols, 512205 rows, 50000 records per chunk/batch
GC.disable
data = []
t = Benchmark.realtime do
  while (chunk = reader.read_next)
    chunk_records = chunk.data.raw_records
    data += chunk_records
  end
end
```

Result of `t` (tested on Ruby 2.7.4 and 3.2.3; 4 runs per case):

* With `GC.disable`: 0.31s
* Without `GC.disable`: 0.66s
* Running in 4 threads (each thread runs the whole code above):
  * With `GC.disable`: between 0.61s and 1.1s each
  * Without `GC.disable`: between 2.1s and 3.5s each

Sorry if I'm bloating this issue, but I want to add a bit more clarity on the importance of this enhancement. I consider this a blocker for putting my Arrow integration into production.
kou commented 2 months ago

This will be included in 17.0.0. (16.0.0 will not include this.)

It seems that this is not related to your use case. I think that raw_records is the relevant part instead: raw_records creates many Ruby objects, which makes GC heavy. Why do you want to use raw_records? In general, you should process Arrow data without raw_records. raw_records is optimized, but the conversion copies data, which is exactly what Apache Arrow tries to avoid.
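For example, here is a minimal sketch of processing the stream without raw_records: collect the record batches, combine them into a single Arrow::Table, and convert only the columns the post-processing actually needs. `reader` is the Flight reader from the snippet above, "id" is a hypothetical column name, and the exact forms accepted by Arrow::Table.new should be checked against your red-arrow version.

```ruby
batches = []
while (chunk = reader.read_next)
  batches << chunk.data              # Arrow::RecordBatch; no per-row Ruby objects
end

# Combine all batches into a single table; the data stays in Arrow buffers.
table = Arrow::Table.new(batches.first.schema, batches)

# Process the table with Arrow operations (slice, filter, column selection, ...)
# and convert only what is really needed, e.g. a single column:
# ids = table["id"].data.to_a        # "id" is a hypothetical column name
```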

datbth commented 2 months ago

Oh, OK, thank you for your response. In my case, I'm doing some (legacy) post-processing in Ruby. I will study Ruby's GC more.

Thank you!