apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

How to convert a CSV file to RubyArray faster? #5037

Closed kojix2 closed 5 years ago

kojix2 commented 5 years ago

Hello.

I have a question about RedArrow (Ruby Binding).

My goal is to read a CSV file with RedArrow and create Ruby arrays from its columns. But it is not as fast as I expected.

Here is my benchmark.

uc.csv is a 1 GB CSV file. (I downloaded a TSV file from the International Inflammatory Bowel Disease Genetics Consortium, renamed some columns, and converted it to CSV.)

require 'benchmark'
require 'csv'
require 'arrow'        # Red Arrow
require 'fastest-csv'  # provides FastestCSV

# For verification
# t = Arrow::Table.load("uc.csv")
# chr  = t[:chr].to_a
# pos  = t[:pos].to_a
# pval = t[:pval].to_a
# correct_array = [chr, pos, pval]

Benchmark.bm 12 do |r|

  r.report "RedArrow" do
    t = Arrow::Table.load("uc.csv")
    chr  = t[:chr].to_a
    pos  = t[:pos].to_a
    pval = t[:pval].to_a
    # puts [chr, pos, pval] == correct_array
  end

  r.report "FastestCSV" do
    chr  = []
    pos  = []
    pval = []
    FastestCSV.foreach("uc.csv") do |row|
      chr  << row[0].to_i
      pos  << row[2].to_i
      pval << row[10].to_f
    end
    # remove headers
    chr.shift; pos.shift; pval.shift
    # puts [chr, pos, pval] == correct_array
  end

  r.report "CSV" do
    chr  = []
    pos  = []
    pval = []
    CSV.foreach("uc.csv", headers: true) do |row|
      chr  << row[0].to_i
      pos  << row[2].to_i
      pval << row[10].to_f
    end
    # puts [chr, pos, pval] == correct_array
  end
end

Result

Fastest-csv is the fastest, and RedArrow is the slowest.

                   user     system      total        real
 RedArrow     329.295902   2.307186 331.603088 (319.347072)
 FastestCSV    22.829663   0.335860  23.165523 ( 23.178630)
 CSV          113.367805   0.363798 113.731603 (113.775625)

I am not familiar with RedArrow, so I may have made an elementary mistake. Any suggestions are welcome. Thank you.

kou commented 5 years ago

Use Arrow::Table#raw_records.

t.select_columns(:chr, :pos, :pval).raw_records

kojix2 commented 5 years ago

I tried the benchmark again.

Benchmark.bm 12 do |r|

  r.report "RedArrow" do
    t = Arrow::Table.load("uc.csv")
    t.select_columns(:chr, :pos, :pval).raw_records
    # puts [chr, pos, pval] == correct_array
  end

  # (the FastestCSV and CSV reports are unchanged from the first benchmark)
end

                   user     system      total        real
RedArrow      11.680318   2.015998  13.696316 (  1.622647)
FastestCSV    18.993417   0.357121  19.350538 ( 19.361992)
CSV          112.687737   0.262550 112.950287 (112.994485)

Now Red Arrow is faster than fastest-csv. The result is very impressive.

RedArrow doesn't consume much memory, and any column can be converted to a Ruby array when needed. It will definitely be useful.

Thank you very much!
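
To put the two result tables in perspective, here is a rough sketch that computes the ratios from the numbers reported above. The constant and method names are made up for this illustration. The `total / real` column is CPU time divided by wall-clock time. A value well above 1 means the work used more than one core at once, which would be consistent with multi-threaded CSV parsing, although that is an inference from the numbers and not something stated in this thread.

```ruby
# Figures copied from the benchmark output in this thread (seconds).
RED_ARROW_TO_A_REAL = 319.347072 # first run: t[:column].to_a per column

SECOND_RUN = {
  "RedArrow"   => { total: 13.696316,  real: 1.622647 },
  "FastestCSV" => { total: 19.350538,  real: 19.361992 },
  "CSV"        => { total: 112.950287, real: 112.994485 },
}.freeze

# How many times longer `slower` takes than `faster` (hypothetical helper).
def speedup(slower, faster)
  slower / faster
end

baseline = SECOND_RUN["RedArrow"][:real]

SECOND_RUN.each do |name, t|
  # Wall time relative to RedArrow, then CPU time per second of wall time.
  puts format("%-12s %6.1fx RedArrow wall time, total/real = %.2f",
              name, speedup(t[:real], baseline), t[:total] / t[:real])
end

puts format("to_a vs raw_records (wall time): %.0fx",
            speedup(RED_ARROW_TO_A_REAL, baseline))
```

Only the "real" column is compared between libraries here, since that is the time a user actually waits.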

kojix2 commented 5 years ago

Oh, I have to add transpose. It looks a bit messy.

chr, pos, pval = t.select_columns(:chr, :pos, :pval).raw_records.transpose

This does not affect performance, though.

                   user     system      total        real
RedArrow      12.714012   3.208401  15.922413 (  2.861562)
FastestCSV    18.009612   0.315168  18.324780 ( 18.335624)
CSV          110.762741   0.387075 111.149816 (111.192139)
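
For readers wondering why `transpose` is needed: judging from this thread, `raw_records` returns the data row by row (one inner array per record), while the goal was one array per column. The sketch below shows the reshaping step on a small hand-written stand-in, so it runs without Red Arrow. The values are invented for illustration and are not taken from uc.csv, and `columns_from_rows` is a hypothetical helper name.

```ruby
# Stand-in for the result of raw_records: one inner array per row,
# in the order [chr, pos, pval]. Values are made up for illustration.
ROWS = [
  [1, 1_000, 0.5],
  [1, 2_000, 0.01],
  [2, 3_000, 0.2],
].freeze

# Turn row-wise records into column-wise arrays.
# Array#transpose returns a new array and raises IndexError
# if the rows do not all have the same length.
def columns_from_rows(rows)
  rows.transpose
end

chr, pos, pval = columns_from_rows(ROWS)

p chr
p pos
p pval
```

Note that `transpose` builds a second array of arrays, so the row-wise and column-wise copies both exist in memory until the first one is garbage collected.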