SciRuby / daru

Data Analysis in RUby

How to speed up creating a dataframe for a large dataset #546

Open rvyas opened 2 years ago

rvyas commented 2 years ago

Hi, I am creating a dataframe from 3.5M records with 25 vectors, and it is taking over 1 min.

# Construct data: ~3.5M records, each hash with the same ~25 keys.
data = [
  {m: 'abc', a: 1.2, b: 2.1, c: 2.3},
  {m: 'xyz', a: 1.1, b: 22.1, c: 223.3}
  ...
]

# Convert from an array of hashes to a hash of arrays
vc = {}
data.first.keys.each do |ky|
  vc[ky] = data.map{|dt| dt[ky]}
end

require 'daru'
require 'benchmark'

Benchmark.bm do |x|
  x.report("df array_of_hash: ") { Daru::DataFrame.new(data, clone: false) }
  x.report("df hash_of_array: ") { Daru::DataFrame.new(vc, clone: false) }
end

##
#                              user     system      total        real
# df array_of_hash:   86.398855   0.311986  86.710841 ( 86.850770)
# df hash_of_array:   21.745897   0.027261  21.773158 ( 21.814447)

After converting the data (which itself also took about a minute), creation is a little faster, but 21 seconds is still a lot of time to create a dataframe.
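For the conversion step itself, a single pass over the data avoids scanning all 3.5M records once per key (the map above does ~25 full scans). A sketch, assuming every row has the same keys:

# Preallocate one array per key, then fill all of them in a single pass.
keys = data.first.keys
vc = keys.to_h { |ky| [ky, Array.new(data.size)] }
data.each_with_index do |row, i|
  keys.each { |ky| vc[ky][i] = row[ky] }
end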

Any ideas how to speed this up?

kojix2 commented 2 years ago

Unfortunately, daru is currently without a developer. I recommend that you create your own fork, give daru another name such as daru2 and take over the project, or use one of the following alternatives:

https://github.com/ankane/rover
https://github.com/red-data-tools/red_amber

The former is recommended for general use. The latter is a new data frame library with Apache Arrow as its backend; its functionality may be expanded in the future.
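For reference, both can be built straight from the hash of arrays above. A minimal sketch, assuming the rover-df and red_amber gems are installed; exact performance will vary:

require 'rover'
require 'red_amber'

# Rover accepts a hash of arrays (or an array of hashes) directly.
rover_df = Rover::DataFrame.new(vc)

# RedAmber stores columns as Apache Arrow arrays.
arrow_df = RedAmber::DataFrame.new(vc)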