SciRuby / daru

Data Analysis in RUby

How to speed up creating a dataframe for a large dataset #546

Open rvyas opened 2 years ago

rvyas commented 2 years ago

Hi, I am creating a dataframe from 3.5M records with 25 vectors, and it is taking over 1 min.

# Construct data: ~3.5M records, each hash with the same ~25 keys.
data = [
  {m: 'abc', a: 1.2, b: 2.1, c: 2.3},
  {m: 'xyz', a: 1.1, b: 22.1, c: 223.3}
  ...
]

# Convert from an array of hashes to a hash of arrays
vc = {}
data.first.keys.each do |ky|
  vc[ky] = data.map{|dt| dt[ky]}
end

require 'daru'
require 'benchmark'

Benchmark.bm do |x|
  x.report("df array_of_hash: ") { Daru::DataFrame.new(data, clone: false) }
  x.report("df hash_of_array: ") { Daru::DataFrame.new(vc, clone: false) }
end

##
#                              user     system      total        real
# df array_of_hash:   86.398855   0.311986  86.710841 ( 86.850770)
# df hash_of_array:   21.745897   0.027261  21.773158 ( 21.814447)

After converting the data (which itself also took about a minute), creation is a little faster, but 21 seconds is still a lot of time to create a dataframe.
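For the conversion step itself, a single pass over the data avoids scanning all 3.5M records once per key (the map above does ~25 full scans). A sketch, assuming every row has the same keys:

# Preallocate one array per key, then fill all of them in a single pass.
keys = data.first.keys
vc = keys.to_h { |ky| [ky, Array.new(data.size)] }
data.each_with_index do |row, i|
  keys.each { |ky| vc[ky][i] = row[ky] }
end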

Any ideas how to speed this up?

kojix2 commented 2 years ago

Unfortunately, daru is currently without a developer. I recommend that you create your own fork, give daru another name such as daru2 and take over the project, or use one of the following alternatives:

https://github.com/ankane/rover
https://github.com/red-data-tools/red_amber

The former is recommended for general use. The latter is a new data frame library with Apache Arrow as its backend; its functionality may be expanded in the future.
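For reference, both can be built straight from the hash of arrays above. A minimal sketch, assuming the rover-df and red_amber gems are installed; exact performance will vary:

require 'rover'
require 'red_amber'

# Rover accepts a hash of arrays (or an array of hashes) directly.
rover_df = Rover::DataFrame.new(vc)

# RedAmber stores columns as Apache Arrow arrays.
arrow_df = RedAmber::DataFrame.new(vc)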