SciRuby / daru

Data Analysis in RUby
BSD 2-Clause "Simplified" License

Faster loading from CSV files. #31

Open gnilrets opened 9 years ago

gnilrets commented 9 years ago

I'm really loving the idea of this project. My only concern is performance. Reading a 4,000-line CSV file is taking 7s (WAY too long if I'm going to try to scale to even small data sizes on the order of 100k rows). I was going to try using NMatrix, but I don't see how to use it when reading from a CSV. For example, how could I convert something like this to use NMatrix?

df = Daru::DataFrame.from_csv 'myfile.txt', { headers: true, col_sep: "\t", encoding: "ISO-8859-1:UTF-8" }

Any other ideas on how to improve performance?
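(For reference, a minimal sketch of what "using NMatrix" could look like here — not something daru does for you; it assumes the stdlib CSV parser still does the reading and that every column is numeric, with the file name and options mirroring the snippet above:)

```ruby
require 'csv'
require 'nmatrix'

# Parse with the stdlib CSV reader first; NMatrix has no CSV loader of its own,
# so the Ruby-level parsing cost stays the same.
rows = CSV.read('myfile.txt', headers: true, col_sep: "\t",
                encoding: 'ISO-8859-1:UTF-8').map { |row| row.fields.map(&:to_f) }

# Build a dense float matrix from the parsed rows.
matrix = NMatrix.new([rows.size, rows.first.size], rows.flatten, dtype: :float64)
```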

v0dro commented 9 years ago

Daru is currently using the default Ruby CSV library (written in pure Ruby) for reading CSV files, so that's a bottleneck we can't avoid.

But there are a bunch of options that you can specify to daru for speed, mainly things that avoid cloning data or populating missing values.

For example, set lazy_update to true. Daru.lazy_update = true will delay updating the dataframe's missing-values tracking mechanism until you call #update. See this notebook.

Passing the clone: false option will avoid cloning the columns that have been read from the CSV file. It is true by default, so you might want to change that.

Here is an example of daru being used for larger data.
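(A minimal sketch putting those two options together, assuming the same tab-separated file as above; whether clone: false is honored by from_csv may depend on the daru version:)

```ruby
require 'daru'

# Delay the missing-data bookkeeping until we explicitly ask for it.
Daru.lazy_update = true

df = Daru::DataFrame.from_csv 'myfile.txt',
  headers: true, col_sep: "\t", encoding: "ISO-8859-1:UTF-8", clone: false

# ... filter / transform df ...

# Recompute the missing-value positions once, when we actually need them.
df.update
```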

gnilrets commented 9 years ago

Thanks for the quick response. I tried some of your suggestions, but they didn't seem to help. The best I could do was convert my CSV into a hash of arrays and create the dataframe from that (a speedup of about 2x, which is still pretty slow compared to just reading the CSV, which itself is pretty slow compared to non-Ruby CSV readers). I put up a gist with results here if you're interested: https://gist.github.com/gnilrets/611d85d5cb87fa31bb8a
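(A rough sketch of the hash-of-arrays approach described here — the real benchmark lives in the gist above; column handling is simplified:)

```ruby
require 'csv'
require 'daru'

# Accumulate each column as an array keyed by its header.
columns = Hash.new { |h, k| h[k] = [] }

CSV.foreach('myfile.txt', headers: true, col_sep: "\t") do |row|
  row.each { |header, value| columns[header] << value }
end

# Build the dataframe straight from the hash of arrays, skipping the extra copy.
df = Daru::DataFrame.new(columns, clone: false)
```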

I've struggled with getting Ruby to perform with larger data sets (https://github.com/gnilrets/Remi), and I worry that the language just isn't up to the task. Would love to be proven wrong.

zverok commented 8 years ago

@gnilrets can you provide your test CSV (if it is not some very private data)? I'm checking performance here and there during refactoring, and there may be some things that can be improved immediately.

gnilrets commented 8 years ago

I can't supply the CSV I used in that test. But here's some publicly available data from Medicare (too big to attach directly, but still only a few tens of thousands of records): https://www.medicare.gov/download/DownloaddbInterim.asp

gnilrets commented 8 years ago

I got similar relative benchmarks using some of both the wide and the long datasets. Basically, Daru seems to take about 3-4x as long as just parsing the CSV. My suspicion is that Daru uses CSV#by_col. If we process rows one by one to load a hash of arrays, we can improve the load process by about 2x.
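(A rough sketch of the kind of comparison behind those numbers, assuming a local copy of one of the Medicare files — the path is a placeholder and timings will obviously vary:)

```ruby
require 'benchmark'
require 'csv'
require 'daru'

file = 'medicare.csv' # placeholder path

Benchmark.bm(20) do |x|
  x.report('raw CSV parse:') { CSV.read(file, headers: true) }
  x.report('Daru from_csv:') { Daru::DataFrame.from_csv(file, headers: true) }
end
```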

v0dro commented 8 years ago

Alternatively, we can create a C extension over libcsv as an nmatrix plugin and use that for loading data into dataframes.

https://github.com/SciRuby/nmatrix/issues/407

zverok commented 8 years ago

Yes, but a C extension is always something of a "last resort" (and the JRuby folks will hate it, I suppose), so my first instinct is always to try profiling/optimizing the Ruby.

So far, I've investigated it to the point where it's clear that the CSV library itself performs pretty badly when given the :numeric converter. I'll try to invent something simple-yet-clever around it :)
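(A quick way to see the effect described here, contrasting a plain parse with one using the stdlib :numeric converter; the file path is a placeholder:)

```ruby
require 'benchmark'
require 'csv'

file = 'myfile.csv' # placeholder path

Benchmark.bm(25) do |x|
  x.report('CSV, no converters:')      { CSV.read(file, headers: true) }
  x.report('CSV, :numeric converter:') { CSV.read(file, headers: true, converters: :numeric) }
end
```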

v0dro commented 8 years ago

No, we'll keep it MRI-specific. JRuby should have another library for CSV importing (I think jCSV from Rodrigo Botafogo can do the job: https://github.com/rbotafogo/jCSV).

v0dro commented 8 years ago

#170