gnilrets opened this issue 9 years ago
Daru is currently using the default Ruby CSV library (written in pure Ruby) for reading CSV files, so that's a bottleneck we can't avoid.
But there are a bunch of options you can specify to daru for speed, mainly things that avoid cloning data or populating missing values.
For example, set the variable lazy_update to true: Daru.lazy_update = true will delay updating the dataframe's missing-value tracking mechanism until you call #update. See this notebook.
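A minimal sketch of how that might look (data.csv is just a placeholder path):

```ruby
require 'daru'

# Defer the missing-data bookkeeping while the frame is being built,
# then recompute it once at the end with #update.
Daru.lazy_update = true

df = Daru::DataFrame.from_csv('data.csv') # placeholder file
df.update                                 # refresh missing-value positions once

Daru.lazy_update = false                  # restore the default behaviour
```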
Passing the clone: false option will avoid cloning the columns that have been read from the CSV file. It is true by default, so you might want to change that.
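For instance, something like this (a sketch; I'm assuming from_csv forwards the option to the DataFrame constructor):

```ruby
require 'daru'

# Build the frame without cloning the parsed CSV columns.
# 'data.csv' is a placeholder path.
df = Daru::DataFrame.from_csv('data.csv', clone: false)
```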
Here is an example of daru being used for larger data.
Thanks for the quick response. I tried some of your suggestions, but they didn't seem to help. The best I could do was convert my CSV into a hash of arrays and create the dataframe from that (a speed-up of about 2x, which is still pretty slow compared to just reading the CSV, which is itself pretty slow compared to non-Ruby CSV readers). I put up a gist with results here if you're interested: https://gist.github.com/gnilrets/611d85d5cb87fa31bb8a
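In case it helps, the hash-of-arrays approach looks roughly like this (a sketch with a placeholder path, not the exact code from the gist):

```ruby
require 'csv'
require 'daru'

# Accumulate each column in an array keyed by header...
columns = Hash.new { |h, k| h[k] = [] }

CSV.foreach('data.csv', headers: true) do |row|
  row.each { |header, value| columns[header] << value }
end

# ...then build the DataFrame from the hash in one shot.
df = Daru::DataFrame.new(columns)
```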
I've struggled with getting Ruby to perform with larger data sets (https://github.com/gnilrets/Remi), and I worry that the language just isn't up to the task. Would love to be proven wrong.
@gnilrets can you provide your test CSV (if it is not some very private data)? I'm now checking performance here and there during refactoring, and maybe there are some things that can be improved immediately.
I can't supply the CSV I used in that test. But here's some publicly available data from Medicare (too big to attach directly, but still only a few tens of thousands of records): https://www.medicare.gov/download/DownloaddbInterim.asp
I got similar relative benchmarks using some of both the wide and long datasets. Basically, Daru seems to take about 3-4x as long as just parsing the CSV. My suspicion is that Daru uses CSV#by_col. If we process rows one-by-one to load a hash of arrays, we can improve the load process by about 2x.
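A minimal way to reproduce that kind of comparison (not the exact script from the gist; 'medicare.csv' stands in for one of the files above):

```ruby
require 'benchmark'
require 'csv'
require 'daru'

path = 'medicare.csv' # placeholder for one of the Medicare files

Benchmark.bm(10) do |x|
  x.report('CSV only') { CSV.read(path, headers: true) }
  x.report('Daru')     { Daru::DataFrame.from_csv(path) }
end
```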
Alternatively, we could create a C extension over libcsv as an nmatrix plugin and use that for loading data into dataframes.
Yes, but a C extension is always somewhat of a "last resort" (and the JRuby guys will hate it, I suppose), so my first instinct is always to try profiling/optimizing the Ruby.
So far I've investigated it to the point where I can see that the CSV library itself performs pretty badly when given the :numeric converter. I'll try to invent something simple-yet-clever around it :)
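For illustration, the slow path is roughly CSV.read with converters: :numeric; one possible workaround is to read plain strings and convert only the fields that look numeric. A rough sketch (placeholder path, and the regexp-based conversion is just one idea, not what will necessarily land in Daru):

```ruby
require 'csv'

# Slow path: stdlib CSV applies the :numeric converter to every field.
with_converter = CSV.read('data.csv', headers: true, converters: :numeric)

# Possible workaround: read plain strings, then convert only values that
# actually look numeric.
NUMERIC = /\A-?\d+(\.\d+)?\z/

raw = CSV.read('data.csv', headers: true)
converted = raw.map do |row|
  row.to_h.transform_values do |v|
    if v.is_a?(String) && v.match?(NUMERIC)
      v.include?('.') ? v.to_f : v.to_i
    else
      v
    end
  end
end
```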
No, we'll keep it MRI-specific. JRuby should have another library for CSV importing (I think jCSV from Rodrigo Botafogo can do the job: https://github.com/rbotafogo/jCSV).
I'm really loving the idea of this project. My only concern is performance. Reading from a 4,000-line CSV file is taking 7s (WAY too long if I'm going to try to scale to even small data sizes on the order of 100k rows). I was going to try using NMatrix, but I don't see how I could use it when reading from a CSV. For example, how could I convert something like this to use NMatrix?
Any other ideas on how to improve performance?