sivagollapalli opened this issue 7 years ago
Unfortunately, that is for now "how it is". Nothing is broken in the library itself that can easily be optimized. When profiling this case, most of the time is spent in Ruby's CSV parser, in things like converting strings to numbers. There are some plans/efforts to replace Ruby's default CSV library with something faster, but none of them are close to implementation.
I think we should shift to a C based CSV parser like paratext. See https://github.com/SciRuby/daru/issues/170
What about smarter_csv?
Here is a workaround that gives a ~10x speed-up over the default :from_csv method. The CSV file in this case is ~65k rows and 40 columns. I expect the performance improvement to fall off as the CSV file size increases, because this method loads the whole file into memory.
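The exact patch isn't shown here, but the general shape of such a workaround — read the whole file into memory, parse it once without converters, and coerce only the columns you know are numeric — can be sketched with stdlib CSV alone. The column names and the `to_f` coercion below are illustrative assumptions, not the actual patch:

```ruby
require 'csv'

# Stand-in for File.read('path/to/water.csv') -- a small inline sample.
data = "id,value\n1,10.5\n2,20.0\n"

# One-pass parse with no converters: every field stays a String.
table = CSV.parse(data, headers: true)

# Coerce only the columns known to be numeric, instead of letting the
# :numeric converter inspect every single field.
values = table['value'].map { |v| v.to_f }

p values # => [10.5, 20.0]
```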
@athityakumar can you explore if it is possible to optimize from_csv using @info-rchitect's method?
@v0dro - Sure, I'll try re-creating the benchmarks (also with smarter_csv) and get back in a couple of days.
Hey all. Please find the benchmark results for importing from CSV below, which were run on @sivagollapalli's input CSV file, which should create a <550242*8> Daru::DataFrame.
@v0dro - As suggested by @zverok, the default options of #from_csv (mainly converters: :numeric) are slowing it down. Would it be better if it weren't set as the default?

The approach suggested by @info-rchitect does come out to be at least 5-6x faster than the existing #from_csv method. But that's because it's not used with converters (convert to numeric, etc., which is set by default in #from_csv). It ends up scaling to very close to the existing #from_csv when used with converters: :numeric, header_converters: :symbol.
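To make the converter cost concrete, here is a minimal stdlib-only illustration of what converters: :numeric changes: without it every field is a cheap String; with it every single field gets inspected and coerced, which is where the profiling above shows the time going:

```ruby
require 'csv'

data = "a,b\n1,2.5\n3,4.5\n"

plain = CSV.parse(data, headers: true)
typed = CSV.parse(data, headers: true, converters: :numeric)

p plain['a'] # => ["1", "3"]   (Strings; no per-field inspection)
p typed['a'] # => [1, 3]       (Integers; each field tested and coerced)
p typed['b'] # => [2.5, 4.5]   (Floats)
```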
In this regard, rcsv seems like a better alternative, as it supports column-specific options like converters.

Though smarter_csv gives lots of user options, it is by itself much slower at parsing than stdlib csv. For using smarter_csv with parallel processing of chunks, the chunks are available only after the whole CSV file has been read, and it isn't async yet (see https://github.com/tilo/smarter_csv/issues/66). smarter_csv + Parallel does speed up parsing by 5-6 times (compared to without Parallel), and 1.5-2 times compared to the existing #from_csv. The one shown in the benchmark is with Parallel.
I think it'd be great to have support for both fastcsv and rcsv in daru (now that daru-io supports partial requires), while still keeping stdlib csv as the default parsing gem. Both are almost equally fast and 20x faster than the existing #from_csv. 🎉
CSV Importer Comparisons
user system total real
existing #from_csv 99.380000 1.290000 100.670000 (142.255080)
modified #from_csv 17.340000 0.470000 17.810000 ( 40.721766)
smartercsv 65.400000 1.000000 66.400000 (116.842460)
fastcsv 7.660000 0.380000 8.040000 ( 13.105506)
rcsv 5.850000 0.210000 6.060000 ( 7.756945)
Also, have a look at this benchmark.
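For anyone wanting to reproduce numbers like the table above, a minimal harness with stdlib Benchmark looks roughly like the following. Only the two stdlib CSV cases are shown; the inline sample data is a stand-in for the real 550k-row file:

```ruby
require 'benchmark'
require 'csv'

# Small inline sample standing in for the real input file.
data = "a,b\n" + (1..1000).map { |i| "#{i},#{i}.5" }.join("\n")

Benchmark.bm(22) do |x|
  x.report('csv, no converters:') { CSV.parse(data, headers: true) }
  x.report('csv, :numeric:')      { CSV.parse(data, headers: true, converters: :numeric) }
end
```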
@sivagollapalli - For your current use-case, please try using rcsv / fastcsv as a workaround.
require 'daru'
require 'rcsv'

# Parse the whole file with rcsv, then build the DataFrame from the raw rows
df = Daru::DataFrame.rows Rcsv.parse(File.open('path/to/water.csv'))
require 'daru'
require 'fastcsv'

# Collect rows via FastCSV's streaming parser, then use the first row as the column order
all = []
File.open('path/to/water.csv') { |f| FastCSV.raw_parse(f) { |row| all.push row } }
df = Daru::DataFrame.rows all[1..-1], order: all[0]
Converters are definitely slow in CSV. So we should probably go around them (or around the entire library) for any reasonable speed improvement.
I wonder if :converters needs to be re-looked at from a design perspective. Currently :numeric is supported, but that's a bit of a special case IMHO. If you look at issue #353 regarding date column support in CSVs, should the date column support be a new :converter, or a date_columns argument to the from_csv function as mentioned in the report? I feel whatever solution is finally picked should support "converters" in a standard way, with perhaps some like numeric and date provided by Daru, while also allowing a user to add to these converters easily. The way to add/extend converters should be standardized and well documented. There may be other cases for converters, like IP addresses etc., which the user may have.
Regarding the numeric converter, I noticed that it tends to mix types a bit. While reading values such as 10.1, 20, 12.22, 15, 111.2, it ends up typing 20 and 15 as Integer while the rest are Float. Perhaps there needs to be more precise control over the type to avoid subtle bugs, e.g. a novice user who thinks the entire column is Float ends up getting 6 for 20/3 when they expected 6.67. Basically, I feel any converter should convert a column to only a mix of one "specific" type or nil, based on content.
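One way to get the column-uniform behaviour described above with stdlib CSV is a converter that coerces every numeric-looking field to Float (and anything unparseable to nil), so 20 comes in as 20.0 rather than an Integer. This is a sketch of the idea, not Daru's current behaviour:

```ruby
require 'csv'

# Whole column becomes Float-or-nil: no Integer/Float mixing.
strict_float = lambda do |field|
  Float(field) rescue nil
end

row = CSV.parse_line('10.1,20,12.22,15,111.2', converters: [strict_float])
p row                   # => [10.1, 20.0, 12.22, 15.0, 111.2]
p row.map(&:class).uniq # => [Float]
p row[1] / 3            # => 6.666666666666667 (not 6)
```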
Perhaps having a standardized way to handle converters would help us decouple converters and CSV reading performance concerns. This would also future proof us a bit in handling any new column types the users might need.
@parthm - Daru::DataFrame#from_csv's :converters option sets the default as :numeric and passes it on to stdlib CSV. Hence, I don't think date_columns should be clubbed with the :converters option, but should rather be a separate date_columns option.

I partly feel that de-coupling converters from CSV parsing might be better. But we also have to consider that someone who has been using a specific CSV parsing gem (say, rcsv) would feel more comfortable when the corresponding Daru::DataFrame#from_rcsv method's options are passed directly on to rcsv's options (like column-specific converters, which are already implemented in rcsv) rather than custom converter options.
I understand how automatic conversion could be useful for quick ad hoc data analysis, but as an ETL developer using Daru, I need to maintain tight control of my data types (e.g., I want my job to fail early if I get a CSV file containing a string in a date column; or, I want to make sure that leading zeroes are retained for business id columns). To this end, I have turned off all automatic type conversion for CSV reading. Everything is read in as a string, and I have built custom methods to enforce types.
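A minimal version of that fail-early approach, using stdlib CSV with all conversion off and a hand-rolled check per column. The column names and date format here are illustrative, not from the actual job:

```ruby
require 'csv'
require 'date'

# Read everything as String -- no converters at all.
table = CSV.parse("biz_id,due\n00042,2017-08-01\n", headers: true)

# Leading zeroes survive because biz_id was never coerced to Integer.
p table['biz_id'] # => ["00042"]

# Enforce the date column strictly: a malformed value raises immediately
# instead of silently staying a String.
due_dates = table['due'].map { |v| Date.strptime(v, '%Y-%m-%d') }
p due_dates.first.class # => Date
```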
@gnilrets do you expect csv "booleans" to convert to Ruby booleans or stay as strings? Pandas seems to default to converting to booleans while Daru seems to keep them as strings.
I guess if you're using automatic conversion, then yes, they should convert to booleans.
What do you mean by automatic conversion? I just used the standard Daru::DataFrame.from_csv(file) and expected that they would convert, but it seems they don't.
By default Daru uses automatic conversion from the core Ruby CSV reader. Sounds like it might not be doing so for booleans.
It looks like the Ruby CSV reader doesn't have a built-in converter for booleans. Should we handle this in Daru?
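In the meantime, stdlib CSV's converter registry (the CSV::Converters hash) lets a user register a named boolean converter themselves; something along these lines, hedged as a sketch (the accepted spellings of true/false are an assumption):

```ruby
require 'csv'

# Register a named boolean converter in stdlib CSV's converter registry;
# unrecognised values pass through unchanged.
CSV::Converters[:boolean] = lambda do |field|
  case field
  when /\Atrue\z/i  then true
  when /\Afalse\z/i then false
  else field
  end
end

row = CSV.parse_line('true,False,maybe', converters: [:boolean])
p row # => [true, false, "maybe"]
```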
@db579 works!
I just tried to read a CSV file of size 42.6 MB, which consists of 550241 records. When I tried with daru, here are the statistics:

If I do the same with Python, it shows as follows:

Since GitHub doesn't allow uploading a file larger than 10 MB, I couldn't attach it. But you can get the data from https://www.kaggle.com/venkatramakrishnan/india-water-quality-data