Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

Comparison with fast C++ csv parser #2634

Open HughParsonage opened 6 years ago

HughParsonage commented 6 years ago

In https://github.com/jimhester/readr/commit/33b793621c33b915e896fb3778c5a47152ccd73d @jimhester implements a proof-of-concept of https://github.com/ben-strasser/fast-cpp-csv-parser with impressive timings on a static 1.56 GB file (especially on the 'hot' second timing).

> system.time(y <- readr:::read_trip_fare(normalizePath("trip_fare_1.csv")))
   user  system elapsed 
  19.97    1.01   20.15 

> system.time(y <- data.table::fread(normalizePath("trip_fare_1.csv")))
|--------------------------------------------------|
|==================================================|
   user  system elapsed 
  23.88    0.75   18.24

> system.time(y <- readr:::read_trip_fare(normalizePath("trip_fare_1.csv")))
   user  system elapsed 
  12.81    1.08   12.91

> system.time(y <- data.table::fread(normalizePath("trip_fare_1.csv")))
|--------------------------------------------------|
|==================================================|
   user  system elapsed 
  24.36    0.66   17.92 

> dim(y)
[1] 14776615       11

The purpose of this issue is to recognize its performance and broadcast awareness of this parser, to anticipate comparisons with fread, and to gather what, if anything, can be learnt from this implementation. From what I understand, the function requires a lot of knowledge of the csv's structure well in advance of it being read. (In data.table parlance, perhaps, a 'fast and unfriendly file finagler'.) Nonetheless I believe there is a use-case for such a function: a kind of plain-text cached version could be very valuable if fast.

st-pasha commented 6 years ago

Thanks Hugh for bringing this to our attention. I wonder if you measured these timings yourself (and if so, what are the specs of the system where this was run), or found them on readr's blog somewhere (in which case, what version of data.table were they using?). In particular, what strikes me as odd is that the "user" time is very similar to the "elapsed" time. On my machine the user time is usually much higher, because fread utilizes multiple cores:

> system.time(fread("~/Downloads/trip_fare_1.csv"))
|--------------------------------------------------|
|==================================================|
   user  system elapsed 
170.614   9.984  24.392 
> system.time(fread("~/Downloads/trip_fare_1.csv"))
|--------------------------------------------------|
|==================================================|
   user  system elapsed 
118.323   6.500  16.293 
> system.time(fread("~/Downloads/trip_fare_1.csv"))
|--------------------------------------------------|
|==================================================|
   user  system elapsed 
115.234   6.260  15.853 

HughParsonage commented 6 years ago

Yes, I ran the timings myself. I used the latest dev version of data.table.

Rerunning today:


> system.time(fread("~/../Downloads/trip_fare/trip_fare_1.csv"))
|--------------------------------------------------|
|==================================================|
   user  system elapsed 
  29.80    1.38   24.34 

> system.time(fread("~/../Downloads/trip_fare/trip_fare_1.csv"))
|--------------------------------------------------|
|==================================================|
   user  system elapsed 
  27.95    0.77   22.06

Intel i7-6800K @ 3.40 GHz
Installed RAM: 128 GB
Windows 10.

MichaelChirico commented 6 years ago

FWIW I noticed a large-ish file I was reading slow down a tad after going from a dev version in Dec/January or so to now (I noticed because before it didn't produce a progress bar, now it does).

I decided to document this by using this script:

https://gist.github.com/MichaelChirico/afb9949027d720629f0934a5398108b7

To run this script:

https://gist.github.com/MichaelChirico/63ae2e4cf87079d9a45b7fb17082820e

The first script contains the commit hashes for the 35 most recent commits affecting fread.c. The second uses this as input to install data.table from each commit, then times 5 runs of lapply(f, fread) on 10 files, each with about 4.5 million rows x 26 columns. I can't share the data.

I'm on a MacBook Pro (High Sierra 10.13.3 / 2.5 GHz Intel Core i7 / 16GB 1600 MHz DDR3 RAM); here's the result:

[chart: fread timings across the 35 most recent fread.c commits]

Overall there hasn't been crazy variation, but the variation is there. Speed appears to have peaked in early December, and has been creeping up a bit since.

jangorecki commented 6 years ago

@MichaelChirico it would be great if you could put those tests in macrobenchmarking/data.table/fread.Rraw

MichaelChirico commented 6 years ago

you mean the scripts? or the output

On Feb 23, 2018 12:12 PM, "Jan Gorecki" notifications@github.com wrote:

@MichaelChirico https://github.com/michaelchirico it would be great if you could put those tests in macrobenchmarking/data.table/fread.Rraw


st-pasha commented 6 years ago

There's always a trade-off between speed and robustness. For example, a recent change in parsing of doubles added a few checks for extra digits and for correctness of the number literal. Those checks probably slowed down the parser by a few percentage points, but with the benefit of improved functionality. Things like this may add up. On the other hand, some genuine minor inefficiencies could also have been introduced -- hard to know... The overall change that you see is roughly a 5% slowdown, so it's not anything dramatic (it looks scary on the chart only because the origin is not at 0).

Also, some time ago there was a change in progress bar logic to make it appear earlier.

MichaelChirico commented 6 years ago

I agree it's minor; I set out to document this given the appearance of a new parser that claims to perform better.

I think it's great for fread to robustly/automatically handle such a diversity of weird input files (which, at scale, are common "in the wild"); as Hugh points out, though, if the user does know that they have a prim-and-proper simple csv, the potential speed-up could be quite large.

MichaelChirico commented 6 years ago

@HughParsonage what's the memory performance of readr vis-a-vis fread for this case? I wonder how much of the second-run performance comes from a more liberal absorption of user memory by readr, perhaps...

HughParsonage commented 6 years ago

I'm embarrassed to say I don't really know how to benchmark memory:

> gc(1,1)
Garbage collection 26 = 16+5+5 (level 2) ... 
25.7 Mbytes of cons cells used (51%)
6.6 Mbytes of vectors used (52%)
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 480733 25.7     940480 50.3   480733 25.7
Vcells 856189  6.6    1650153 12.6   856189  6.6

> pryr::mem_change(data.table::fread("~/../Downloads/trip_fare/trip_fare_1.csv"))
|--------------------------------------------------|
|==================================================|
140 MB

> gc(1,1)
Garbage collection 58 = 29+7+22 (level 2) ... 
32.4 Mbytes of cons cells used (30%)
139.7 Mbytes of vectors used (9%)
           used  (Mb) gc trigger   (Mb) max used  (Mb)
Ncells   605389  32.4    2051488  109.6   605389  32.4
Vcells 18307526 139.7  199271127 1520.4 18307526 139.7

> pryr::mem_change(readr:::read_trip_fare(normalizePath("~/../Downloads/trip_fare/trip_fare_1.csv")))
864 B
> gc(1,1)
Garbage collection 66 = 29+7+30 (level 2) ... 
34.2 Mbytes of cons cells used (33%)
140.0 Mbytes of vectors used (10%)
           used  (Mb) gc trigger   (Mb) max used  (Mb)
Ncells   638792  34.2    1946970  104.0   638792  34.2
Vcells 18338486 140.0  184058497 1404.3 18338486 140.0

> gc(1,1)
Garbage collection 67 = 29+7+31 (level 2) ... 
34.2 Mbytes of cons cells used (33%)
140.0 Mbytes of vectors used (12%)
           used  (Mb) gc trigger   (Mb) max used  (Mb)
Ncells   638797  34.2    1946970  104.0   638797  34.2
Vcells 18338514 140.0  147246797 1123.5 18338514 140.0

Restarting R session...

> gc(1,1)
Garbage collection 22 = 14+4+4 (level 2) ... 
23.7 Mbytes of cons cells used (59%)
6.0 Mbytes of vectors used (47%)
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 443330 23.7     750400 40.1   443330 23.7
Vcells 783156  6.0    1650153 12.6   783156  6.0

> pryr::mem_change(readr:::read_trip_fare(normalizePath("~/../Downloads/trip_fare/trip_fare_1.csv")))
9.62 MB
> gc(1,1)
Garbage collection 43 = 18+4+21 (level 2) ... 
29.9 Mbytes of cons cells used (27%)
14.6 Mbytes of vectors used (1%)
          used (Mb) gc trigger   (Mb) max used (Mb)
Ncells  559042 29.9    2051488  109.6   559042 29.9
Vcells 1912426 14.6  181141032 1382.0  1912426 14.6

> pryr::mem_change(y <- readr:::read_trip_fare(normalizePath("~/../Downloads/trip_fare/trip_fare_1.csv")))
1.51 GB

> pryr::mem_change(y2 <- data.table::fread(normalizePath("~/../Downloads/trip_fare/trip_fare_1.csv")))
|--------------------------------------------------|
|==================================================|
1.43 GB
MichaelChirico commented 6 years ago

neither do i tbh... valgrind is the buzzword i have in mind 🤔

MichaelChirico commented 6 years ago

https://github.com/burntsushi/xsv

just came across this, not sure if worth a separate issue, but could also be useful to benchmark since it's getting hacker news hype

In the README, I see a timing of reading/summarizing this file:

http://burntsushi.net/stuff/worldcitiespop.csv

In about 12 seconds

On my machine, with data.table, download+read+summarize took 60 seconds; read+summarize took 9.4 seconds out of the box, vs:

time xsv stats worldcitiespop.csv --everything | xsv table
field       type     sum                 min            max            min_length  max_length  mean                stddev              median      mode         cardinality
Country     Unicode                      ad             zw             2           2                                                               cn           234
City        Unicode                       bab el ahmar  Þykkvibaer     1           91                                                              san jose     2351892
AccentCity  Unicode                       Bâb el Ahmar  ïn Bou Chella  1           91                                                              San Antonio  2375760
Region      Unicode                      00             Z9             0           2                                                   13          04           397
Population  Integer  2289584999          7              31480498       0           8           47719.570633597126  302885.5592040396   10779                    28754
Latitude    Float    86294096.37312101   -54.933333     82.483333      1           12          27.188165808468785  21.95261384912504   32.4972221  51.15        1038349
Longitude   Float    117718483.57958724  -179.9833333   180            1           14          37.08885989656418   63.223010459241635  35.28       23.8         1167162

real    0m7.890s
user    0m15.361s
sys 0m1.475s

Of course the summarize command is only doing that, whereas fread will bring the entire dataset into memory before summarizing, so it still feels like fread has the advantage.

More potential benchmarks here. Might be useful to run all these commands and get the total time for xsv vs data.table to help illustrate the advantage of getting the object in memory.

Anyway it looks like a nice tool for poking around CSVs on the command line.

HughParsonage commented 6 years ago

Nice find! Ostensibly a slightly different domain: xsv appears to be more aimed at one or two queries on a fresh csv where the 'bottleneck' of fread is enough to make data.table slower. But the gap appears to close pretty damn fast and so it would seem like data.table should dominate in almost all use-cases.

jangorecki commented 6 years ago

Loading data into R requires populating R's global string cache, which AFAIK is single-threaded, so it will be relatively easy for non-R tools to be faster than fread. A fair comparison in this case would be against fread's C code, without involving R. Anyway, hopefully we will address that (populating R's global string cache) in future.

st-pasha commented 6 years ago

In python datatable, the file was read in 0.54s; reading+summarizing took 4.9s

MichaelChirico commented 6 years ago

Excellent... maybe worth a short blog post then (could also be used to show off pydatatable, as well as the new-ish benchmarking vignette)... if I find time soon I'll at least outline one 👍