BoulderCodeHub / RWDataPlyr

R package to read and manipulate data from RiverWare™

Possible to speed up read.rdf #28

Closed: rabutler closed 7 years ago

rabutler commented 8 years ago

Would using fread or similar speed up the reading of the rdf file?

rabutler commented 8 years ago

Reading in an entire rdf file (153 MB) resulted in 22,226,721 elements and 170.4 Mb in R.

Using readLines this takes 13.32 seconds (12.44 user, 0.46 system).

Using data.table::fread('file.rdf', sep = '\t') this takes 2.4 seconds (2.28 user, 0.09 system).
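A comparison like the one above can be reproduced with `system.time`. The following is only a minimal sketch: the file is synthetic (not a real rdf file), the `slot_` line format is made up for illustration, and `header = FALSE` is an assumption added so that the first line is kept as data.

```r
# Write a throwaway text file so the comparison is reproducible.
# The content is synthetic; it only mimics a file with one field per line.
n <- 50000
tmp <- tempfile(fileext = ".txt")
writeLines(sprintf("slot_%d: %f", seq_len(n), runif(n)), tmp)

# Base R: returns one character element per line.
t_base <- system.time(x_base <- readLines(tmp))

# data.table: the lines contain no tabs, so sep = "\t" keeps each line as a
# single field. Guarded so the sketch still runs without data.table installed.
if (requireNamespace("data.table", quietly = TRUE)) {
  t_dt <- system.time(
    x_dt <- data.table::fread(tmp, sep = "\t", header = FALSE)
  )
  print(rbind(readLines = t_base[1:3], fread = t_dt[1:3]))
}

unlink(tmp)
```

Timings depend heavily on file size and hardware, so a small synthetic file may not show the same gap as the 153 MB file.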

rabutler commented 8 years ago

read.rdf takes 38.44 s (37 user, 1.36 system) on the same file. So we could likely reduce this from 38.44 s to 27.52 s by switching to fread.
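The 27.52 s estimate appears to come from subtracting the readLines-to-fread saving from the total read.rdf time. A quick check of that arithmetic, assuming the rest of read.rdf is unchanged by the switch (the variable names are illustrative only):

```r
# Elapsed times in seconds, as reported in the comments above.
read_rdf_elapsed   <- 38.44  # full read.rdf call
read_lines_elapsed <- 13.32  # readLines on its own
fread_elapsed      <- 2.40   # data.table::fread on its own

# Time saved on the raw read, and the projected total if nothing else changes.
saving    <- read_lines_elapsed - fread_elapsed
projected <- read_rdf_elapsed - saving

print(round(c(saving = saving, projected = projected), 2))
```

This is an upper bound on the benefit: it assumes the data returned by fread can be parsed at least as fast as the output of readLines.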

rabutler commented 8 years ago

Commit 45e4e56e0de50500ccf6844c6bdda1ef4766118a started to address this, with very minor improvements for large files, and slower reads for small files. read.rdf2 uses data.table::fread.

For a 156 MB file:

Function    User (s)  System (s)  Elapsed (s)
read.rdf    36.24     1.31        38.75
read.rdf2   37.25     0.11        37.66

For a 0.9 MB file:

Function    User (s)  System (s)  Elapsed (s)
read.rdf    0.51      0.03        0.55
read.rdf2   0.72      0.00        0.72

rabutler commented 8 years ago

Commit b01228817f0e1c28b395b4b9d4a08ef34931e314 converts the data from a data frame to a matrix before parsing everything. The comparisons are now:

For the 156 MB file:

Function    User (s)  System (s)  Elapsed (s)
read.rdf    36.24     1.31        38.75
read.rdf2   26.14     0.82        28.98

For a 0.9 MB file:

Function    User (s)  System (s)  Elapsed (s)
read.rdf    0.51      0.03        0.55
read.rdf2   0.44      0.00        0.44
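One plausible reason the matrix conversion helps is that indexing a data frame element by element carries more overhead than indexing a matrix. The sketch below is a rough illustration of that effect on synthetic data. It is not the package's parsing code, and the `key_` strings are made up.

```r
# Synthetic single-column character data, standing in for lines read from a file.
n <- 20000
df <- data.frame(V1 = sprintf("key_%d: %d", seq_len(n), seq_len(n)),
                 stringsAsFactors = FALSE)

# Character matrix with the same contents.
m <- as.matrix(df)

# Element-by-element access, as a line-oriented parser might do.
t_df <- system.time(for (i in seq_len(n)) df[i, 1])
t_m  <- system.time(for (i in seq_len(n)) m[i, 1])

print(rbind(data_frame = t_df[1:3], matrix = t_m[1:3]))
```

The one-off cost of `as.matrix` is paid before the loop, so the conversion only pays off when the data is indexed many times afterwards.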

rabutler commented 7 years ago

I don't think there are any more obvious enhancements to speed it up at this point.