csgillespie / efficientR

Efficient R programming: a book
https://csgillespie.github.io/efficientR/

add arrow/parquet as alternative in Chapter 5 #292

Open engineerchange opened 3 years ago

engineerchange commented 3 years ago

I saw a debate online about using feather vs. arrow's parquet, and parquet seems like a viable alternative in terms of efficiency that is worth benchmarking in Chapter 5: Efficient input/output.

https://ursalabs.org/blog/2019-10-columnar-perf/

Robinlovelace commented 3 years ago

Hi @engineerchange, first I'd like to say many thanks for keeping this repo lively. Your agitating for more computationally efficient implementations is greatly appreciated from the perspective of updating the book (and perhaps from the perspective of engineering positive change in the world beyond computing)!

Have you seen any benchmarks comparing parquet vs vroom, and do you know if the R implementation, which seems slower than the Python implementation in Wes's tests, has sped up?

For me the only question is 'when' not 'if': has the R implementation reached a sufficient level of maturity to be worthy of inclusion in the book? On a related note I'd like to add duckdb to the book, assuming it's ready.

engineerchange commented 3 years ago

That's a very kind note from you - I use your book as a cheatsheet like most, so I am happy to hear that my agitations are appreciated! 😅

I don't have many answers here except to suggest it as an option. I struggle to know when a package has reached a level of maturity that would be appropriate for a publication like this. I think this effort would be a good way to document some benchmarks, however.

Yeah, duckdb is quickly moving into ⭐ status in the R world, and including it is probably a good idea. I was poking around with it a bit this weekend, and I think it's likely the best way to introduce SQL to an R user, and likely to someone brand new to coding.

Robinlovelace commented 3 years ago

Fantastic. Well... in the interests of keeping our giant 'cheat sheet' up-to-date, any further comments, and especially suggested changes via PRs, are very welcome ;)

Robinlovelace commented 3 years ago

Heads-up @engineerchange: I've created a PR that aims to compare the vroom and arrow options: https://github.com/csgillespie/efficientR/pull/293

It's a work in progress; comments on it or additions to it are welcome!