LOST-STATS / lost-stats.github.io

Source code for the Library of Statistical Techniques
https://lost-stats.github.io/
GNU General Public License v2.0

data.table variants #67

Closed NickCH-K closed 3 years ago

NickCH-K commented 3 years ago

Thinking of going through at least the data manipulation pages and adding data.table versions for all the R examples. Is this different enough to be worth it?

aeturrell commented 3 years ago

My (poor) understanding is that data.table provides 1) another, potentially faster, way to read data in and 2) a different syntax to dplyr (a quick look suggests it works more like Python’s pandas). Is that the gist or are there other major differences?

If it’s got broadly the same functionality as existing featured packages, then I think it comes down to how widely used it is likely to be. I know there’s a split between tidyverse and non-tidyverse adherents in R so it could be good to serve the non-tidyverse folks.

Talk of data.table has reminded me of something else. (Apologies for drifting off topic a bit here.)

A page on big, or large, data ingestion methods might be useful. Then it could have something on data.table, etc, for R, and using pandas & parquet, dask, etc., for Python.
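On the R side, the ingestion speed-up is mostly about data.table::fread(). A minimal sketch (not from the thread, and assuming a hypothetical local file large.csv):

```r
library(data.table)

# data.table::fread() is typically much faster than read.csv() on large
# files, and it guesses the separator and column types automatically.
dt <- fread("large.csv")

# The base-R equivalent, for comparison:
df <- read.csv("large.csv")
```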

And although they're quite linked, it could also be very useful to have a page on data storage, not just ingestion. I see Excel, Stata .dta, or R data files used in cases where they really aren't a good idea, e.g. long-term storage and big data (!).


NickCH-K commented 3 years ago

It's not just a different syntax: it's also much, much faster than dplyr overall, not just for reading in data but for manipulating it too. The syntax differences are largely cosmetic, of course; there's actually a dtplyr package that translates most dplyr code into data.table syntax.
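To illustrate the dtplyr point, here's a sketch (not from the thread) of dplyr verbs being translated into data.table code via lazy_dt():

```r
library(dplyr)
library(dtplyr)
library(data.table)

# Wrap a data frame as a lazy data.table; dplyr verbs applied to it
# are translated into data.table operations rather than run directly
lazy <- lazy_dt(mtcars)

res <- lazy %>%
  filter(mpg > 20) %>%
  group_by(cyl) %>%
  summarise(mean_hp = mean(hp))

# Inspect the data.table code that dtplyr generated
show_query(res)

# Collect the result as an ordinary tibble
as_tibble(res)
```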

As for usage, my understanding is that it's considerably less popular than dplyr overall, especially in academia, but perhaps more popular than dplyr among the data science crowd, since they work with big data sets. Neither dplyr nor data.table is base R, though (although data.table is much closer). So it doesn't really resolve the non-tidyverse thing.

A page on big-data ingestion and storage methods would be great. There's already a page under Other on importing foreign files generally, but a page focused specifically on large data would be distinct enough to justify it, I think.

aeturrell commented 3 years ago

I see, thanks for explaining (and I might check it out next time I'm using R!).

It's definitely not my call, but if the main difference is speed of manipulation, then my view is that it might fit better on a separate page addressing data manipulation for large datasets (or computationally intensive methods), since that seems to be the main use case.

I say this as I think the typical user looking for info on data manipulation might find the extra detail covering special cases (here, scaling to bigger problems) a bit overwhelming. Essentially, I wonder if it could detract from the clarity of the pages.

And then one could have a page explicitly addressing scaling/speed/big data issues, which are not trivial to assess and explain. (Lazy execution versus on-the-fly and in-memory for a start.)

However, I'm not familiar with the packages in question and I can see it both ways so 100% supportive if you decide to go another way.

khwilson commented 3 years ago

R deeply confuses me on these points, because I'm pretty sure that with the addition of data.table we're up to four major competing syntaxes for manipulating data in R: http://www.amelia.mn/Syntax-cheatsheet.pdf

There are many other distinctions about the internals of data.table or dask or d[tb]plyr or pandas or... but the original idea of this repository was to show you how to do the same things in many languages. In that way, the syntax issue probably overpowers the niceties of "when should I choose what" for this particular repository (though I look to the BDFL Nick's determination on that :-) ).

Perhaps the solution here is to actually lean into that conception of LOST? It may make sense, for instance, to set up every page as an L x E matrix, where L is the number of languages LOST supports and E is the number of examples we have for a particular method. This would also make it pretty clear to newcomers what work remains to be done. :-)

To Arthur's other point, a "philosophy" page on "when you should use what" would also, I think, be great. It would help people answer the question, "Should I invest now in learning technology X instead of just hacking around with what I know?"

NickCH-K commented 3 years ago

It's more like three competing syntaxes - the formula syntax mentioned there is very rarely used for data manipulation, it's more for model creation (although it does sometimes pop up for stuff like reshaping wide/long). And two of those syntaxes - base (called "dollar sign syntax" there) and data.table are pretty similar in a lot of ways (and base is IMO pretty clearly the inferior of the three).
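For concreteness, here's a sketch (not from the thread) of the same filter-and-aggregate operation in the three syntaxes being compared, using the built-in mtcars data:

```r
library(dplyr)
library(data.table)

# Base R ("dollar sign" syntax)
aggregate(hp ~ cyl, data = mtcars[mtcars$mpg > 20, ], FUN = mean)

# dplyr (tidyverse)
mtcars %>%
  filter(mpg > 20) %>%
  group_by(cyl) %>%
  summarise(hp = mean(hp))

# data.table: subset, compute, and group in one bracket call
as.data.table(mtcars)[mpg > 20, .(hp = mean(hp)), by = cyl]
```

Note how the data.table form echoes base R's bracket subsetting, which is part of why the two feel similar.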

I like the idea of "when you should use what" as a page - it would also help set apart the examples here more effectively from something like StackExchange, which answers a question and provides some code, but usually isn't great at comparing different approaches (or languages!).

The other alternative is treating R as a special case and, for the purposes of data manipulation pages, basically treating R dplyr and R data.table as two separate languages. They nearly are!

grantmcdermott commented 3 years ago

Quick 5c:

I've been thinking of doing this for quite a while. So I'm certainly in favour. Nick we can divide and conquer if you'd like.

I agree tyranny of choice is a problem. But I think most R users would agree that dplyr (tidyverse) and data.table provide the canonical data manipulation methods. So having both makes sense to me. Yes, base R can do nearly everything too, but it's slower and more cumbersome to type.

On a less structured note, I'd like to see data.table included if for no other reason than that I'm such a fan of it, personally. The syntax is not to everyone's liking (I don't see a problem), but it really is the fastest and most powerful game in town for an astonishing number of applications.

grantmcdermott commented 3 years ago

Oh, should also add: We're really lacking on the Julia examples atm. I'll try to add some when I get time (ha!).

grantmcdermott commented 3 years ago

I think we can close this now. I've added one example here and will be encouraging my students to submit data.table equivalents (i.e. as part of their OSS contribution requirement in my class).

aeturrell commented 3 years ago

In case you're interested, there's now also Py datatable, polars, and even cuDF in the Python alternatives-to-pandas ecosystem. (Not sure adding examples with these is high value for now though.)