Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

`data.table` principles #5693

Closed TysonStanley closed 10 months ago

TysonStanley commented 11 months ago

As part of #5676 we would also like to compile a list of principles tied to data.table. This will be incorporated into other material over the next few months, but we wanted to see what you all thought about the list we have initially put together.

Anything you'd add to this list? Anything you'd argue does not belong on the list?

Thanks!

jangorecki commented 11 months ago

Low memory usage.

Regarding point 6: I am not sure. We provide only Chinese translations, so I don't think the point fits very well.

TysonStanley commented 11 months ago

Thanks, @jangorecki. I guess 6 was more of a goal than a current practice? I really like the addition of low memory usage. Will include that.

jangorecki commented 11 months ago

I don't think we want to maintain many languages. Chinese, Russian, Spanish/Portuguese is a reasonable maximum IMO.

DavidArenburg commented 11 months ago

Good list.

Re 5, I think you could rephrase to "backward compatibility".

Re 6, what does it mean? Documentation-wise? If so, who is going to maintain that?

jangorecki commented 11 months ago

@DavidArenburg Only error/warning/verbose messages. We have them translated to Chinese for users whose session runs in a Chinese locale. Point 4 is about backward compatibility (in our API). Point 5 extends it to running on old R versions.

jangorecki commented 11 months ago

I would possibly add comprehensive documentation to the list. I haven't seen a package documented better than DT, actually. Many just write a minimal manual and put more info into vignettes, which is indirect documentation when it comes to the description of a function, its return value, etc. Vignettes should be accompanying documentation, not the main one.

tdhock commented 11 months ago

About international/multilingual/translations, it is true that only Chinese is supported in current message translations. Going forward in the next two years, I plan to invite more translators (of messages and docs), and I actually have money to pay them (20 translation projects, US$500 each). I expect that whoever contributes the initial translation may be interested in maintaining it in the future. The goal of the translation effort is to increase the number of potential users and contributors in the data.table ecosystem.

tdhock commented 11 months ago

I wonder if you could please clarify point 4? Maybe change "Few breaking changes" to "Few breaking changes, to make it easy for other packages to use data.table" Is that what you meant?

TysonStanley commented 11 months ago

By point 4, yes, in my experience DT was always very careful about any releases that would have breaking changes requiring changes to other packages/code bases. There could be a better way of phrasing it but that was the idea behind it.

TysonStanley commented 11 months ago

> I would possibly add comprehensive documentation to the list. I haven't seen a package documented better than DT, actually. Many just write a minimal manual and put more info into vignettes, which is indirect documentation when it comes to the description of a function, its return value, etc. Vignettes should be accompanying documentation, not the main one.

@jangorecki I agree. I'll add comprehensive documentation to the list as number 8.

markseeto commented 11 months ago

Maybe consider including something about readability/usability. This could be its own principle, or part of principle 3, e.g. "Concise syntax (minimal redundancy in code), while maintaining readability and ease of use".

The reason I suggest this is that data.table seems to have a reputation for being fast but relatively difficult to learn and use. I sometimes see comments like (paraphrasing) "tidyverse is fantastic, and in situations where speed is really important, there's data.table", as though the only advantage of data.table is its speed.

Maybe also consider adding something about extensive functionality, unless this goes without saying.

jangorecki commented 11 months ago

> relatively difficult to learn and use

It boils down to where you as a user are coming from. If you are a psychologist just doing some stats, then I can imagine you may find it hard. If you are coming from data analytics (databases, SQL), maths (relational algebra) or engineering, you are likely to find it not only easy, but much easier than anything else that exists in R (for data.frame), and far superior to the tools you are coming from. My career shifted from data warehouses to R exactly because of that.

I understand your point well, and am observing the same. It's just that if we want to counter some judgments about data.table syntax, marketed at some point by a new project that was targeting a less technical audience, then we should try to do it very precisely. @arunsrinivasan made a nice comment on syntax in his SO answer here: https://stackoverflow.com/a/27718317/2490497

> Maybe also consider adding something about extensive functionality, unless this goes without saying.

Another good point.

TysonStanley commented 11 months ago

@markseeto for the extensive functionality, I think that makes a lot of sense. As I think about it, there is definitely some overlap with concise syntax as there are a bunch of things that can be done without going away from the DT[i, j, by] syntax (e.g., any data frame operation, grouping functions, aggregation, joins, etc.). Is there a way to communicate that concisely in the list? Something like "Extensive functionality with minimal need for additional functions" or something?
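As a minimal sketch of what staying within `DT[i, j, by]` can look like: the tables `sales` and `regions` and their columns are made up for illustration, and only standard data.table calls are used.

```r
library(data.table)

# Hypothetical toy data; all names here are illustrative only.
sales <- data.table(
  region = c("north", "north", "south", "south", "south"),
  units  = c(10L, 20L, 5L, 15L, 25L)
)
regions <- data.table(
  region  = c("north", "south"),
  manager = c("Ann", "Bo")
)

# i: filter rows.
big <- sales[units >= 15L]

# j with by: aggregate within groups.
totals <- sales[, .(total_units = sum(units)), by = region]

# i as another table: join on a shared column.
joined <- sales[regions, on = "region"]
```

Filtering, grouped aggregation, and a join all use the same bracket form here, without reaching for a separate function per operation.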

@jangorecki thanks for that link. Feel like that answer should be turned into a blog post or something too. So much gold in that. Also, I think your point of it naturally fitting with SQL (relational databases) is one of its immediate strengths in learning the code. Was wondering if there is a principle there, potentially? Like "syntactic overlap with data analytics, engineering, and mathematics" or something like that?

markseeto commented 11 months ago

@TysonStanley For "extensive functionality", what I'm thinking of is separate from concise syntax, although I agree that there is some overlap. I'm thinking of the ability to do an extensive range of useful operations with the data, whether that's with DT[i, j, by] syntax or with functions like dcast, groupingsets, etc. But maybe this isn't really a "principle" like the principles you've listed.
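To make the distinction concrete, here is a small sketch of two such functions, `dcast` and `groupingsets`, applied to a made-up table (`long` and its columns are assumptions for illustration only).

```r
library(data.table)

# Hypothetical toy data in long format.
long <- data.table(
  region  = c("north", "north", "south", "south"),
  product = c("a", "b", "a", "b"),
  value   = c(1, 2, 3, 4)
)

# Long to wide: one row per region, one column per product.
wide <- dcast(long, region ~ product, value.var = "value")

# Subtotals by region, by product, and a grand total in a single call.
subtotals <- groupingsets(
  long,
  j    = .(total = sum(value)),
  by   = c("region", "product"),
  sets = list("region", "product", character(0))
)
```

Neither call uses `DT[i, j, by]` directly, yet both cover operations that would otherwise take several manual steps.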

MichaelChirico commented 11 months ago

> I don't think we want to maintain multiple languages. Chinese, Russian, Spanish/Portuguese is reasonable maximum IMO.

I think this table, which I prepared for core R, is a useful reference: 'Languages with R Translations' from https://docs.google.com/document/d/1XbfOf3CLVb2UFyUZGJoVLkBUDZ6Hs3APCDW8UzuOvZk/edit

A list which includes Russian/Spanish should include at least Arabic and a South Asian language (e.g. Hindi).

Anyway, I agree there is some maintenance overhead, but tooling changes can reduce that overhead. Rather than set an "arbitrary" limit, I'd rather the maintainers decide incrementally (one or two languages at a time) whether to accept new translations.

For now, my bigger concern has been package size. The checked-in .mo binaries are about 0.22 MiB per language, and the plain-text .po files are about 0.26 MiB -- precious storage given we're always bumping up against the size limit that triggers a CRAN note. There is some initial discussion with R core about generating .mo at build time, but that's probably a ways off still.

BTW, in the initial quest for Chinese translations, I made sure to make note of other community members offering translations in other languages. Those are: Vietnamese, French, Russian, Portuguese, Farsi, Turkish, Hindi. That was already 4 years ago, so of course we would need to check their interest again.

MichaelChirico commented 11 months ago

> Comprehensive documentation

I would say 'Comprehensive and accessible documentation'. I think we strive to have Rd/vignettes, error/warning messages, and NEWS entries that are both technically complete and user-friendly.

MichaelChirico commented 11 months ago

Is the list meant to be numbered? i.e. are these principles ranked? If so, putting computational & memory efficiency in the same bullet makes sense to me.

TysonStanley commented 11 months ago

@MichaelChirico thanks! It's not ranked necessarily so I made it bullets instead. And updated it with your suggestions.

tdhock commented 11 months ago

I feel like international/multilingual bullet point could be deleted, since that is the "accessible" part of "Comprehensive and accessible documentation" ?

tdhock commented 11 months ago

I think it would clarify/simplify to combine "Few breaking changes" with "Backward compatibility" since they both are about stability of the code. How about "Stable code base (easy for users to upgrade to new data.table, and compatible with old R versions)" or clarify each item? "Few breaking changes (easy for users to upgrade to new data.table versions)" "Compatible with old versions of base R"

tdhock commented 11 months ago

@MichaelChirico "I made sure to make note of other community members offering translations in other languages, those are: Vietnamese, French, Russian, Portuguese, Farsi, Turkish, Hindi. That's already 4 years ago, so of course would need to check their interest again." -> could you please send me their contact info, so I can ask if they would be interested in applying for translation project awards?

MichaelChirico commented 11 months ago

I shared the Google doc with you. It was mostly twitter replies.

stefanfritsch commented 11 months ago

I'd definitely add clear error messages that provide underlying causes, explanations and possible solutions.

I.e. not NA where TRUE/FALSE needed but "it seems you didn't specify x but we need it because of y. Try z if unsure."
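A rough sketch of the contrast, using a hypothetical helper `check_flag` (not a data.table function; the wording of its message is only an assumption about what "cause, explanation, possible solution" could look like):

```r
# Hypothetical helper, not part of data.table or base R.
check_flag <- function(x, arg = "x") {
  if (length(x) != 1L || !is.logical(x) || is.na(x)) {
    stop(sprintf(
      "'%s' must be a single TRUE or FALSE, but got %s. If unsure, try %s = FALSE.",
      arg, paste(deparse(x), collapse = " "), arg
    ), call. = FALSE)
  }
  invisible(x)
}

# Base R's bare condition check fails with a terse message.
terse <- tryCatch(if (NA) 1, error = conditionMessage)

# The helper names the argument, the problem, and a possible next step.
helpful <- tryCatch(check_flag(NA, arg = "verbose"), error = conditionMessage)
```

Comparing `terse` and `helpful` shows the difference: the second message says which argument was wrong, what was received, and what to try.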

Your errors have helped me convert a few users.

leofontenelle commented 11 months ago

> About international/multilingual/translations, it is true that only Chinese is supported in current message translations. Going forward in the next two years, I plan to invite more translators (of messages and docs), and I actually have money to pay them (20 translation projects, US$500 each). I expect that whoever contributes the initial translation may be interested to maintain in the future. The goal of the translation effort is to increase the number of potential users and contributors in the data.table ecosystem.

If Brazilian Portuguese is to be one of the languages, please contact me. I used to translate GNOME to pt_BR and even coordinated the national l10n team until I decided to focus on activities closer to my profession (medicine), which eventually came to mean doing research, which is how I know data.table. I'm not necessarily offering myself (although the money is tempting), but I can find one or two competent free software translators here and help them as needed.

edit: now I see someone else volunteered already, so I guess they should probably be the first choice

MichaelChirico commented 11 months ago

> edit: now I see someone else volunteered already, so I guess they should probably be the first choice

FWIW Mandarin took a team of 26 translators -- it's a rather sizeable pool of messages to translate, so having >1 hand available will be appreciated.