Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

Building a data.table community of packages with “Seal of Approval” #5723

Closed TysonStanley closed 8 hours ago

TysonStanley commented 8 months ago

With the goal of building a community of packages that have similar philosophies and syntax that are separate from data.table (and outside of data.table scope #5722), we would like to set up a “Seal of Approval” (play on the mascots of data.table) process. The process for a package receiving the Seal of Approval could be: 

Approval will include being listed as a Seal of Approval package on the data.table repository and an SVG of the “seal” that they can include on their own repository/package logo. The initial idea would be packages that do at least one of the following:

Possible examples of this could include packages that have few dependencies (e.g. tinytest), extend functionality (e.g. dtplyr, tidytable, tidyfast), and packages that use data.table on the backend (e.g. modelsummary).

This process would hopefully help other developers feel more connected to data.table and be more likely to want to support it. Things for us to decide on are:

  1. Does this idea resonate with the data.table community?
  2. If so, does the list of criteria make sense for the purposes of the Seal of Approval?
  3. What else should be considered in designing the Seal of Approval?

jangorecki commented 8 months ago

I believe data.table was made to play nicely with any package, by following many conventions from base R. Creating a "seal approved" label may give the impression that some packages work better with data.table, while others don't work well or don't work at all with data.table...

Rather than having a community of packages, I would prefer all packages to be in the community.

TysonStanley commented 8 months ago

Thanks for your feedback on it. I agree that data.table is nicely designed to work well with all sorts of packages (and in ways that are not always obvious!). I don't think our intention would be to say there are certain packages that work best with data.table and others don't. The goal would be to help build the community around data.table. This was just one idea of how we could engage more R users and get them into the data.table repository more. We would also hope that it would spawn more ideas of how to use data.table with other packages, across more situations. The documentation (and other resources) on data.table is vast, but I think there are still a lot of users who don't find it (and how to use it) early enough.

Do you have other suggestions on how to make entry into data.table use and development easier?

AngelFelizR commented 8 months ago

I have been using data.table as a user, but I know that many parts have been written in C. I have no clue where I could start to learn C to make meaningful contributions.

I would like to have some guidelines for novice users who want to contribute to data.table.

MichaelChirico commented 8 months ago

Is there anything besides an approval process {data.table} maintainers would be committing to as part of this?

Would the approval be granted in perpetuity / renewed regularly / granted with possible revocation under "certain circumstances" (which)?

jangorecki commented 8 months ago

> I have been using data.table as a user, but I know that many parts have been written in C. I have no clue where I could start to learn C to make meaningful contributions.
>
> I would like to have some guidelines for novice users who want to contribute to data.table.

I was in the same boat. Reading the code of PRs was quite useful, but the game changer was to start coding; I started with rolling mean. Then I received great feedback on my PRs, mostly from Matt, so it was easy to pick up good practices. Often I draft a naive version in R that reflects how I will code it in C; I might even skip functions like sum and write a for loop instead.
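
A minimal sketch of that loop-first style, written directly in C. It assumes a right-aligned window and NaN for positions with an incomplete window; the function name `naive_roll_mean` is hypothetical, and this is not data.table's actual implementation:

```c
#include <math.h>    /* NAN */
#include <stddef.h>  /* size_t */

/* Naive rolling mean with explicit loops (illustrative only).
 * out[i] receives the mean of the k values ending at x[i].
 * Positions with fewer than k values available are set to NAN.
 * Assumption: the caller allocates out with room for n doubles. */
void naive_roll_mean(const double *x, size_t n, size_t k, double *out)
{
    for (size_t i = 0; i < n; i++) {
        /* Incomplete window (or a zero-width window): no mean to report. */
        if (k == 0 || i + 1 < k) {
            out[i] = NAN;
            continue;
        }
        /* Sum the window by hand instead of calling a helper. */
        double sum = 0.0;
        for (size_t j = i + 1 - k; j <= i; j++) {
            sum += x[j];
        }
        out[i] = sum / (double)k;
    }
}
```

Recomputing each window from scratch costs O(n*k), which is fine for a first draft: the point is to get the indexing right before optimizing.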

AngelFelizR commented 8 months ago

What resources do you recommend for learning C, and which data.table function would be a good starting point?

jangorecki commented 8 months ago

The source code of the project you are going to contribute to. And which function... the one that doesn't exist yet :)

MichaelChirico commented 8 months ago

@AngelFelizR The r-contributors slack (r-contributors.slack.com) hosted a book club on learning C for R users last year:

https://github.com/r-devel/c-book-club/

I believe there are videos still available; try asking in the #book-club-modern-c channel there. Otherwise expressing new interest is a way to get the book club running a second time (others have already inquired).


As for data.table's own C code, I think the most straightforward stuff would be:

I quite like recent improvements to GitHub's in-browser code-reading experience BTW, you can click through on function calls to find their definition / where symbols are defined / hover-over for their types.


Lastly, keep in mind that there's a ton of R code in data.table to improve as well! Over 8,000 lines already.

AngelFelizR commented 8 months ago

@MichaelChirico, thank you for your advice. I hope to contribute C code in the long term to continue progressing this amazing project.

I want to be prepared to the point where we can take on the task of moving data.table to work with data on disk.

I am here because the data.table survey asked if I wanted to contribute, so I started reading the issues.

tdhock commented 8 months ago

another way to contribute, even without knowing much about how data.table works (in C or otherwise), is to look at the open issues, and try to see if you can reproduce a bug report, then add a comment on the issue that explains what you did and whether or not the issue is reproducible. (and if you can make a simpler example than what is reported, that is even better)

tdhock commented 3 months ago

To make this a concrete proposal:

  1. we will add a section in README.md entitled "Seal of Approval" with a brief statement explaining that this is a list of packages which are built using data.table, etc.
  2. the approval process is the same as for anything. If you want to list your package under Seal of Approval, submit a PR that changes README.md (add your package to the list). Make sure the change has a link to your package web page, and also a brief description of how data.table is used, and what new/interesting/unique functionality your package provides.
  3. The data.table maintainers will judge submissions based on relevance -- does this package provide a new/interesting/unique functionality beyond what is provided by data.table? Also I believe Seal of Approval packages should be outside of data.table scope. A data.table maintainer will merge the PR if there is consensus, just like for any other PR.

For example, I have been developing https://cran.r-project.org/package=nc which provides named capture regex functionality, using and outputting data.tables, and I would like to propose that package for inclusion under Seal of Approval.

Another example would be the mlr3 packages which are built using data.table.

I see the Seal of Approval as a way of building community, by increasing awareness about how widely-used data.table is among other R packages.

TysonStanley commented 3 months ago

I think this will ultimately be a pretty low lift while allowing more public connections to the community.

tdhock commented 3 months ago

other packages to consider: https://cran.r-project.org/package=maditr https://cran.r-project.org/package=getDTeval

tdhock commented 3 months ago

glad to see some positive feedback to my proposal. also would be cool to have some logo with a sea lion giving a thumbs up, does anybody have graphics/art skills? @Maradestefanis ? My vision is that the README.md should have a one-line mention of the package, with a link to a blog post on https://rdatatable-community.github.io/The-Raft/ which gives further details. So that would entail a little extra work for the package author: writing that blog post. (But no extra work for data.table devs, who just review the PR with a change to README.md)

MaraDestefanis commented 3 months ago

Hey @tdhock I'm stepping in and I can give the logo a shot. It would be great to have it in high definition, if possible. Can you provide that? Also, is there anything else you need from me for the blog?

tdhock commented 3 months ago

Hi Mara the existing logo graphics files are in https://github.com/Rdatatable/data.table/tree/master/.graphics, is that high enough definition?

MaraDestefanis commented 3 months ago

Yes, awesome! I've been trying it out these days

MaraDestefanis commented 2 months ago

Toby, I am working on the graphic of a sea lion giving a thumbs up. For now the result is mediocre and not very nice. I will try again and send you the result in a few days.

I'll keep pushing forward.

kbodwin commented 2 months ago

Hi all,

I'm reaching out with a couple ideas/options for the Seal of Approval process, to see if we can find one that everyone agrees on.

On this repository:

@jangorecki expressed some concern with a list on this repository's ReadMe, because it implies some kind of closed community, instead of data.table being accessible to anyone.

I propose that we simply add a Seal-of-Approval.md file in this repo that contains a simple list of packages that have gotten approval. Then, we can link to this md at the bottom of the ReadMe, in the Community section, and reserve all additional details for blog posts on the raft instead of them clogging up this repo.

Approval process:

At first, @TysonStanley had suggested that approval be initiated with a PR to this repo, and @MichaelChirico was wondering about the expectations from maintainers for reviewing.

I want to suggest a reverse-order:

This would be kind of a mini "journal-style" process that would maybe take some of the burden off the maintainers.

Longevity:

Michael also asked whether this approval is granted in perpetuity or not. I think, purely workload-wise, we wouldn't commit to periodic re-reviews. However, if someone were to alert us to an issue with a package (say, it's no longer actively maintained), we'd take it off the list at the maintainers' discretion.

Type of SoA Packages:

I've come up with four types of packages that might merit approval; in principle, a submitter would have to justify the package falling in one or more of these categories. I'd love feedback if anything seems amiss:


So, tl;dr, in this proposal:

Let me know if this sounds workable to you, or if you have other suggestions! :)

tdhock commented 3 weeks ago

Since there are no major blocking concerns with Kelly's most recent proposal, I would suggest that we go ahead with that.

TysonStanley commented 21 hours ago

Since the seal of approval is moving forward with Kelly's suggestion, should we close this now?