HenrikBengtsson / matrixStats

R package: Methods that Apply to Rows and Columns of Matrices (and to Vectors)
https://cran.r-project.org/package=matrixStats
203 stars 33 forks source link

Naming support #197

Closed yaccos closed 3 years ago

yaccos commented 3 years ago

As we dicussed yesterday at issue #190, you were worried that mimicing the behavior of base::rowSums, returning the rownames in the results would result in a performance overhead. I have now spent the last two days resolving the problem, and created an implementation touching only the C-code. At the time being, I have only implemented the feature for rowSums2 (and consequently for colSums2), but expanding the solution to the rest of the family should be straight-forward. I have tested the package and modified the tests to check whether the names are correct. I leave it to you to benchmark the modifications, but here is what I would expect to happen:

Also, I notice that the package installed size of the package exceeds the recommended limit of 10 MB and that some of the function factory macros are hard to read. You could migitate this problem by writing a custom version of validateIndecies (in the C version of course) which always coerse the data under ansNidxs to R_xlen_t in the case of SUBSETTED_INTEGER and SUBSETTED_REAL, leaving the subsettedType parameter superflous. For functions like rowSums2, the coersion to R_xlen_t must be done anyway. In the case of SUBSETTED_ALL, a NULL pointer is returned, making it easy to deal with this case separately.

yaccos commented 3 years ago

Sorry, I forgot to remove some debugging print statements in naming.c, so you know were the strange output comes from.

yaccos commented 3 years ago

Have you had time to review my solution yet? It was primarly meant as an inspiration to how the API could deal with naming in an efficient manner. Also if it is important to reduce overhead by calling R-code, I could advise that you replace dim. <- as.integer(dim.) and na.rm <- as.logical(na.rm) in the R-code with calls to coerceVector in the C-code.

HenrikBengtsson commented 3 years ago

Hi @yaccos, sorry for the radio silence. I've just got too many things on my plate and earth has this habit of doing one revolution every 24 hours - just too fast. Thank you so much for this proof-of-concept PR and for spending time on this problem.

I agree it would be useful if the matrixStats API could provide consistent handling of name attributes. I'm worried that rolling it out will disruptive and that's why this apparently "straightforward" task has been procrastinated for years now. I've spent a couple of days thinking about and writing up a plan for rolling it out safely; the gist is to roll it out with 100% backward compatibility to today's inconsistent behavior, while still making it available for users and developers to start making use of it. This way we can punt on the problem of switch the default behavior.

Coincidentally, I think it is a good candidate project for R-GSOC 2021, so I wrote up the plan with that in mind. See https://github.com/HenrikBengtsson/matrixStats/wiki/GSOC-2021-Proposal for how I think about approaching this problem. What do you think? If you think this is a good idea, let's have a chat offline (email or something).

yaccos commented 3 years ago

Yes, I understand very well that time and effort is a limited resource, especially when being an academic and not being payed to develop and maintain R packages. As popular and useful as the matrixStats package is, I consider it sad that there are lacking sufficient resources to maintain it. I therefore support your suggestion of including the task of naming consitency for R-GSOC 2021, and will send you an email in a few days to clearify the details on my views.

SebKrantz commented 3 years ago

Hello both of you, I‘d like to know the status of this. I‘d like to move forward to publish a package that implements this very soon (sooner than end of the summer or end of the year). I think this solution is pretty good, but if it does not live up to your demands Henrik or you don't have time to swiftly review it if I apply this to the rest of the functions I‘ll be happy to publish matrixStats2. From yaccos I‘l like to inquire if you have made any further progress on this or thought about publishing a package in the meantime that implements it. Best regards, Sebastian.

HenrikBengtsson commented 3 years ago

For the record: Support for useNames is now available in matrixStats (>= 0.60.0). This is part of a GSOC 2021 project. It is currently implemented in R and work on implementing this in C is on its way.