HenrikBengtsson / matrixStats

R package: Methods that Apply to Rows and Columns of Matrices (and to Vectors)
https://cran.r-project.org/package=matrixStats
202 stars 33 forks source link

Packages breaking if useNames = FALSE becomes the new default #226

Closed HenrikBengtsson closed 1 year ago

HenrikBengtsson commented 1 year ago

I'm going to make a decision on what the default for useNames will be. Based on the newly added Design Goals:

The objectives of the matrixStats package is to perform operations on matrices (i) as fast as possible, while (ii) not using unnecessary amounts of memory. These objectives drive the design, including the choice of the different defaults.

I'll make the new default useNames = FALSE.

Actions

HenrikBengtsson commented 1 year ago

Revdep results

Four packages break when switching to useNames = FALSE; one package is on CRAN and three on Bioconductor, cf. https://github.com/HenrikBengtsson/matrixStats/blob/dbd47cd722b42c538946173945740c8a8b9b130c/revdep/README.md.

CRAN:

Bioconductor:

Importantly, I have not run recursive revdep checks on the above Bioconductor package, so more things might break there, when those packages are update too.

EDIT 2022-11-18: Dropped packages Pigengene and SCFA, which both were false positives here.

PeteHaitch commented 1 year ago

These should be okay to change in the BioC MatrixGenerics/sparseMatrixStats/DelayedMatrixStats stack because we try to mimic the matrixStats API as closely as possible. Do you have a rough ETA on this change? Only, I'm very busy for the rest of this month and unlikely to have time for any BioC package development/maintainence.

HenrikBengtsson commented 1 year ago

Thanks.

Regarding ETA: matrixStats 0.63.0 first needs to be on CRAN, then I'll make this move. FWIW, matrixStats 0.63.0 is currently in the CRAN incoming queue (unusually long this week), but I hope it's rolling out with 3-5 days.

Then, I need to get GenEst to update their package on CRAN too, before I can submit the useNames = FALSE version. If they don't respond within 2-3 weeks, I'll go ahead a push to CRAN anyway.

So, I think it'll be 3-4 weeks before you see the useNames = FALSE version, but I hope to get it out no later than mid-December.

HenrikBengtsson commented 1 year ago

@PeteHaitch, I'm just updated the develop branch to use useNames = FALSE by default. Hopefully, this helps you test towards this upcoming version. To install it, use:

remotes::install_github("HenrikBengtsson/matrixStats", ref = "develop")

BTW, I also deprecated useNames = NA, so it gives a deprecatedWarning. Don't think that'll affect anything on your end. The goal is to defunct useNames = NA soon after the useNames = FALSE release is out.

PS. matrixStats 0.63.0 is now on CRAN, but that was just a minor release (NEWS), which shouldn't affect you at all.

zeehio commented 1 year ago

Hi Henrik,

Just following our twitter thread: https://twitter.com/henrikbengtsson/status/1593674832448614400?s=20&t=i3VT8eg5IYydevMC8gFfww

In spectrometry one typically can have a matrix with one spectra per row. The "mean spectra" is computed with colMeans(). Even if I have used R for more than a decade, I still doubt if rowMeans() returns a vector with "the mean of each row" or a vector with "the average row". I find embarrassing to admit that I often have to check it.

I've seen hidden bugs due to confusing rowMeans() with colMeans() or other similar row/col functions. Some are mine, some are in other people's code, possibly due to that same confusion I have. Those bugs would have been avoided if proper names had been set (and preserved) in the calculations. That's why I usually tell people to name things and check the names. Especially when teaching/learning, having good names by default helps students a lot.

I understand the performance impact of keeping track of names everywhere and I understand the goals of matrixStats. I'd say getting good performance comes after getting the right calculations, and it's easier to get things right first and then ignore names to make it faster, than get things fast but wrong and not having names to figure out why.

Knowing your HPC work, I fully understand that performance is a priority to you and to this package. I wish R was faster in keeping track of names so both goals would not conflict. So I will keep telling everyone to useNames=TRUE and explain its performance impact, as I once told them to use stringsAsFactors=FALSE, and I will wait for the day the performance impact is so low that useNames=FALSE doesn't pay off. (Submitting a patch to R fixing the cost would be great but I guess it's not that easy or it would have already been done).

Thanks for all the discussion and explanation,

Sergio

PeteHaitch commented 1 year ago

@HenrikBengtsson I'm a bit confused between #226 and #227. Is the plan for the next matrixStats release to make useNames = FALSE or useNames = TRUE for all functions?

Do you have an ETA for when you'd like to make that release? FWIW I'm on summer break until January 10 and will be away from a computer most of that time.

HenrikBengtsson commented 1 year ago

@HenrikBengtsson I'm a bit confused between #226 and #227. Is the plan for the next matrixStats release to make useNames = FALSE or useNames = TRUE for all functions?

Yeah, you're in the rights to be confused. I created Issue #227 (useNames = TRUE) to see what would happen if that became the default, i.e. what packages would break with that default. Regardless what the default would be, it's worth reaching out to their maintainers to set useNames = FALSE (sic!) where they make that assumption. This would make their code more robust and remove part of the equation for us. I made a call for help to reach out to them there. So, I'm still procrastinating on the final decision, but want to clear the road ahead when the final decision is made.

What would be really useful is to have some serious benchmarking arguments from both sides, e.g.

library(matrixStats)
Xs <- lapply(1:10000, FUN = function(ii) {
  X <- matrix(rnorm(n = 16L), nrow = 4L, ncol = 4L)
  rownames(X) <- sample(letters, size = 4L)
  colnames(X) <- sample(LETTERS, size = 4L)
  X
})

stats <- bench::mark(lapply(Xs, FUN = rowSums2, useNames = FALSE), lapply(Xs, FUN = rowSums2, useNames = TRUE), check = FALSE, min_iterations = 100L)
print(stats)
## # A tibble: 2 × 13
##   expression                                        min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time ## result memory               time             gc      
##   <bch:expr>                                   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>               <list>           <list>  
## 1 lapply(Xs, FUN = rowSums2, useNames = FALSE)   32.9ms   34.2ms      28.8    78.2KB     25.5    53    47      1.84s <NULL> <Rprofmem [26 × 3]>  <bench_tm [100]> <tibble>
## 2 lapply(Xs, FUN = rowSums2, useNames = TRUE)    35.9ms   39.2ms      23.6    78.2KB     25.6    48    52      2.03s <NULL> <Rprofmem [102 × 3]> <bench_tm [100]> <tibble>

The more such use cases, the better. This would provide things to discuss.

Do you have an ETA for when you'd like to make that release? FWIW I'm on summer break until January 10 and will be away from a computer most of that time.

I won't make any move on this before the end of January at the earliest. It would be good to settle on this in good time before the next Bioconductor release. But importantly, make sure to enjoy your break and don't worry about this.

PeteHaitch commented 1 year ago

I won't make any move on this before the end of January at the earliest. It would be good to settle on this in good time before the next Bioconductor release. But importantly, make sure to enjoy your break and don't worry about this.

Thanks, Henrik, I hope you have an enjoyable break over the Christmas and New Year.

const-ae commented 1 year ago

Hey, I created a new branch in sparseMatrixStats and MatrixGenerics called 'change_useNames_default', which implement useNames = FALSE. It is trivial to change the default using a simple search and replace.

So if you decide to go ahead with the change, we can just merge the branch to avoid disruption to the end users :)

HenrikBengtsson commented 1 year ago

Closing. See Issue #232 .