Closed HenrikBengtsson closed 1 year ago
Four packages break when switching to useNames = FALSE
; one package is on CRAN and three on Bioconductor, cf. https://github.com/HenrikBengtsson/matrixStats/blob/dbd47cd722b42c538946173945740c8a8b9b130c/revdep/README.md.
CRAN:
useNames = TRUE
for two rowQuantiles()
)Bioconductor:
Importantly, I have not run recursive revdep checks on the above Bioconductor package, so more things might break there, when those packages are update too.
EDIT 2022-11-18: Dropped packages Pigengene and SCFA, which both were false positives here.
These should be okay to change in the BioC MatrixGenerics/sparseMatrixStats/DelayedMatrixStats stack because we try to mimic the matrixStats API as closely as possible. Do you have a rough ETA on this change? Only, I'm very busy for the rest of this month and unlikely to have time for any BioC package development/maintainence.
Thanks.
Regarding ETA: matrixStats 0.63.0 first needs to be on CRAN, then I'll make this move. FWIW, matrixStats 0.63.0 is currently in the CRAN incoming queue (unusually long this week), but I hope it's rolling out with 3-5 days.
Then, I need to get GenEst to update their package on CRAN too, before I can submit the useNames = FALSE
version. If they don't respond within 2-3 weeks, I'll go ahead a push to CRAN anyway.
So, I think it'll be 3-4 weeks before you see the useNames = FALSE
version, but I hope to get it out no later than mid-December.
@PeteHaitch, I'm just updated the develop branch to use useNames = FALSE
by default. Hopefully, this helps you test towards this upcoming version. To install it, use:
remotes::install_github("HenrikBengtsson/matrixStats", ref = "develop")
BTW, I also deprecated useNames = NA
, so it gives a deprecatedWarning
. Don't think that'll affect anything on your end. The goal is to defunct useNames = NA
soon after the useNames = FALSE
release is out.
PS. matrixStats 0.63.0 is now on CRAN, but that was just a minor release (NEWS), which shouldn't affect you at all.
Hi Henrik,
Just following our twitter thread: https://twitter.com/henrikbengtsson/status/1593674832448614400?s=20&t=i3VT8eg5IYydevMC8gFfww
In spectrometry one typically can have a matrix with one spectra per row. The "mean spectra" is computed with colMeans(). Even if I have used R for more than a decade, I still doubt if rowMeans()
returns a vector with "the mean of each row" or a vector with "the average row". I find embarrassing to admit that I often have to check it.
I've seen hidden bugs due to confusing rowMeans()
with colMeans()
or other similar row/col functions. Some are mine, some are in other people's code, possibly due to that same confusion I have. Those bugs would have been avoided if proper names had been set (and preserved) in the calculations. That's why I usually tell people to name things and check the names. Especially when teaching/learning, having good names by default helps students a lot.
I understand the performance impact of keeping track of names everywhere and I understand the goals of matrixStats. I'd say getting good performance comes after getting the right calculations, and it's easier to get things right first and then ignore names to make it faster, than get things fast but wrong and not having names to figure out why.
Knowing your HPC work, I fully understand that performance is a priority to you and to this package. I wish R was faster in keeping track of names so both goals would not conflict. So I will keep telling everyone to useNames=TRUE
and explain its performance impact, as I once told them to use stringsAsFactors=FALSE
, and I will wait for the day the performance impact is so low that useNames=FALSE
doesn't pay off. (Submitting a patch to R fixing the cost would be great but I guess it's not that easy or it would have already been done).
Thanks for all the discussion and explanation,
Sergio
@HenrikBengtsson I'm a bit confused between #226 and #227.
Is the plan for the next matrixStats release to make useNames = FALSE
or useNames = TRUE
for all functions?
Do you have an ETA for when you'd like to make that release? FWIW I'm on summer break until January 10 and will be away from a computer most of that time.
@HenrikBengtsson I'm a bit confused between #226 and #227. Is the plan for the next matrixStats release to make
useNames = FALSE
oruseNames = TRUE
for all functions?
Yeah, you're in the rights to be confused. I created Issue #227 (useNames = TRUE
) to see what would happen if that became the default, i.e. what packages would break with that default. Regardless what the default would be, it's worth reaching out to their maintainers to set useNames = FALSE
(sic!) where they make that assumption. This would make their code more robust and remove part of the equation for us. I made a call for help to reach out to them there. So, I'm still procrastinating on the final decision, but want to clear the road ahead when the final decision is made.
What would be really useful is to have some serious benchmarking arguments from both sides, e.g.
library(matrixStats)
Xs <- lapply(1:10000, FUN = function(ii) {
X <- matrix(rnorm(n = 16L), nrow = 4L, ncol = 4L)
rownames(X) <- sample(letters, size = 4L)
colnames(X) <- sample(LETTERS, size = 4L)
X
})
stats <- bench::mark(lapply(Xs, FUN = rowSums2, useNames = FALSE), lapply(Xs, FUN = rowSums2, useNames = TRUE), check = FALSE, min_iterations = 100L)
print(stats)
## # A tibble: 2 × 13
## expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time ## result memory time gc
## <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
## 1 lapply(Xs, FUN = rowSums2, useNames = FALSE) 32.9ms 34.2ms 28.8 78.2KB 25.5 53 47 1.84s <NULL> <Rprofmem [26 × 3]> <bench_tm [100]> <tibble>
## 2 lapply(Xs, FUN = rowSums2, useNames = TRUE) 35.9ms 39.2ms 23.6 78.2KB 25.6 48 52 2.03s <NULL> <Rprofmem [102 × 3]> <bench_tm [100]> <tibble>
The more such use cases, the better. This would provide things to discuss.
Do you have an ETA for when you'd like to make that release? FWIW I'm on summer break until January 10 and will be away from a computer most of that time.
I won't make any move on this before the end of January at the earliest. It would be good to settle on this in good time before the next Bioconductor release. But importantly, make sure to enjoy your break and don't worry about this.
I won't make any move on this before the end of January at the earliest. It would be good to settle on this in good time before the next Bioconductor release. But importantly, make sure to enjoy your break and don't worry about this.
Thanks, Henrik, I hope you have an enjoyable break over the Christmas and New Year.
Hey, I created a new branch in sparseMatrixStats
and MatrixGenerics
called 'change_useNames_default', which implement useNames = FALSE
. It is trivial to change the default using a simple search and replace.
So if you decide to go ahead with the change, we can just merge the branch to avoid disruption to the end users :)
Closing. See Issue #232 .
I'm going to make a decision on what the default for
useNames
will be. Based on the newly added Design Goals:I'll make the new default
useNames = FALSE
.Actions
useNames = FALSE
the default everywhereuseNames = NA
useNames = FALSE
on all 420+ CRAN and Bioconductor packages to identify which packages breaks and if so, which functions are involveduseNames = TRUE
, where needed