Bioconductor / MatrixGenerics

S4 Generic Summary Statistic Functions that Operate on Matrix-Like Objects
https://bioconductor.org/packages/MatrixGenerics
12 stars 1 forks source link

'useNames = TRUE' is the new default in matrixStats v1.0.0 #31

Closed const-ae closed 1 year ago

const-ae commented 1 year ago

matrixStats v1.0.0 was just published on CRAN (https://github.com/HenrikBengtsson/matrixStats/issues/232). The main change is that the old default useNames = NA is now deprecated and the new default useNames = TRUE.

I wonder if we should patch the current release version to use useNames = FALSE because that avoids deprecation warnings and won't change the behavior all of a sudden. And then with the next Bioconductor release change the default to useNames = TRUE

hpages commented 1 year ago

the new default useNames = TRUE

The new behaviour is more inline with colSums() and colMeans() propagating the names in base R. Never understood the purpose of the useNames=NA thing...

I wonder if we should patch the current release version to use useNames = FALSE ...

I guess that would work.

I'd be more confortable though if we didn't touch useNames=NA in the generics and only made the following simple change in the methods:

.matrixStats_rowVars <- function(x, rows=NULL, cols=NULL, na.rm=FALSE, center=NULL, dim.=dim(x), ...,
                                    useNames=NA)  # <-- keep this
{
  matrixStats::rowVars(x, rows=rows, cols=cols, na.rm=na.rm, center=center, dim.=dim., ...,
                          useNames=isTRUE(useNames))  # <-- proposed change
}

That way the argument list of the generics doesn't change and remains in sync with the argument list of all the methods in sparseMatrixStats, SparseArray, DelayedArray, and DelayedMatrixStats.

H.

const-ae commented 1 year ago

I'd be more confortable though if we didn't touch useNames=NA in the generics and only made the following simple change in the methods: matrixStats::rowVars(x, rows=rows, cols=cols, na.rm=na.rm, center=center, dim.=dim., ..., useNames=isTRUE(useNames)) # <-- proposed change

I like that idea.


I have already started to implement the new default in sparseMatrixStats (https://github.com/const-ae/sparseMatrixStats/commit/5441917d2e8e158e86ff8dbb75f58d061440b2bc), but then I would only merge this in the devel branch.

HenrikBengtsson commented 1 year ago

Never understood the purpose of the useNames=NA thing...

In the beginning, matrixStats didn't control for names attributes, so we got whatever the implementation happened to output (yet there were packages depending on it). In 2021, we ran a GSoC project for adding support for useNames = FALSE and useNames = TRUE. We decided to keep the old, undefined behavior via useNames = NA until we settled on what the new default should be, which has now been done. This will allow us to defunct useNames = NA in a near future.

There is no need for you to keep supporting useNames = NA. Actually, the sooner you can drop support for that, the better; it'll probably help speeding up its removal from matrixStats too.

hpages commented 1 year ago

I see. That means that in order to prevent disruption in release, we can't just blindly replace useNames=useNames with useNames=isTRUE(useNames) everywhere like I was proposing we do for matrixStats::rowVars above. We'll need to figure out what the method was doing for useNames=NA, and replace with useNames=isTRUE(useNames) if it was not propagating the names, and with useNames=!isFALSE(useNames) if it was propagating the names.

hpages commented 1 year ago

@const-ae I will give this a try unless you already started to work on it?

const-ae commented 1 year ago

Go ahead, I am currently just barely staying afloat as I am finishing up my PhD thesis 😅

hpages commented 1 year ago

Exciting times! I suppose you don't need this kind of distraction to keep you entertained :wink:

hpages commented 1 year ago

ok so it looks like most functions did NOT propagate the names in matrixStats 0.63.0

Keeping track of this here for reference:

### With matrixStats 0.63.0 (useNames=NA)

library(matrixStats)
stopifnot(packageVersion("matrixStats") == '0.63.0')

# 36 row* + 36 col* generics in MatrixGenerics.
# All of them have the 'useNames' arg, except rowAvgsPerColSet() and
# colAvgsPerRowSet().
# Also MatrixGenerics does not define generics for the *AnyMissings() functions.
all_matrixStats_symbols <- ls('package:matrixStats')
rowfuns <- setdiff(grep('^row', all_matrixStats_symbols, value=TRUE),
                   c("rowAvgsPerColSet", "rowAnyMissings"))
colfuns <- setdiff(grep('^col', all_matrixStats_symbols, value=TRUE),
                   c("colAvgsPerRowSet", "colAnyMissings"))

### Figuring out whether they propagate the names or not:

m <- matrix(11:22, ncol=3, dimnames=list(letters[1:4], LETTERS[1:3]))

rowfun_propagates_the_names <- function(fun, useNames=NA)
{
    FUN <- match.fun(fun)
    if (fun == "rowCollapse") {
        idxs <- sample(ncol(m), nrow(m), replace=TRUE)
        res <- FUN(m, idxs=idxs, useNames=useNames)
    } else if (fun == "rowOrderStats") {
        res <- FUN(m, which=1L, useNames=useNames)
    } else {
        res <- FUN(m, useNames=useNames)
    }
    names_on_res <- if (is.matrix(res)) rownames(res) else names(res)
    identical(names_on_res, rownames(m))
}

colfun_propagates_the_names <- function(fun, useNames=NA)
{   
    FUN <- match.fun(fun)
    if (fun == "colCollapse") { 
        idxs <- sample(nrow(m), ncol(m), replace=TRUE)
        res <- FUN(m, idxs=idxs, useNames=useNames)
    } else if (fun == "colOrderStats") {
        res <- FUN(m, which=1L, useNames=useNames)
    } else {
        res <- FUN(m, useNames=useNames)
    }
    if (fun %in% c("colQuantiles", "colRanges", "colRanks", "colTabulates")) {
        names_on_res <- rownames(res)
    } else {
        names_on_res <- if (is.matrix(res)) colnames(res) else names(res)
    }
    identical(names_on_res, colnames(m))
}

yes_or_no <- sapply(rowfuns, rowfun_propagates_the_names)
table(yes_or_no)
# FALSE  TRUE 
#    25    10 

### row* functions that propagate the names in matrixStats 0.63.0:
rowfuns[yes_or_no]
# [1] "rowIQRDiffs"      "rowLogSumExps"    "rowMadDiffs"      "rowQuantiles"    
# [5] "rowSdDiffs"       "rowVarDiffs"      "rowWeightedMads"  "rowWeightedMeans"
# [9] "rowWeightedSds"   "rowWeightedVars" 

yes_or_no <- sapply(colfuns, colfun_propagates_the_names)
table(yes_or_no)
# FALSE  TRUE 
#    25    10

### col* functions that propagate the names in matrixStats 0.63.0:
colfuns[yes_or_no]
# [1] "colIQRDiffs"      "colLogSumExps"    "colMadDiffs"      "colQuantiles"    
# [5] "colSdDiffs"       "colVarDiffs"      "colWeightedMads"  "colWeightedMeans"
# [9] "colWeightedSds"   "colWeightedVars" 
hpages commented 1 year ago

@PeteHaitch @const-ae If you guys want to take a look at this before I merge: #32 This is a PR against the RELEASE_3_17 branch.

I'll start working on a PR against the devel branch today. If you're ok with this, my plan is to just switch the useNames default used in the argument lists of the generics from NA to TRUE, like in matrixStats. These changes will actually be simpler than the changes in release. However, unlike the latter, they won't be backward compatible.

hpages commented 1 year ago

And here is the proposed changes for the devel version: #33

PeteHaitch commented 1 year ago

MatrixGenerics (along with DelayedMatrixStats and sparseMatrixStats) has now been updated in both the release (3.17) and devel (3.18) branches of Bioconductor to handle this change.