HenrikBengtsson / matrixStats

R package: Methods that Apply to Rows and Columns of Matrices (and to Vectors)
https://cran.r-project.org/package=matrixStats
206 stars 34 forks source link

New 'useNames = TRUE' implementation distinguishes matrix without dimnames attribute and 'dimnames = list(NULL, NULL)' #234

Closed const-ae closed 1 year ago

const-ae commented 1 year ago

Continuing the discussion from #232.

With the new implementation of useNames = TRUE, matrixStats::colCumsums and other functions that return a matrix, treat a matrix without a dimnames attribute and dimnames = list(NULL, NULL) differently (v0.63.0-9008).

mat_names <- matrix(nrow = 5, ncol = 1, dimnames = list(NULL, NULL))
mat_no_names <- matrix(nrow = 5, ncol = 1)

# Matrices with NULL names and no names differ in R
str(mat_names)
#>  logi [1:5, 1] NA NA NA NA NA
#>  - attr(*, "dimnames")=List of 2
#>   ..$ : NULL
#>   ..$ : NULL
str(mat_no_names)
#>  logi [1:5, 1] NA NA NA NA NA

str(matrixStats::colCumsums(mat_names))  
#>  int [1:5, 1] NA NA NA NA NA
#>  - attr(*, "dimnames")=List of 2
#>   ..$ : NULL
#>   ..$ : NULL
str(matrixStats::colCumsums(mat_no_names))
#>  int [1:5, 1] NA NA NA NA NA

Created on 2023-05-30 with reprex v2.0.2

Previously, they were treated the same (v0.63.0):

mat_names <- matrix(nrow = 5, ncol = 1, dimnames = list(NULL, NULL))
mat_no_names <- matrix(nrow = 5, ncol = 1)

# Matrices with NULL names and no names differ in R
str(mat_names)
#>  logi [1:5, 1] NA NA NA NA NA
#>  - attr(*, "dimnames")=List of 2
#>   ..$ : NULL
#>   ..$ : NULL
str(mat_no_names)
#>  logi [1:5, 1] NA NA NA NA NA

str(matrixStats::colCumsums(mat_names))  
#>  int [1:5, 1] NA NA NA NA NA
str(matrixStats::colCumsums(mat_no_names))
#>  int [1:5, 1] NA NA NA NA NA

Created on 2023-05-30 with reprex v2.0.2


From the sparse matrix point-of-view, the old behavior was preferable because dgCMatrix treats matrices without a dimnames attribute and dimnames = list(NULL, NULL) the same.

HenrikBengtsson commented 1 year ago

Thanks again.

First observation

There's nothing that changed in the behavior of useNames = TRUE since matrixStats 0.63.0;

library(matrixStats)
stopifnot(packageVersion("matrixStats") == "0.63.0")
X <- matrix(1:5, ncol = 1L)
Xs <- list(
  none  = X,
  null  = structure(X, dimnames = list(NULL, NULL))
)

ys <- lapply(Xs, FUN = colCumsums, useNames = TRUE)
str(ys)

gives

List of 2
 $ none: int [1:5, 1] 1 3 6 10 15
 $ null: int [1:5, 1] 1 3 6 10 15
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : NULL

What changed is that we're moving from useNames = NA to useNames = TRUE, by default (#232).

Second observation

Base R treats empty dimnames the same as no dimnames, e.g.

> ys <- lapply(Xs, FUN = colSums)
> str(ys)
List of 2
 $ none: num 15
 $ null: num 15

ys <- lapply(Xs, FUN = colMeans)
> str(ys)
List of 2
 $ none: num 3
 $ null: num 3

ys <- lapply(Xs, FUN = apply, MARGIN = 2L, sum)
> str(ys)
List of 2
 $ none: int 15
 $ null: int 15

ys <- lapply(Xs, FUN = apply, MARGIN = 2L, mean)
> str(ys)
List of 2
 $ none: num 3
 $ null: num 3

Thus, this is how matrixStats should also do it.

Third observation

matrixStats behaves like base R, except for the cumulative functions, e.g. all of the following returns unnamed results:

ys <- lapply(Xs, FUN = colMeans2, useNames = TRUE)
ys <- lapply(Xs, FUN = colMedians, useNames = TRUE)
ys <- lapply(Xs, FUN = colMads, useNames = TRUE)
ys <- lapply(Xs, FUN = colVars, useNames = TRUE)
ys <- lapply(Xs, FUN = colIQRs, useNames = TRUE)
ys <- lapply(Xs, FUN = colLogSumExps, useNames = TRUE)
ys <- lapply(Xs, FUN = colWeightedMeans, useNames = TRUE)
ys <- lapply(Xs, FUN = colWeightedMedians, useNames = TRUE)

However, the following, cumulative functions treats the two cases differently:

ys <- lapply(Xs, FUN = colCumsums, useNames = TRUE)
ys <- lapply(Xs, FUN = colCumprods, useNames = TRUE)
ys <- lapply(Xs, FUN = colCummins, useNames = TRUE)
ys <- lapply(Xs, FUN = colCummaxs, useNames = TRUE)

Fourth observation

BTW, setting dimnames to character(0L) is the same as setting them to NULL in R, e.g.

> str(X)
 int [1:5, 1] 1 2 3 4 5
> rownames(X) <- character(0)
> str(X)
 int [1:5, 1] 1 2 3 4 5
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : NULL
> rownames(X) <- NULL
> str(X)
 int [1:5, 1] 1 2 3 4 5
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : NULL

In other words, we don't have to treat character(0) specially; it's taken care of by R itself.

Action

HenrikBengtsson commented 1 year ago

Correction

There are more functions:

ys <- lapply(Xs, FUN = colRanges, useNames = TRUE)
ys <- lapply(Xs, FUN = colRanks, useNames = TRUE)

also preserves NULL dimnames.

I think what they have in common is that they call an internal C function setDimnames();

$ grep -F "setDimnames(" src/*.c
src/colRanges.c:            setDimnames(ans, dimnames, ncols, ccols, 0, crows, TRUE);
src/colRanges.c:            setDimnames(ans, dimnames, ncols, ccols, 0, crows, TRUE);
src/naming.c:void setDimnames(SEXP mat/*Answer matrix*/, SEXP dimnames, R_xlen_t nrows,
src/rowCummaxs.c:        setDimnames(ans, dimnames, nrows, crows, ncols, ccols, FALSE);
src/rowCummaxs.c:        setDimnames(ans, dimnames, nrows, crows, ncols, ccols, FALSE);
src/rowCummins.c:        setDimnames(ans, dimnames, nrows, crows, ncols, ccols, FALSE);
src/rowCummins.c:        setDimnames(ans, dimnames, nrows, crows, ncols, ccols, FALSE);
src/rowCumprods.c:        setDimnames(ans, dimnames, nrows, crows, ncols, ccols, FALSE);
src/rowCumprods.c:        setDimnames(ans, dimnames, nrows, crows, ncols, ccols, FALSE);
src/rowCumsums.c:        setDimnames(ans, dimnames, nrows, crows, ncols, ccols, FALSE);
src/rowCumsums.c:        setDimnames(ans, dimnames, nrows, crows, ncols, ccols, FALSE);
src/rowRanges.c:            setDimnames(ans, dimnames, nrows, crows, 0, ccols, FALSE);
src/rowRanges.c:            setDimnames(ans, dimnames, nrows, crows, 0, ccols, FALSE);
src/rowRanksWithTies.c:              setDimnames(ans, dimnames, nrows, crows, ncols, ccols, FALSE);
src/rowRanksWithTies.c:              setDimnames(ans, dimnames, nrows, crows, ncols, ccols, FALSE);
src/rowRanksWithTies.c:              setDimnames(ans, dimnames, nrows, crows, ncols, ccols, FALSE);
src/rowRanksWithTies.c:              setDimnames(ans, dimnames, nrows, crows, ncols, ccols, FALSE);
src/rowRanksWithTies.c:              setDimnames(ans, dimnames, nrows, crows, ncols, ccols, FALSE);
src/rowRanksWithTies.c:              setDimnames(ans, dimnames, nrows, crows, ncols, ccols, FALSE);
src/rowRanksWithTies.c:              setDimnames(ans, dimnames, nrows, crows, ncols, ccols, FALSE);
src/rowRanksWithTies.c:              setDimnames(ans, dimnames, nrows, crows, ncols, ccols, FALSE);
src/rowRanksWithTies.c:              setDimnames(ans, dimnames, nrows, crows, ncols, ccols, FALSE);
src/rowRanksWithTies.c:              setDimnames(ans, dimnames, nrows, crows, ncols, ccols, FALSE);
src/rowRanksWithTies.c:              setDimnames(ans, dimnames, nrows, crows, ncols, ccols, FALSE);
src/rowRanksWithTies.c:              setDimnames(ans, dimnames, nrows, crows, ncols, ccols, FALSE);
src/rowRanksWithTies.c:              setDimnames(ans, dimnames, nrows, crows, ncols, ccols, FALSE);
src/rowRanksWithTies.c:              setDimnames(ans, dimnames, nrows, crows, ncols, ccols, FALSE);
src/rowRanksWithTies.c:              setDimnames(ans, dimnames, nrows, crows, ncols, ccols, FALSE);
src/rowRanksWithTies.c:              setDimnames(ans, dimnames, nrows, crows, ncols, ccols, FALSE);
src/rowRanksWithTies.c:              setDimnames(ans, dimnames, nrows, crows, ncols, ccols, FALSE);
src/rowRanksWithTies.c:              setDimnames(ans, dimnames, nrows, crows, ncols, ccols, FALSE);
src/rowRanksWithTies.c:              setDimnames(ans, dimnames, nrows, crows, ncols, ccols, FALSE);
src/rowRanksWithTies.c:              setDimnames(ans, dimnames, nrows, crows, ncols, ccols, FALSE);
src/rowRanksWithTies.c:              setDimnames(ans, dimnames, nrows, crows, ncols, ccols, FALSE);
src/rowRanksWithTies.c:              setDimnames(ans, dimnames, nrows, crows, ncols, ccols, FALSE);
src/rowRanksWithTies.c:              setDimnames(ans, dimnames, nrows, crows, ncols, ccols, FALSE);
src/rowRanksWithTies.c:              setDimnames(ans, dimnames, nrows, crows, ncols, ccols, FALSE);
src/rowRanksWithTies.c:              setDimnames(ans, dimnames, nrows, crows, ncols, ccols, FALSE);
src/rowRanksWithTies.c:              setDimnames(ans, dimnames, nrows, crows, ncols, ccols, FALSE);
src/rowRanksWithTies.c:              setDimnames(ans, dimnames, nrows, crows, ncols, ccols, FALSE);
src/rowRanksWithTies.c:              setDimnames(ans, dimnames, nrows, crows, ncols, ccols, FALSE);

which means we can probably fix this in a single location. I'll have a look.

HenrikBengtsson commented 1 year ago

@yaccos , could you please have a lock at

https://github.com/HenrikBengtsson/matrixStats/blob/7a45f8515a968e52bd92f08e32a14acc97f8b2aa/src/naming.c#L61-L117

and see if the problem reported here, can be fixed there?

yaccos commented 1 year ago

Yes, you can. If you add the special case test:


/* In case both elements of the dimname is NULL, we disregard the name 
attribute completely in order to conform to base R behavior */
if (VECTOR_ELT(dimnames, 0) == R_NilValue && VECTOR_ELT(dimnames, 1) == R_NilValue) {
    return;
}

at the top of the function, you should get the expected behavior. Besides, it annoyes me that many of the calls to setDimnames is duplicated even if there is no good reason for it. The worst offender here is rowRanksWithTies where the code for generating the answer is split across a multitude of cases, but the naming would be achieved the same way for all cases if it was put to the end of the function. I apologize for not being vigilent enough to discover this redundancy during the GSOC project. https://github.com/HenrikBengtsson/matrixStats/blob/7a45f8515a968e52bd92f08e32a14acc97f8b2aa/src/rowRanksWithTies.c#L68-L401

HenrikBengtsson commented 1 year ago

Thanks for the prompt fix @yaccos ! I've added it to matrixStats 0.63.0-9010.

... the naming would be achieved the same way for all cases if it was put to the end of the function. I apologize for not being vigilent enough to discover this redundancy during the GSOC project.

If you've got time, feel free to do a PR for this. And absolutely no worries; it's really hard to be on top of everything at all time, and can take years and a fresh mind to sometimes spot even the most obvious things.

const-ae commented 1 year ago

Thanks for the quick fix and sorry for the poor description.


There is one more function (colDiffs) that returns empty dimnames:

mat_empty_names <- matrix(1, nrow = 5, ncol = 1, dimnames = list(NULL, NULL))
str(matrixStats::colCumsums(mat_empty_names))
#>  num [1:5, 1] 1 2 3 4 5
str(matrixStats::colDiffs(mat_empty_names))
#>  num [1:4, 1] 0 0 0 0
#>  - attr(*, "dimnames")=List of 2
#>   ..$ : NULL
#>   ..$ : NULL

Created on 2023-06-01 with reprex v2.0.2

yaccos commented 1 year ago

Thanks for the quick fix and sorry for the poor description.

There is one more function (colDiffs) that returns empty dimnames:

mat_empty_names <- matrix(1, nrow = 5, ncol = 1, dimnames = list(NULL, NULL))
str(matrixStats::colCumsums(mat_empty_names))
#>  num [1:5, 1] 1 2 3 4 5
str(matrixStats::colDiffs(mat_empty_names))
#>  num [1:4, 1] 0 0 0 0
#>  - attr(*, "dimnames")=List of 2
#>   ..$ : NULL
#>   ..$ : NULL

Created on 2023-06-01 with reprex v2.0.2

colDiffs() and rowDiffs() have their own respective C-level functions for naming (https://github.com/HenrikBengtsson/matrixStats/blob/fa94b84cee51b619f10f29bfc48ee8a2b0ac7780/src/naming.c#L128 and https://github.com/HenrikBengtsson/matrixStats/blob/fa94b84cee51b619f10f29bfc48ee8a2b0ac7780/src/naming.c#L193). This means that the same fix must be applied to these locations as well.

HenrikBengtsson commented 1 year ago

Thanks both. Fixed in matrixStats 0.63.0-9011.