HenrikBengtsson / matrixStats

R package: Methods that Apply to Rows and Columns of Matrices (and to Vectors)
https://cran.r-project.org/package=matrixStats
202 stars 33 forks source link

Default 'ties.method' differs between rowRanks() and rank() #142

Open pneuvial opened 5 years ago

pneuvial commented 5 years ago

By default rank() uses the 'average' method to break ties, while rowRanks() uses the 'max' method:

> args(rank)
function (x, na.last = TRUE, ties.method = c("average", "first",
    "last", "random", "max", "min"))
> args(matrixStats::rowRanks)
function (x, rows = NULL, cols = NULL, ties.method = c("max",
    "average", "min"), dim. = dim(x), ...)

Shouldn't the default argument be the same in both functions?

HenrikBengtsson commented 5 years ago

Good point. Hmm... could be a mistake or it could be that rank() has been updated since rowRanks() was introduced.

HenrikBengtsson commented 5 years ago

Nah... the reason is probably that it was only ties.method = "max" that was supported in the early days:

Version: 0.7.0 [2013-01-14]

NEW FEATURES:

Version: 0.6.4 [2013-01-13]

NEW FEATURES:

Version: 0.4.0 [2011-11-11]

[...]

NEW FEATURES:

pneuvial commented 5 years ago

In any case I'm glad that ties.method = "average" is also supported: it's the one used in wilcox.test(), so I've been able to make a rowWilcoxonTests() function using rowRanks(ties.method = "average").

HenrikBengtsson commented 5 years ago

Cool. Is your rowWilcoxonTests() available online?

karoliskoncevicius commented 5 years ago

Subscribed to notifications of matrixStats and found this. Just wanted to add that I also used rowRanks a few times when implementing very similar things - row_kruskalwallis()

So this is mainly a praise of matrixStats - it is extremely useful for speeding up multiple statistical tests on genomic-like datasets.

And by coincidence "row_wilcoxon" is next on the list of things that will get included in that package.

pneuvial commented 5 years ago

Interesting! I did not know about the matrixTests package, it's very nice!

My rowWilcoxonTests() is available in the sansSouci package: https://github.com/pneuvial/sanssouci/blob/develop/R/rowWilcoxonTests.R

Of course it would make more sense to have this or a version of it in a package specifically dedicated to tests on the rows of a matrix!

NB: as noted in the Details of the man page:

The p-values are computed using the normal approximation as described in the \code{\link{wilcox.test}} function. The exact p-values (which can be useful for small samples with no ties) are not implemented (yet).

FYI Karolis: I also have rowWelchTests() and rowBinomialTests() in the same package.

karoliskoncevicius commented 5 years ago

Thank you a lot for the sharing the reference to your package Pierre! Will definitely take a look.

From time to time I search for packages that implement similar things and everything I find I add to the references in the README under "See Also" section. Added yours now too. I probably missed it because it wasn't on CRAN nor BioConductor.

Also think yours is the first time I see Wilcoxon Test implemented in such a way. Most others typically implement fast versions of t-tests and are done.

The p-values are computed using the normal approximation as described in the \code{\link{wilcox.test}} function. The exact p-values (which can be useful for small samples with no ties) are not implemented (yet).

Yes, this is the main thing that held me back from implementing Wilcoxon test in matrixTests. I try to maintain high fidelity with tests that are already in R, so all these approximations and when they are used sometimes takes time to get "right". Especially when they depend on sample size and there are different number of missing NA values in different rows.

bkmontgom commented 5 years ago

I submitted a PR that addresses this and adds "first", "last", "random", and "dense" ties.methods.

HenrikBengtsson commented 5 years ago

Thank again for adding this (PR #146).

We are going to need several release cycles to change the default for argument ties.methods without (silently) messing up the results for users/pipelines that rely on rowRanks(). This will probably involve something like:

  1. Release 1: If missing(ties.method) == TRUE, then produce an informative warning saying this argument needs to be set to ties.method = "max" to avoid the warning. Here we could possibly hijack .Deprecated() warnings, which with some luck will be reported by R CMD check (some updates have been made toward this direction).

  2. Release 2: If missing(ties.method) == TRUE, turn above warning into an error. Here we can also change the default to be ties.method = "average".

  3. Release 3: Allow for missing(ties.method) == TRUE again.

To speed up the above, I could run reverse dependency checks with missing(ties.method) == TRUE producing an error to catch obvious candidates and report to their maintainers.

HenrikBengtsson commented 3 years ago

I've scanned all 347 reverse dependencies on CRAN and Bioconductor 3.13 (R 4.1.0) for those that use colRanks() or rowRanks() and I identified the following packages:

Thus, as of 2021-06-22, there are 12 packages in total that rely on these matrixStats functions:

Bioconductor:

  1. clustifyr

  2. EventPointer

  3. fishpond

  4. MatrixGenerics

  5. MetaNeighbor

    • Code: ???
    • Result: ???
    • R_MATRIXSTATS_TIES_METHOD_MISSING=defunct => 'OK'
  6. reconsi

  7. singscore

  8. STROMA4

    • Code: R/random.ranks.R: matrixStats::rowRanks(datrank.rand.col)
    • Result: FAIL (does not specify ties.method),
    • R_MATRIXSTATS_TIES_METHOD_MISSING=defunct => R CMD check ERROR
  9. DelayedMatrixStats

  1. sparseMatrixStats

CRAN:

  1. HDSpatialScan
  2. matrixTests
  3. nparMD
  4. WGCNA

These are the packages that we must make sure don't make assumptions about the default ties.method value.

UPDATE: I guess my "scan" script produces false negatives, because I'd expect DelayMatrixStats and sparseMatrixStats to also show up above, but they don't.

UPDATE 2: It might be that DelayMatrixStats and sparseMatrixStats only imports from matrixStats in their package tests, which I did not scan.

HenrikBengtsson commented 3 months ago

I ran another scan for packages not specifying the ties.method argument and found the following.

CRAN:

Bioconductor:

This won't become defunct anytime soon, but I will make it deprecated (i.e. produce warnings) starting with R (>= 4.4.0) [2024-04-24] and therefore also Bioconductor (>= 3.19) [2024-05-01].

HenrikBengtsson commented 3 months ago

This won't become defunct anytime soon, but I will make it deprecated (i.e. produce warnings) starting with R (>= 4.4.0) [2024-04-24] and therefore also Bioconductor (>= 3.19) [2024-05-01].

This is implemented in matrixStats 1.3.0, which is now on CRAN.