Closed: LTLA closed this issue 5 years ago.
Hey Aaron,
drop is ignored when subsetting a DelayedArray object. See ?DelayedArray. I added a warning in DelayedArray 0.5.24 if you explicitly specify a non-FALSE value for drop.
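For concreteness, here is roughly what this looked like (a sketch assuming a 0.5.x release of DelayedArray; the behaviour was changed later in this thread):

```r
library(DelayedArray)
M <- DelayedArray(matrix(1:10, ncol = 2))
# In the 0.5.x series, drop= was ignored: both calls returned a
# mono-dimensional DelayedArray rather than an ordinary vector
# (0.5.24 and later also warn when a non-FALSE drop is given).
M[, 1]
M[, 1, drop = TRUE]
```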
H.
Oh! I didn't even see this. Is there a reason for ignoring drop=?
Good question. When you say that you'd have expected DelayedArray(blah)[ , 1, drop=TRUE] to return a vector in your original example, do you mean an ordinary vector or a mono-dimensional DelayedArray object? Doing the former wouldn't be good. I thought I had a good reason for not doing the latter either, but I can't remember it. Probably worth revisiting this. I'll let you know.
Actually, I would have expected the former, similar to how Matrix behaves:
library(Matrix)
a <- rsparsematrix(10, 20, 0.1)
a[,1] # gives an ordinary R vector
You could say that this is silly default behaviour, and definitely, it's one of the standard R tripwires for new users. But it's the default for all other R matrices, and a more experienced user would expect it to be there.
2 cents: I don't mind the option for x[, 1] to realise the result in memory, but I wouldn't like it as the default via drop = TRUE.
After a while, I started to understand why drop=TRUE was the default. In the majority of interactive analyses, I would be using the result of subsetting as a vector anyway (e.g., for plotting, to use in lm(), and so on), and drop=TRUE saves me from typing an extra as.vector() every time.
It only ever caused problems when I was writing functions that take arbitrary input for subsetting. In such cases, specifying an extra drop=FALSE is not particularly onerous given that I've written an entire function in the first place. Moreover, if we want to be able to write R functions that are agnostic to the matrix representation, then it becomes much more important that [ behaves consistently; some developers may depend on the expected default behaviour of [ for all other matrices (i.e., returning a vector).
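To illustrate the representation-agnostic point (col_summary below is a hypothetical helper, not part of any package): passing drop=FALSE explicitly, plus an explicit as.vector() where a vector is actually needed, keeps the function well-defined for any matrix-like input.

```r
# Hypothetical helper: works for any matrix-like x (base matrix,
# Matrix, DelayedArray) because it never relies on the default
# drop behaviour of the particular representation.
col_summary <- function(x, FUN = mean) {
  vapply(seq_len(ncol(x)), function(j) {
    col <- as.vector(x[, j, drop = FALSE])  # realize explicitly
    FUN(col)
  }, numeric(1))
}
```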
I can see the argument that it may simplify interfacing with existing code and that it ensures consistency with the ingrained (albeit arguably bad) default of drop = TRUE.
My hesitance is because I've been taking a very explicit approach to realization, namely "don't do it unless I really, really tell you to", and so I've been happy to use as.vector(), as.matrix(), as.array(), etc. when necessary.
Are we talking about DelayedMatrixStats here, or just in general usage?
For the former: I didn't realize there were instances where you were accessing individual rows/columns. I would have thought that you would be using a block processing strategy for all functions.
For the latter: I understand the frustration of waiting for realization if one does [ with drop=TRUE by default. But this feels like a cost that needs to be paid sooner or later for most uses of data from a single row/column.
Anyway, off the top of my head, I can think of at least a few pieces of code under my management that will break with DelayedMatrix inputs if drop=FALSE becomes the default. Most of these are plotting-related, i.e., pulling out a gene's expression to colour points by (in various scater functions, or in iSEE).
DelayedMatrixStats: Yes, block-processing rather than row- or column-wise processing is what I should be doing everywhere. I need to revisit this now that I have more experience and with recent changes to DelayedArray. An aside: I'm typically using row-blocks or column-blocks regardless of the chunking strategy of the underlying data, which is sub-optimal for data access but greatly simplifies implementing the methods.
Other stuff: Right now, I'm adding support for DelayedArray-backed objects to minfi. There I'm facing all sorts of data access patterns, including column-wise processing of ordinary matrices, since minfi has code written by a bunch of different people, all with very different coding styles.
FWIW, the situation with subsetting an array vs a DelayedArray is basically the same as the one the tibble developers faced with subsetting a data.frame vs. a tibble. There, the developers decided to make drop = FALSE the default, which of course has its own problems.
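The tibble comparison can be checked directly (assuming the tibble package is installed):

```r
library(tibble)
df <- data.frame(x = 1:3, y = letters[1:3])
df[, 1]             # ordinary vector: data.frame defaults to drop=TRUE
tb <- as_tibble(df)
tb[, 1]             # still a 1-column tibble: drop=FALSE is the default
```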
I've made that change: 5ed973c87a1d065281c32b609272b53c745868ae
I'm really hesitant to take the extra step of turning the returned object into an ordinary vector when it's mono-dimensional. I feel like automatically triggering a possibly costly realization will probably surprise most users. We could make an exception when the returned object is a mono-dimensional DelayedArray of length 1 though. What do you guys think?
From a developer perspective, realization to a vector would be the expected outcome, as I mentioned before. Failing to do so would cause at least a few incompatibilities with existing code. Easily fixed with a forced drop(), but this seems like unnecessary work if we want it to be a drop-in matrix substitute.
From a user perspective, I don't see much use for a mono-dimensional DelayedArray without realization in an interactive session. I mean, is anyone going to (routinely) take individual rows/columns, do something to them, and then... not use the result? That seems like a pretty rare use case, IMHO.
Moreover, the cost of realization of a single row/column is not too bad for an interactive session. In most cases, it should return quickly enough for the user to correct their mistake if they intended drop=FALSE. Unless it requires a read from an HDF5 file, but if the data's that big, they should have been more careful.
Besides, at a certain level of user experience (like mine, at least), the drop=TRUE behaviour is expected, so I was surprised when this did not happen. And for other users... well, they better learn quick, because that's how all other R matrices behave.
@hpages One oddity with the current implementation:
suppressPackageStartupMessages(library(DelayedArray))
x <- matrix(1:10, ncol = 2)
dim(x[, 1])
#> NULL
X <- DelayedArray(x)
dim(X[, 1])
#> [1] 5
@PeteHaitch That's expected. The result of this subsetting is a mono-dimensional DelayedArray. The one dimension of a mono-dimensional DelayedArray object is reported by dim(). I don't want dim() to return NULL on a DelayedArray object, otherwise it cannot be considered an array-like object anymore. We should introduce something like DelayedAtomicVector as an equivalent to a mono-dimensional DelayedArray object, except that this time dim() would return NULL. Then [ could return a DelayedAtomicVector instead of a mono-dimensional DelayedArray object. However, I'm not really keen on starting the DelayedAtomicVector adventure for now.
@LTLA Mono-dimensional DelayedArray objects are actually being used in the VariantExperiment package ( https://github.com/Bioconductor/VariantExperiment ). They are the building blocks of the DelayedDataFrame objects defined in this package. If we modify [ to realize the result when it's mono-dimensional, then subsetting a mono-dimensional DelayedArray would realize it every time (unless the end user remembers to use drop=FALSE).
@hpages Understood, thanks for the explanation.
@hpages It feels like specialist behaviour for this DelayedDataFrame class (i.e., not dropping dims) is a job for its own subset method. In fact, looking at DelayedDataFrame-class.R suggests that there's never actually any subsetting at all via [; it's all done with slot assignments to indices in extractROWS().
@LTLA My point is that VariantExperiment users are going to be exposed to a lot of mono-dimensional DelayedArray objects and interact with them. For example, if ve is a VariantExperiment object, they might do things like ve$batchid[ve$batchid == 1]. This could possibly return a very long mono-dimensional DelayedArray object. Not sure it would be a good thing to realize it by default.
I feel like it's a situation where we need to choose between two inconveniences. If [ realizes the 1D result by default, the inconvenience falls on the user (getting unwanted realizations, unless s/he remembers to use drop=FALSE). If [ doesn't realize it, the inconvenience falls on the developer (s/he needs to remember to call as.vector() on the result if s/he wants realization). For now I'm inclined to leave the inconvenience on the developer.
BTW, something like A[A == 1] triggers an error at the moment on a 1D DelayedArray. Need to fix that.
@hpages Okay. It seems that the long-term solution would be for VariantExperiment to switch to the proposed DelayedAtomicVector class, which is closer to how it really wants the data to be represented. This would then enable drop=TRUE in DelayedArray with the expected consequences.
@LTLA The DelayedAtomicVector class is definitely worth exploring. I'm adding it to the TODO list.
I fixed the A[A > 0] (or A[A > 0 & A*log(A) >= 5]) idiom. See a3b9837f9deb6f3d73e8a77faf65be7d7fe51bbf. This was not only broken on 1D DelayedArray objects: it was not supported on DelayedArray objects in general.
Working on this made me slightly change my mind about what should happen when doing A[A > 0] with a 1D object. Now it returns an ordinary vector (unless drop=FALSE). Yes, I know, I was arguing earlier that this would be inconvenient in the context of the VariantExperiment use case. But when A has more than 1 dimension, A[A > 0] is always an ordinary vector (drop is ignored). This is a strong argument for doing the same thing when A is 1D. I think I'm slowly changing my mind towards having things like M[5, ] also return an ordinary vector (unless drop=FALSE). But first I need to sleep ;-)
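A sketch of the new behaviour, assuming a DelayedArray build that includes the a3b9837 fix:

```r
library(DelayedArray)
A <- DelayedArray(array(c(-2, 1, 0, 3, 5)))  # 1D DelayedArray
A[A > 0]                # realizes to an ordinary numeric vector
A[A > 0, drop = FALSE]  # keeps the result as a 1D DelayedArray
```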
I can confirm that this inconsistency in behaviour is a real problem developer-side. We recently transitioned some of our iSEE instances to use HDF5Array (see https://marionilab.cruk.cam.ac.uk/iSEE_tcga/ for an example), but this was far from a drop-in replacement. We had to hunt down every instance where we extracted a row or column from a matrix-like structure and wrap it in as.vector(), to avoid a series of cryptic errors upon entering shiny. Even then, it took a few days of debugging to weed out all the problems.
One can only imagine the chaos that will ensue when DAs are used more generally in functions that do not explicitly consider and protect against this dimensionality-preserving behaviour.
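The defensive wrapping described above looks something like this (a minimal sketch, not code from iSEE itself):

```r
library(DelayedArray)
M <- DelayedArray(matrix(rnorm(20), ncol = 4))
# Force realization to an ordinary vector so that downstream code
# (plotting, lm(), shiny outputs) behaves the same whatever the
# matrix backend happens to be.
y <- as.vector(M[, 1])
stopifnot(is.numeric(y), is.null(dim(y)))
```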
OK let's do it!
Done in DelayedArray 0.7.44: https://github.com/Bioconductor/DelayedArray/commit/0e1d3bfc854df2a86e2b81a7782ef7cc24e59e9e
library(DelayedArray)
M <- DelayedArray(matrix(1:15, ncol=3))
M
# <5 x 3> DelayedMatrix object of type "integer":
# [,1] [,2] [,3]
# [1,] 1 6 11
# [2,] 2 7 12
# [3,] 3 8 13
# [4,] 4 9 14
# [5,] 5 10 15
M[ , 3]
# [1] 11 12 13 14 15
So drop=TRUE now not only drops ineffective dimensions but also drops the class if only one dimension remains after this dropping. One drawback of this change is that there is no longer an easy way to get the 1D result as a DelayedArray. Using drop=FALSE preserves both the class (DelayedArray) and the ineffective dimensions:
M[ , 3, drop=FALSE]
# <5 x 1> DelayedMatrix object of type "integer":
# [,1]
# [1,] 11
# [2,] 12
# [3,] 13
# [4,] 14
# [5,] 15
To drop the ineffective dimensions but preserve the class, now you have to do something like:
M2 <- M[ , 3, drop=FALSE]
perm <- which(dim(M2) != 1)
if (length(perm) == 0) perm <- 1
aperm(M2, perm)
# <5> DelayedArray object of type "integer":
# [1] [2] [3] [4] [5]
# 11 12 13 14 15
Ouch!
I wonder how best to deal with this. Via a third value for the drop argument? Via a dedicated drop2() function? Via both? Hopefully the use case is rare enough that I still have time to think about this.
H.
Thanks Herve, this should make things a lot easier.
Regarding the dropping: the current behaviour of M[ , 3, drop=FALSE] seems like the correct behaviour, IMO. For other representations, you get a submatrix of the same dimensionality when drop=FALSE:
library(Matrix)
S <- rsparsematrix(5, 3, 0.2)
S[, 3, drop=FALSE]
## 5 x 1 sparse Matrix of class "dgCMatrix"
##
## [1,] .
## [2,] .
## [3,] .
## [4,] .
## [5,] .
... and so on for matrix (obviously):
Y <- as.matrix(S)
Y[, 3, drop=FALSE]
## [,1]
## [1,] 0
## [2,] 0
## [3,] 0
## [4,] 0
## [5,] 0
... so I don't see why people would (generally) expect a 1D DelayedArray after subsetting. Of course, I can see some specific applications (e.g., VariantExperiment), but that feels like the exception rather than the rule; as you say, a dedicated function might be sufficient for those use cases.
Yes, M[ , 3, drop=FALSE] is consistent with what subsetting does on other matrix-like objects (and always has been; I didn't change that). However, there is now no easy way to get what M[ , 3, drop=TRUE] used to give you (i.e., a 1D DelayedArray), even though you're right that most of the time people (and a lot of existing code) expect M[ , 3, drop=TRUE] to return an ordinary vector.
Note that this was not my expectation originally, because if you read the man pages for [ and drop in base R carefully, they're really about dropping dimensions. Nothing is said about dropping the class. They even kind of suggest that the class should be preserved:
If 'x' is an object with a 'dim' attribute (e.g., a matrix or array), then 'drop' returns
an object **like** 'x', but with any extents of length one removed.
But you convinced me that in this case I should align my expectation with that of most people and existing code ;-) Thanks!
Looks like the DelayedArray update did the job w.r.t. our iSEE instances; I'm going to close this.
As the title suggests:
I would have expected a vector, as I would have gotten from blah[,1,drop=TRUE]. Running on DelayedArray version 0.5.22.