Bioconductor / DelayedArray

A unified framework for working transparently with on-disk and in-memory array-like datasets
https://bioconductor.org/packages/DelayedArray
24 stars 9 forks source link

Subsetting with drop=TRUE doesn't behave as expected #6

Closed LTLA closed 5 years ago

LTLA commented 6 years ago

As the title suggests:

library(DelayedArray)
blah <- matrix(runif(1000), ncol=20, nrow=50)
X <- DelayedArray(blah)
X[,1,drop=TRUE]
## <50 x 1> DelayedMatrix object of type "double":
##            [,1]
##  [1,] 0.9186617
##  [2,] 0.4121906
##  [3,] 0.7185965
##  [4,] 0.3157570
##  [5,] 0.2866116
##   ...         .
## [46,] 0.6924624
## [47,] 0.1779758
## [48,] 0.8557219
## [49,] 0.5277486
## [50,] 0.7759722

I would have expected a vector as I would have gotten from blah[,1,drop=TRUE].

Running on DelayedArray version 0.5.22.

hpages commented 6 years ago

Hey Aaron, drop is ignored when subsetting a DelayedArray object. See ?DelayedArray I added a warning in DelayedArray 0.5.24 if you explicitly specify a non-FALSE value for drop. H.

LTLA commented 6 years ago

Oh! I didn't even see this. Is there a reason for ignoring drop=?

hpages commented 6 years ago

Good question. When you say that you'd have expected DelayedArray(blah)[ , 1, drop=TRUE] to return a vector in your original example, do you mean an ordinary vector or a mono-dimensional DelayedArray object? Doing the former wouldn't be good. I thought I had a good reason for not doing the latter either but I can't remember it. Probably worth revisiting this. I'll let you know.

LTLA commented 6 years ago

Actually, I would have expected the former, similar to how Matrix behaves:

library(Matrix)
a <- rsparsematrix(10, 20, 0.1)
a[,1] # gives an ordinary R vector

You could say that this is silly default behaviour, and definitely, it's one of the standard R tripwires for new users. But it's the default for all other R matrices, and a more experienced user would expect it to be there.

PeteHaitch commented 6 years ago

2 cents: I don't mind the option for x[, 1] to realise the result in memory, but I wouldn't like it as the default via drop = TRUE.

LTLA commented 6 years ago

After a while, I started to understand why drop=TRUE was the default. In the majority of interactive analyses, I would be using the result of subsetting as a vector anyway (e.g., for plotting, to use in lm(), and so on), and drop=TRUE saves me from typing an extra as.vector() every time.

It only ever caused problems when I was writing functions that take arbitrary input for subsetting. In such cases, specifying an extra drop=FALSE is not particularly onerous given that I've written an entire function in the first place. Moreover, if we want to be able to write R functions that are agnostic to the matrix representation, then it becomes much more important that [ behaves consistently; some developers may depend on the expected default behaviour of [ for all other matrices (i.e., returning a vector).

PeteHaitch commented 6 years ago

I can see the argument that it may simplify interfacing with existing code and that it ensures consistency with the ingrained (albeit arguably bad) default of drop = TRUE.

My hesitance is because I've been taking a very explicit approach to realization, namely "don't do it unless I really, really tell you to" and so I've been happy to use as.vector(), as.matrix(), as.array(), etc. when necessary.

LTLA commented 6 years ago

Are we talking about DelayedMatrixStats here, or just in general usage?

For the former: I didn't realize there were instances where you were accessing individual rows/columns. I would have thought that you would be using a block processing strategy for all functions.

For the latter: I understand the frustration of waiting for realization if one does [ and drop=TRUE by default. But this feels like a cost that needs to be paid sooner or later for most uses of data from single row/column.

Anyway, off the top of my head, I can think of at least a few pieces of code under my management that will break with DelayedMatrix inputs if drop=FALSE by default. Most of these are plotting-related, i.e., pulling out a gene's expression to colour points by (in various scater functions, or in iSEE).

PeteHaitch commented 6 years ago

DelayedMatrixStats: Yes, block-processing rather than row- or column-wise processing is what I should be doing everywhere. I need to revisit this now that I have more experience and with recent changes to DelayedArray. An aside: I'm typically using row-blocks or column-blocks regardless of the chunking strategy of the underlying data, which is sub-optimal for data access but greatly simplifies the implementing the methods.

Other stuff: Right now, I'm adding support for DelayedArray-backed objects to minfi. There I'm facing all sorts of data access patterns, including column-wise processing of ordinary matrices, since minfi has code written by a bunch of different people, all with very different coding styles.

FWIW the situation with subsetting an array vs a DelayedArray as basically the same that tibble developers faced with subsetting a data.frame vs. a tibble. There, the developers decided to make drop = FALSE the default, which of course has its own problems.

hpages commented 6 years ago

I've made that change: 5ed973c87a1d065281c32b609272b53c745868ae

I'm really hesitant to take the extra step of turning the returned object into an ordinary vector when it's mono-dimensional. I feel like automatically triggering a possibly costly realization will probably surprise most users. We could make an exception when the returned object is a mono-dimensional DelayedArray of length 1 though. What do you guys think?

LTLA commented 6 years ago

From a developer perspective, realization to a vector would be the expected outcome, as I mentioned before. Failing to do so would cause at least a few incompatibilities with existing code. Easily fixed with a forced drop(), but this seems like unnecessary work if we want it to be a drop-in matrix substitute.

From a user perspective - I don't see much use for a monodimensional DelayedArray without realization in an interactive session. I mean, is anyone going to (routinely) take individual row/columns, do something to them, and then... not use the result? That seems like a pretty rare use case to me, IMHO.

Moreover, the cost of realization of a single row/column is not too bad for an interactive session. In most cases, it should return quickly enough for the user to correct their mistake if they intended drop=FALSE. Unless it requires a read from a HDF5 file, but if the data's that big, they should have been more careful.

Besides, at a certain level of user (like me, at least), the drop=TRUE behaviour is expected, so I was surprised when this did not happen. And for other users... well, they better learn quick, because that's how all other R matrices behave.

PeteHaitch commented 6 years ago

@hpages one oddity with the current implementation

suppressPackageStartupMessages(library(DelayedArray))
x <- matrix(1:10, ncol = 2)
dim(x[, 1])
#> NULL
X <- DelayedArray(x)
dim(X[, 1])
#> [1] 5
hpages commented 6 years ago

@PeteHaitch That's expected. The result of this subsetting is a mono-dimensional DelayedArray. The one dimension of a mono-dimensional DelayedArray object is reported by dim(). I don't want dim() to return NULL on a DelayedArray object, otherwise it cannot be considered an array-like object anymore. We should introduce something like DelayedAtomicVector as an equivalent to a mono-dimensional DelayedArray object, except that this time dim() would return NULL. Then [ could return a DelayedAtomicVector instead of a mono-dimensional DelayedArray object. However I'm not really keen on starting the DelayedAtomicVector adventure for now.

@LTLA Mono-dimensional DelayedArray objects are actually being used in the VariantExperiment package ( https://github.com/Bioconductor/VariantExperiment ). They are the founding blocks of the DelayedDataFrame objects defined in this package. If we modify [ to realize the result when it's mono-dimensional, then subsetting a mono-dimensional DelayedArray would realize it every time (unless the end user remembers to use drop=FALSE).

PeteHaitch commented 6 years ago

@hpages Understood, thanks for the explanation.

LTLA commented 6 years ago

@hpages It feels that specialist behaviour for this DelayedDataFrame class (i.e., to not drop dims) is a job for its own subset method. In fact, looking at DelayedDataFrame-class.R suggests that there's never actually any subsetting at all via [ - it's all done with slot assignments to indices in extractROWS().

hpages commented 6 years ago

@LTLA My point is that VariantExperiment users are going to be exposed to a lot of mono-dimensional DelayedArray objects and interact with them. For example if ve is a VariantExperiment object, they might do things like ve$batchid[ve$batchid == 1]. This could possibly return a very long mono-dimensional DelayedArray object. Not sure it would be a good thing to realize it by default.

I feel like it's a situation where we need to choose between 2 inconveniences. If [ realizes the 1D result by default, the inconvenience would be on the user (getting unwanted realizations, unless s/he remembers to use drop=FALSE). If [ doesn't realize them, the inconvenience is on the developer (s/he needs to remember to call as.vector() on the result if s/he wants realization). For now I'm inclined to leave the inconvenience on the developer.

hpages commented 6 years ago

BTW, something like A[A == 1] triggers an error at the moment on a 1D DelayedArray. Need to fix that.

LTLA commented 6 years ago

@hpages Okay. It seems that the long-term solution would be for VariantExperiment to switch to the proposed DelayedAtomicVector class, which is closer to how it really wants the data to be represented. This would then enable drop=TRUE in DelayedArray with the expected consequences.

hpages commented 6 years ago

@LTLA The DelayedAtomicVector class is definitely worth exploring. I'm adding it to the TODO list.

hpages commented 6 years ago

I fixed the A[A > 0] (or A[A > 0 & A*log(A) >= 5]) idiom. See a3b9837f9deb6f3d73e8a77faf65be7d7fe51bbf This was not only broken on 1D DelayedArray objects: it was not supported on DelayedArray objects in general.

Working on this made me slightly change my mind about what should happen when doing A[A > 0] with a 1D object. Now it returns an ordinary vector (unless drop=FALSE). Yes I know, I was arguing earlier that this would be inconvenient in the context of the VariantExperiment use case. But when A has more than 1 dimension, A[A > 0] is always an ordinary vector (drop is ignored). This is a strong argument for doing the same thing when A is 1D. I think I'm slowly changing my mind towards having things like M[5, ] also return an ordinary vector (unless drop=FALSE). But first I need to sleep ;-)

LTLA commented 5 years ago

I can confirm that this inconsistency in behaviour is a real problem developer-side. We recently transitioned some of our iSEE instances to use HDF5Array (see https://marionilab.cruk.cam.ac.uk/iSEE_tcga/ for an example), but this was far from a drop-in replacement. We had to hunt down every instance where we extracted a row or column from a matrix-like structure to wrap it in as.vector(), to avoid a series of cryptic errors upon entering shiny. Even then, it took a few days of debugging to weed out all the problems.

One can only imagine the chaos that will ensue when DAs are used more generally in functions that do not explicitly consider and protect against this dimensionality-preserving behaviour.

hpages commented 5 years ago

OK let's do it!

hpages commented 5 years ago

Done in DelayedArray 0.7.44: https://github.com/Bioconductor/DelayedArray/commit/0e1d3bfc854df2a86e2b81a7782ef7cc24e59e9e

library(DelayedArray)
M <- DelayedArray(matrix(1:15, ncol=3))

M
# <5 x 3> DelayedMatrix object of type "integer":
#      [,1] [,2] [,3]
# [1,]    1    6   11
# [2,]    2    7   12
# [3,]    3    8   13
# [4,]    4    9   14
# [5,]    5   10   15

M[ , 3]
# [1] 11 12 13 14 15

So drop=TRUE now not only drops ineffective dimensions but also the class if only 1 dimension remains after this dropping. One drawback with this change is that there is no easy way to get the 1D result as a DelayedArray anymore. Using drop=FALSE preserves both the class (DelayedArray) and the ineffective dimensions:

M[ , 3, drop=FALSE]
# <5 x 1> DelayedMatrix object of type "integer":
#      [,1]
# [1,]   11
# [2,]   12
# [3,]   13
# [4,]   14
# [5,]   15

To drop the ineffective dimensions but preserve the class, now you have to do something like:

M2 <- M[ , 3, drop=FALSE]
perm <- which(dim(M2) != 1)
if (length(perm) == 0) perm <- 1

aperm(M2, perm)
# <5> DelayedArray object of type "integer":
# [1] [2] [3] [4] [5] 
#  11  12  13  14  15 

Ouch!

I wonder how to best deal with this. Via a 3rd value to the drop argument? Via a dedicated drop2() function? Via both? Hopefully the use case is rare enough so I still have time to think about this.

H.

LTLA commented 5 years ago

Thanks Herve, this should make things a lot easier.

Regarding the dropping; the current behaviour of M[ , 3, drop=FALSE] seems like the correct behaviour IMO. For other representations, you get a submatrix of the same dimensionality when drop=FALSE:

library(Matrix)
S <- rsparsematrix(5, 3, 0.2)
S[, 3, drop=FALSE]
## 5 x 1 sparse Matrix of class "dgCMatrix"
##       
## [1,] .
## [2,] .
## [3,] .
## [4,] .
## [5,] .

... and so on for matrix (obviously):

Y <- as.matrix(S)
Y[, 3, drop=FALSE]
##      [,1]
## [1,]    0
## [2,]    0
## [3,]    0
## [4,]    0
## [5,]    0

... so I don't see why people would (generally) expect a 1D DelayedArray after subsetting. Of course, I can see some specific applications (e.g., VariantExperiment) but that feels like an exception rather than the rule - as you say, a dedicated function might be sufficient for those use cases.

hpages commented 5 years ago

Yes, M[ , 3, drop=FALSE] is consistent with what subsetting does on other matrix-like objects (and has always been, I didn't change that). However there is no easy way now to get what M[ , 3, drop=TRUE] used to give you (i.e. a 1D DelayedArray), even though you're right that most of the time people (and a lot of existing code around) expect M[ , 3, drop=TRUE] to return an ordinary vector.

Note that this was not my expectation originally because if you read carefully the man page for [ and drop in base R, they're really about dropping dimensions. Nothing is said about dropping the class. They even kind of suggest that the class should be preserved:

If 'x' is an object with a 'dim' attribute (e.g., a matrix or array), then 'drop' returns
an object **like** 'x', but with any extents of length one removed.

But you convinced me that in this case I should align my expectation with that of most people and existing code ;-) Thanks!

LTLA commented 5 years ago

Looks like the DelayedArray update did the job w.r.t. our iSEE instances; I'm going to close this.