Bioconductor / SummarizedExperiment

A container (S4 class) for matrix-like assays
https://bioconductor.org/packages/SummarizedExperiment
33 stars 9 forks source link

all.equal does not work for SE objects #16

Closed LTLA closed 5 years ago

LTLA commented 6 years ago

This was a rather unpleasant little surprise:

library(SummarizedExperiment)
A <- SummarizedExperiment(matrix(1, 10, 10))
B <- SummarizedExperiment(matrix(2, 10, 10))
all.equal(A, B) # TRUE

I assume that this is due to some voodoo with environments inside assays. But this is worrying as it puts into question all of my unit tests across packages that work on SummarizedExperiments or their subclasses.

It seems that all.equal is a S3 generic, so perhaps a special method could be defined for the relevant inner class of the assays field (probably ShallowSimpleListAssays) to achieve the correct behaviour?

Session information ``` R Under development (unstable) (2018-11-02 r75535) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 16.04.5 LTS Matrix products: default BLAS: /home/cri.camres.org/lun01/Software/R/trunk/lib/libRblas.so LAPACK: /home/cri.camres.org/lun01/Software/R/trunk/lib/libRlapack.so locale: [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] SummarizedExperiment_1.13.0 DelayedArray_0.9.0 [3] BiocParallel_1.17.0 matrixStats_0.54.0 [5] Biobase_2.43.0 GenomicRanges_1.35.0 [7] GenomeInfoDb_1.19.0 IRanges_2.17.0 [9] S4Vectors_0.21.0 BiocGenerics_0.29.1 loaded via a namespace (and not attached): [1] lattice_0.20-38 bitops_1.0-6 grid_3.6.0 [4] zlibbioc_1.29.0 XVector_0.23.0 Matrix_1.2-15 [7] tools_3.6.0 RCurl_1.95-4.11 compiler_3.6.0 [10] GenomeInfoDbData_1.2.0 ```
LTLA commented 5 years ago

This seems to be the fault of all.equal.envRefClass() in base, which ignores the .xData slot of the two ShallowSimpleListAssays objects being compared. I presume that this is because .xData typically refers to external data resources that cannot be directly compared - but if so, why is ShallowSimpleListAssays implemented like that? It doesn't really have "external resources" in the same sense.

lawremi commented 5 years ago

The reason all.equal.envRefClass() ignores @.xData is because the methods package uses that slot as the environment for storing the actual fields of the object. The fields are checked separately.

The problem is that ShallowSimpleListAssays defines its own method on [[() for getting elements of the list, but the methods package assumes that [[() accesses fields on the object by name. I'm not sure whether this restriction is documented anywhere, but if you think about it, it would be best to avoid inconsistency with the behavior of $(). ShallowSimpleListAssays probably should be composed of a reference class object, rather than being one itself.

Do we still need ShallowSimpleListAssays since R is shallow copy by default? I guess this is for compatibility with serialized instances.

hpages commented 5 years ago

I implemented the [[ method for ShallowSimpleListAssays objects. I've no idea if the fact that implementing a [[ method for a ref class breaks basic functionalities like all.equal() is documented somewhere. It didn't even occur to me that such a restriction could exist (I'm naive and still haven't learned what the methods package is capable of after all these years).

Anyway, looks like reference classes as implemented in the methods package are a bit problematic (https://stat.ethz.ch/pipermail/bioc-devel/2019-May/015112.html, too problematic for my taste) so maybe the recommendation should be to avoid them. Especially since they are often used where they shouldn't (e.g. TxDb objects). In the case of the assays slot of an SE object they were introduced a long time ago by the father of the SE class and IIRC this was a trick to avoid triggering copies of the assays slot when the user does things like setting the rownames or colnames on the SE object. It's probably not needed anymore now that R is "shallow copy by default" but yeah there are hundreds (if not thousands) serialized instances around so we need to be careful when changing this.

The good news is that the assays slot only needs to be an "Assays" object and Assays is a virtual class with no slot:

showClass("Assays")
# Virtual Class "Assays" [package "SummarizedExperiment"]
#
# No Slots, prototype of class "SimpleListAssays"
#
# Known Subclasses: "SimpleListAssays", "ShallowSimpleListAssays", "AssaysInEnv"

ShallowSimpleListAssays is only 1 concrete subclass among others and is the one currently used by the SummarizedExperiment() constructor to wrap the assays passed to it. Should be easy to change this to SimpleListAssays which a direct SimpleList extension. Serialized instances that use ShallowSimpleListAssays should keep working.

I'll work on this, including replacing the [[ methods for the various Assays subclasses with getListElement methods. Will also start thinking about an all.equal() method for Assays derivatives that doesn't return false positives or false negatives. Sounds like a must! It will be expensive when the assay data is on disk but hey, the end user is asking for a deep comparison so should be ready for that.

hpages commented 5 years ago

SummarizedExperiment 1.15.1 should eliminate Aaron's unpleasant little surprise (see commit 1c9719593f1a5fcb3ebe1c8cff2029c6039445e0).

I've opened 2 separate issues (#21 and #22) for the other things discussed here.

Closing this issue (feel free to re-open if I missed something).

H.

LTLA commented 5 years ago

Thanks Herve... now, let's see how many of my (now correctly working) unit tests break!