Bioconductor / RaggedExperiment

Matrix-like representations of mutation and CN data
https://bioconductor.org/packages/RaggedExperiment

Ragged experiment from a sliced multi-assay experiment contains unexpected columns #13

Closed: jason-weirather closed this issue 7 years ago

jason-weirather commented 7 years ago

I've run into some trouble working with a RaggedExperiment I created. After slicing, the object still carries extra columns, and accessing the data returns only those unexpected columns. I have reproduced the problem with one of the Waldron datasets: http://s3.amazonaws.com/multiassayexperiments/ccleMAEO.rds

library(MultiAssayExperiment)
library(RaggedExperiment)
ccle <- updateObject(readRDS('ccleMAEO.rds'))
ccle <- ccle[,ccle$TissueOrigin == 'BREAST']
muts <- ccle[c('BRCA1','BRCA2'), ,'Mutations']@ExperimentList$Mutations
muts

class: RaggedExperiment
dim: 2 26
assays(0):
rownames(2): BRCA1 BRCA2
colnames(26): AU565_BREAST BT20_BREAST ... ZR751_BREAST ZR7530_BREAST
colData names(0):

The dimensions and names from this view are what we expect.

dimnames(muts)[2]

[[1]]
 [1] "AU565_BREAST" "BT20_BREAST" "BT474_BREAST" "BT549_BREAST" "CAL851_BREAST" "CAMA1_BREAST" "EFM19_BREAST" "HCC1187_BREAST" "HCC1395_BREAST"
[10] "HCC1569_BREAST" "HCC1806_BREAST" "HCC1954_BREAST" "HCC70_BREAST" "HMC18_BREAST" "HS578T_BREAST" "HS739T_BREAST" "MCF7_BREAST" "MDAMB415_BREAST"
[19] "MDAMB436_BREAST" "MDAMB453_BREAST" "MDAMB468_BREAST" "SKBR3_BREAST" "T47D_BREAST" "UACC812_BREAST" "ZR751_BREAST" "ZR7530_BREAST"

However, if we access the names of the assays in the RaggedExperiment, we do not see what we expect after the earlier slicing: the list has 28 entries, not the 26 returned by dimnames(). Notice that PROSTATE and LUNG samples are now present, even though the object was sliced on BREAST earlier and the previous command showed only BREAST samples.

names(muts@assays)

[1] "22RV1_PROSTATE" "AU565_BREAST" "BT20_BREAST" "BT474_BREAST" "BT549_BREAST" "CAL851_BREAST" "CALU3_LUNG" "CAMA1_BREAST" "EFM19_BREAST"
[10] "HCC1187_BREAST" "HCC1395_BREAST" "HCC1569_BREAST" "HCC1806_BREAST" "HCC1954_BREAST" "HCC70_BREAST" "HMC18_BREAST" "HS578T_BREAST" "HS739T_BREAST"
[19] "MCF7_BREAST" "MDAMB415_BREAST" "MDAMB436_BREAST" "MDAMB453_BREAST" "MDAMB468_BREAST" "SKBR3_BREAST" "T47D_BREAST" "UACC812_BREAST" "ZR751_BREAST"
[28] "ZR7530_BREAST"

Furthermore, calling unlist() on the assays returns only the unexpected samples and none of the expected ones. We expect to see only BREAST samples, and more entries than this. Instead we see only PROSTATE and LUNG.

unlist(muts@assays)

GRanges object with 2 ranges and 0 metadata columns:
                       seqnames               ranges strand
  22RV1_PROSTATE.BRCA2    chr13 [32954022, 32954023]      +
      CALU3_LUNG.BRCA1    chr17 [41245233, 41245233]      +
  -------
  seqinfo: 23 sequences from hg19 genome; no seqlengths

Additionally, the sparseAssay() function described in the documentation does not work on this RaggedExperiment.

sparseAssay(muts)

Error in .assay_i(x, i) : 'length(assays(x))' is 0

For more information about the session and versions:

sessionInfo()

R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] Biobase_2.37.2 RaggedExperiment_1.1.4 GenomicRanges_1.29.12 GenomeInfoDb_1.13.4 IRanges_2.11.12
[6] S4Vectors_0.15.6 BiocGenerics_0.23.0 MultiAssayExperiment_1.3.31

LiNk-NY commented 7 years ago

Hi Jason, @jason-weirather

I was reorganizing the datasets a bit and moved ccleMAEO.rds to https://s3.amazonaws.com/multiassayexperiments/example/ccleMAEO.rds

I will look into your issue shortly.

Regards, Marcel

LiNk-NY commented 7 years ago

Hi Jason, @jason-weirather

This is the intended behavior of RaggedExperiment, and there is nothing wrong with the infrastructure the data is in. RaggedExperiment uses row and column indices to create a matrix-like representation of ragged ranges. The @assays slot is an internal representation that is not intended for end users; accessing it directly shows all the ranges that were available in the RaggedExperiment before subsetting.
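To illustrate the general idea of index-based subsetting, here is a minimal toy sketch in base R. The class IndexedStore, its slots store and idx, and the accessor visible() are all hypothetical names made up for this illustration; this is not RaggedExperiment's actual implementation, only an analogy for why a raw slot can hold more than the public view reports.

```r
library(methods)

# Toy class (hypothetical, not part of RaggedExperiment): 'store' keeps every
# element ever added; 'idx' records which elements the current view exposes.
setClass("IndexedStore", representation(store = "list", idx = "integer"))

# Public accessor: returns only the elements selected by the index.
setGeneric("visible", function(x) standardGeneric("visible"))
setMethod("visible", "IndexedStore", function(x) x@store[x@idx])

# Subsetting by name rewrites the index and leaves the storage untouched.
setMethod("[", "IndexedStore", function(x, i, j, ..., drop = TRUE) {
    x@idx <- x@idx[match(i, names(x@store)[x@idx])]
    x
})

s <- new("IndexedStore",
         store = list(breast = 1, lung = 2, prostate = 3),
         idx = 1:3)
sub <- s["breast"]

names(visible(sub))  # names seen through the accessor
names(sub@store)     # names still present in the raw slot
```

In this sketch the accessor reports only the selected element, while the raw slot still lists everything that was stored before subsetting.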

The issue lies with the data itself: its ranges carry no metadata columns.

After further inspection, the original RangedRaggedAssay object did not contain any data elements to begin with.

## Note. class marked for deprecation
$Mutations
RangedRaggedAssay with 61534 disjoint ranges, 451 samples, and 0 data elements

Therefore, you can't represent any of the data in assay form and the error, Error in .assay_i(x, i) : 'length(assays(x))' is 0, is appropriate.
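As a rough illustration of the point about metadata columns, the sketch below builds two small RaggedExperiment objects from made-up ranges. The sample name s1, the range names, and the score column are assumptions chosen for this example and do not come from the CCLE data.

```r
library(GenomicRanges)
library(SummarizedExperiment)  # provides the assays() generic
library(RaggedExperiment)

# Toy ranges (hypothetical): one set has a 'score' metadata column,
# the other has no metadata columns at all.
with_score <- GRanges(c(a = "chr1:1-10:+", b = "chr1:20-30:+"),
                      score = c(1.5, 2.5))
no_mcols <- GRanges(c(c = "chr2:5-15:+"))

re_scored <- RaggedExperiment(s1 = with_score)
re_empty <- RaggedExperiment(s1 = no_mcols)

# Each metadata column can be viewed as one assay.
length(assays(re_scored))
length(assays(re_empty))

# With a metadata column present, a matrix view is available.
sparseAssay(re_scored)
```

Under these assumptions, only the object built from ranges with a metadata column has anything for sparseAssay() to return.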

I will be working on rebuilding this MultiAssayExperiment object with newer data from the CCLE portal. You can find the code at: https://github.com/waldronlab/MultiAssayExperiment-CCLE

Best regards, Marcel

jason-weirather commented 7 years ago

Thanks for looking into this, Marcel @LiNk-NY. I look forward to seeing your examples of how to use metadata. It seems the underlying issue of extraneous data being carried along after slicing remains. May I suggest prefixing private variables/slots with something like '.', so that they stand out as intended for internal use only? A similar convention is used in Python, and it seems to be used in the R community too.

https://stackoverflow.com/questions/10755509/public-and-private-slots-in-r

Thank you!

Jason

lwaldron commented 7 years ago

I'm not sure where to cite this, but in Bioconductor data classes (at least the core classes), @ slots are never intended for direct access by users. The APIs are designed around methods that access slots in the intended ways. There are a few examples of Bioc data structures that use .-prefixed slot names, but not many. In that terminology, I would say Bioconductor slots are always "private". "Public" slots would be indicated by an accessor function of the same name that returns the contents of that slot.
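A short sketch of inspecting a subsetted RaggedExperiment through accessor functions instead of @ slots follows. The toy ranges, the sample names s1 and s2, and the score column are hypothetical and chosen only for illustration.

```r
library(GenomicRanges)
library(RaggedExperiment)

# Two toy samples (hypothetical data) with a 'score' metadata column.
s1 <- GRanges(c(a = "chr1:1-10:+", b = "chr1:20-30:+"), score = 1:2)
s2 <- GRanges(c(c = "chr2:5-15:+"), score = 3L)
re <- RaggedExperiment(s1 = s1, s2 = s2)

# Keep one sample, then inspect the result through accessors only.
sub <- re[, "s1"]
colnames(sub)   # samples in the current view
dim(sub)        # rows and columns of the current view
rowRanges(sub)  # ranges in the current view
```

The accessors describe the object as it stands after subsetting, which is the view the rest of the API works with.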

lwaldron commented 7 years ago

(And I agree that more examples of how to use the RaggedExperiment assay metadata will be helpful.)