Bioconductor / SummarizedExperiment

A container (S4 class) for matrix-like assays
https://bioconductor.org/packages/SummarizedExperiment
Consider assayData DataFrame for assays #62

Closed by jmw86069 2 years ago

jmw86069 commented 2 years ago

I asked on Twitter then Bioc Stack, then @vjcitn suggested I ask here. :)

“using SummarizedExperiment: I want something like assayData() to hold tabular data about each matrix in assays(), one row per assay. When storing more than one assay matrix, I encode too much into the assay name.

Has this idea been discussed?”

I basically want a slot “assayData” that holds a DataFrame, initially with no columns, with one row per entry in the assays slot. (I see your post asking about assay name constraints; that could be useful or necessary here as well.)

I can add some driving use cases in the next post.

Two basic utilities:

jmw86069 commented 2 years ago

I was thinking I could test a small R package that populates assayData as a DataFrame in the metadata slot. I would try to implement a 3-dimension subset method that would also keep the assayData in sync with the assays.

I’m not sure how to intercept adding an assay. I guess it would take a custom function like: addSEassay(se, assaylist, assayData=NULL)

It’s tricky to add an empty row to assayData. If the user supplies assayData, the simplest approach would be to require it to contain the same colnames as the assayData already in the se object.

Is there a similar utility to add a row to rowData or colData? I don’t remember seeing one.

vjcitn commented 2 years ago

I think it is a good idea to produce a small demonstration of what you are hoping for, with the extra information going into the metadata element. Define your operations as plain R functions and then we can evaluate what infrastructure changes and methods might be warranted.
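
A minimal sketch of what such a plain R function might look like, assuming the per-assay table lives in metadata(se)$assayData and already exists before the first call. addSEassay() is the hypothetical helper named earlier in this thread, not part of SummarizedExperiment, and this version sidesteps the "empty row" problem by requiring assayData to be supplied with the same columns:

```r
library(SummarizedExperiment)

# Hypothetical helper (not part of SummarizedExperiment): append assays and
# keep a per-assay table in metadata(se)$assayData in sync with them.
addSEassay <- function(se, assaylist, assayData) {
    old <- metadata(se)$assayData
    stopifnot(
        !is.null(old),                                 # table must already exist
        !is.null(names(assaylist)),                    # new assays must be named
        nrow(assayData) == length(assaylist),          # one row per new assay
        identical(colnames(assayData), colnames(old))  # same columns required
    )
    rownames(assayData) <- names(assaylist)
    for (nm in names(assaylist)) {
        assay(se, nm) <- assaylist[[nm]]
    }
    metadata(se)$assayData <- rbind(old, assayData)
    se
}

se <- SummarizedExperiment(list(A1=matrix(1:12, ncol=3), A2=matrix(101:112, ncol=3)))
# assumption: the table is initialized by hand, one row per existing assay
metadata(se)$assayData <- DataFrame(isnormalized=c(TRUE, FALSE), row.names=assayNames(se))

se <- addSEassay(se,
    assaylist=list(A3=matrix(201:212, ncol=3)),
    assayData=DataFrame(isnormalized=TRUE))
metadata(se)$assayData
```

Nothing here keeps the table in sync when assays are removed or reordered through assays(se) <-, which is the part that would need real infrastructure.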

hpages commented 2 years ago

I don't know how people expect to have a productive discussion, especially a technical one, on Twitter :roll_eyes:

@jmw86069 The assays() getter returns the assays in a SimpleList which is something that can hold metadata columns (like any other Vector derivative):

library(SummarizedExperiment)

se <- SummarizedExperiment(list(A1=matrix(1:12, ncol=3), A2=matrix(101:112, ncol=3)))

assays(se)
# List of length 2
# names(2): A1 A2

class(assays(se))
# [1] "SimpleList"
# attr(,"package")
# [1] "S4Vectors"

mcols(assays(se)) <- DataFrame(assayid=c("id1", "id2"), isnormalized=c(TRUE, FALSE), otherstuff=c("X", "Y"))
mcols(assays(se))
# DataFrame with 2 rows and 3 columns
#        assayid isnormalized  otherstuff
#    <character>    <logical> <character>
# A1         id1         TRUE           X
# A2         id2        FALSE           Y
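
As a small follow-on sketch (my own illustration, reusing the column names from the example above), the metadata columns can then be used to pick assays by annotation rather than by parsing assay names:

```r
library(SummarizedExperiment)

se <- SummarizedExperiment(list(A1=matrix(1:12, ncol=3), A2=matrix(101:112, ncol=3)))
mcols(assays(se)) <- DataFrame(assayid=c("id1", "id2"), isnormalized=c(TRUE, FALSE), otherstuff=c("X", "Y"))

# select assays by their annotation instead of by name
is_norm <- mcols(assays(se))$isnormalized
normalized <- assays(se)[is_norm]
names(normalized)
# [1] "A1"
```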

Is this what you are after?

jmw86069 commented 2 years ago

Wow that's actually very helpful, thank you! @hpages

You're right about tech discussions on Twitter, but it did (eventually) get enough visibility for a response! Also, I didn't know where to ask at first. I was hoping something existed already, and at least that part was correct.

I've been a longtime user of Bioc classes, and of SummarizedExperiment. It never occurred to me that List would also have metadata columns. That's my fault.

The only little issue is that adding to assays(se) <- does not update the mcols(assays(se)) and so it has to be done in a second step. Not a big deal, I can work with that.

For my purposes, I don't have a driving reason to request any changes to the infrastructure, I'll close this issue.

hpages commented 2 years ago

“The only little issue is that adding to assays(se) <- does not update the mcols(assays(se))”

Not sure what you mean by "adding to assays(se) <-".

With assays(se) <-:

assays(se) <- c(assays(se), rev(assays(se)))
mcols(assays(se))
# DataFrame with 4 rows and 3 columns
#        assayid isnormalized  otherstuff
#    <character>    <logical> <character>
# A1         id1         TRUE           X
# A2         id2        FALSE           Y
# A2         id2        FALSE           Y
# A1         id1         TRUE           X

and with assay(se, i) <-:

assay(se, 5L) <- matrix(201:212, ncol=3)
mcols(assays(se))
# DataFrame with 5 rows and 3 columns
#        assayid isnormalized  otherstuff
#    <character>    <logical> <character>
# A1         id1         TRUE           X
# A2         id2        FALSE           Y
# A2         id2        FALSE           Y
# A1         id1         TRUE           X
#             NA           NA          NA

Looks fine to me.

Please open a new issue and provide details if this doesn't work for you or if you were expecting something else.

H.

jmw86069 commented 2 years ago

Yes, I should have clarified - I think current behavior is working as expected now that I understand about mcols here. :) All good.

The "only little issue" was practical for me: adding an assay as a numeric matrix directly (shown in your second example) creates NA values in the mcols DataFrame. Not a problem at all, just a thing for me to handle accordingly.
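
A rough sketch of handling it that way, assuming the same three metadata columns as above; the values used for the new assay ("id3", "Z") are made up for illustration:

```r
library(SummarizedExperiment)

se <- SummarizedExperiment(list(A1=matrix(1:12, ncol=3), A2=matrix(101:112, ncol=3)))
mcols(assays(se)) <- DataFrame(assayid=c("id1", "id2"), isnormalized=c(TRUE, FALSE), otherstuff=c("X", "Y"))

# step 1: add a bare matrix; its row in mcols is filled with NA
assay(se, 3L) <- matrix(201:212, ncol=3)

# step 2: fill in the NA row, column by column
md <- mcols(assays(se))
md$assayid[3] <- "id3"
md$isnormalized[3] <- FALSE
md$otherstuff[3] <- "Z"
mcols(assays(se)) <- md
```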

I like your first example, which requires creating a SimpleList with the new assay. I understand this is how to add a new assay with metadata:

# A. start as before
se <- SummarizedExperiment(list(A1=matrix(1:12, ncol=3),
   A2=matrix(101:112, ncol=3)))
mcols(assays(se)) <- DataFrame(assayid=c("id1", "id2"),
   isnormalized=c(TRUE, FALSE),
   otherstuff=c("X", "Y"))

# B. new assay matrix
new_matrix <- assays(se)[[1]] + 10
# new mcols for this assay matrix
new_mcols <- DataFrame(assayid="id1_plus10",
   isnormalized=TRUE,
   otherstuff="Z")
# make a SimpleList for the new assay
new_assays <- SimpleList(A1_plus10=new_matrix)
# add mcols in a second step
mcols(new_assays) <- new_mcols

# C. add the assay
assays(se) <- c(assays(se), new_assays)
mcols(assays(se))
# DataFrame with 3 rows and 3 columns
#               assayid isnormalized  otherstuff
#           <character>    <logical> <character>
# A1                id1         TRUE           X
# A2                id2        FALSE           Y
# A1_plus10  id1_plus10         TRUE           Z

The SimpleList(...) step is interesting: as far as I can tell, there is no constructor that also accepts the metadata, for example something like SimpleList(..., mcols=DataFrame()).

Do you recommend the two step process, or is there a fancier approach?

new_assays <- SimpleList(A1_plus10=new_matrix)
mcols(new_assays) <- new_mcols
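
If the two steps get repetitive, they could be wrapped in a small helper. This is only a sketch: simpleListWithMcols() and its mcols_df argument are hypothetical names, not an S4Vectors API, and the wrapper just performs the same two steps internally.

```r
library(S4Vectors)

# Hypothetical convenience wrapper around the two-step process
simpleListWithMcols <- function(..., mcols_df = NULL) {
    x <- SimpleList(...)
    if (!is.null(mcols_df)) {
        stopifnot(nrow(mcols_df) == length(x))  # one metadata row per element
        mcols(x) <- mcols_df
    }
    x
}

new_assays <- simpleListWithMcols(
    A1_plus10=matrix(11:22, ncol=3),
    mcols_df=DataFrame(assayid="id1_plus10", isnormalized=TRUE, otherstuff="Z"))
mcols(new_assays)
```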