Bioconductor / SummarizedExperiment

A container (S4 class) for matrix-like assays
https://bioconductor.org/packages/SummarizedExperiment
33 stars 9 forks source link

Feature request: keep rownames(se) and rownames() of DataFrame nested in rowData(se) in sync #71

Closed kevinrue closed 1 year ago

kevinrue commented 1 year ago

See https://github.com/iSEE/iSEEde/issues/19

Briefly, the updated @NAMES slot is not used to set rownames when accessing nested DataFrames in rowData/mcols()/elementMetadata().

To be clear, it does apply to the overall rowData() but not nested DataFrames within it.

Thoughts?

Thanks in advance

# Setup ----

suppressPackageStartupMessages({
    library(SummarizedExperiment)
})
nfeatures <- 4
nsamples <- 2
feature_names <- head(letters, nfeatures)
sample_names <- head(LETTERS, nsamples)
count_matrix <- matrix(
    data = seq(1, nfeatures * nsamples),
    nrow = nfeatures,
    ncol = nsamples,
    dimnames = list(feature_names, sample_names)
)
se <- SummarizedExperiment(
    assays = list(counts = count_matrix),
    rowData = DataFrame(row.names = feature_names, value = seq(1, nfeatures)),
    colData = DataFrame(row.names = sample_names, value = seq(1, nsamples))
)
se
#> class: SummarizedExperiment 
#> dim: 4 2 
#> metadata(0):
#> assays(1): counts
#> rownames(4): a b c d
#> rowData names(1): value
#> colnames(2): A B
#> colData names(1): value
rownames(se)
#> [1] "a" "b" "c" "d"
rowData(se)
#> DataFrame with 4 rows and 1 column
#>       value
#>   <integer>
#> a         1
#> b         2
#> c         3
#> d         4

# Renaming rownames(se) does rename rownames(rowData(se)) ----

rownames(se) <- paste0("test1_row", seq(1, nrow(se)))
rownames(se)
#> [1] "test1_row1" "test1_row2" "test1_row3" "test1_row4"
rownames(rowData(se))
#> [1] "test1_row1" "test1_row2" "test1_row3" "test1_row4"
se@NAMES
#> [1] "test1_row1" "test1_row2" "test1_row3" "test1_row4"

# However, the renaming does not apply recursively to nested DataFrames inside rowData ----

## Creating a nested DataFrame does not automatically check nor apply rownames(se) to rownames(rowData(se)[["nested"]])

rowData(se)[["nested"]] <- DataFrame(
    Nested1 = seq(1, nrow(se)),
    Nested2 = seq(1, nrow(se))
)
rownames(se) # up-to-date
#> [1] "test1_row1" "test1_row2" "test1_row3" "test1_row4"
rownames(rowData(se)) # up-to-date
#> [1] "test1_row1" "test1_row2" "test1_row3" "test1_row4"
rownames(rowData(se)[["nested"]]) # NULL (!)
#> NULL
se@NAMES # up-to-date
#> [1] "test1_row1" "test1_row2" "test1_row3" "test1_row4"

## Renaming rownames(se) renames rownames(rowData(se)) but not rownames(rowData(se)[["nested"]])

rownames(se) <- paste0("test2_row", seq(1, nrow(se)))
rownames(se) # up-to-date
#> [1] "test2_row1" "test2_row2" "test2_row3" "test2_row4"
rownames(rowData(se)) # up-to-date
#> [1] "test2_row1" "test2_row2" "test2_row3" "test2_row4"
rownames(rowData(se)[["nested"]]) # NULL (!)
#> NULL
se@NAMES # up-to-date
#> [1] "test2_row1" "test2_row2" "test2_row3" "test2_row4"

## Renaming rownames(rowData(se)) ... does nothing! (makes sense though, would be bad practice)

rownames(rowData(se)) <- paste0("test3_row", seq(1, nrow(se)))
rownames(se) # out of date (!)
#> [1] "test2_row1" "test2_row2" "test2_row3" "test2_row4"
rownames(rowData(se)) # out of date (!)
#> [1] "test2_row1" "test2_row2" "test2_row3" "test2_row4"
rownames(rowData(se)[["nested"]]) # NULL (!)
#> NULL

Created on 2023-03-23 with reprex v2.0.2

hpages commented 1 year ago

Hi Kevin,

The title of the issue:

Feature request: keep rownames(se) and rownames(rowData(se)) in sync

But isn't it what happens in all the examples that you show above?

To be clear, it does apply to the overall rowData() but not nested DataFrames within it.

I'm not convinced that the rowData() getter has any business in touching the individual columns of the returned DataFrame. The user can have all kinds of things in these columns, a DataFrame or data.frame being just one possibility amongst many others:

X <- setNames(11:15, LETTERS[1:5])
Y <- setNames(21:25, letters[1:5])
Z <- data.frame(x1=311:315, x2=321:325, row.names=letters[22:26])
T <- Rle(41:45)

row_data <- DataFrame(X=X, Y=Y, Z=I(Z), T=T, row.names=paste0("feature", 1:5))
row_data
# DataFrame with 5 rows and 4 columns
#                  X         Y            Z     T
#          <integer> <integer> <data.frame> <Rle>
# feature1        11        21      311:321    41
# feature2        12        22      312:322    42
# feature3        13        23      313:323    43
# feature4        14        24      314:324    44
# feature5        15        25      315:325    45

se <- SummarizedExperiment(assays=list(count=matrix(rpois(15, lambda=9), nrow=5)), rowData=row_data)

rowData(se)$X
#  A  B  C  D  E 
# 11 12 13 14 15 

rowData(se)$Y
#  a  b  c  d  e 
# 21 22 23 24 25 

rowData(se)$Z
#    x1  x2
# v 311 321
# w 312 322
# x 313 323
# y 314 324
# z 315 325

rowData(se)$Z$x2
# [1] 321 322 323 324 325

rowData(se)$T
# integer-Rle of length 5 with 5 runs
#   Lengths:  1  1  1  1  1
#   Values : 41 42 43 44 45

The original columns are intact and IMO that's how it should be.

Let's keep in mind that:

  1. The individual columns can be anything and trying to set the names on any of them to the rownames of the SummarizedExperiment object is not even guaranteed to work:

    names(rowData(se)$T) <- rownames(se)
    # Error in names(rowData(se)$T) <- rownames(se) : 
    #   invalid to use names()<- on an S4 object of class 'Rle'
  2. The nesting could go deep in the rowData DataFrame, with possibly some data.frames, tibbles, data.tables involved at various levels of the nesting tree. If we were to start the business of keeping the rownames of the data-frame-like objects nested within rowData(se), where should we stop? This could quickly become very inefficient.

  3. Finally note that there would be no reason to not apply the same logic to colData(se) or to the metadata columns of a Vector derivative:

    gr <- GRanges("chr1", IRanges(start=101:105, width=10, names=rownames(se)))
    mcols(gr) <- row_data
    gr
    # GRanges object with 5 ranges and 4 metadata columns:
    #            seqnames    ranges strand |         X         Y            Z     T
    #               <Rle> <IRanges>  <Rle> | <integer> <integer> <data.frame> <Rle>
    #   feature1     chr1   101-110      * |        11        21      311:321    41
    #   feature2     chr1   102-111      * |        12        22      312:322    42
    #   feature3     chr1   103-112      * |        13        23      313:323    43
    #   feature4     chr1   104-113      * |        14        24      314:324    44
    #   feature5     chr1   105-114      * |        15        25      315:325    45
    #   -------
    #   seqinfo: 1 sequence from an unspecified genome; no seqlengths
    
    mcols(gr)$X
    #  A  B  C  D  E 
    # 11 12 13 14 15 
    
    mcols(gr))$Y
    #  a  b  c  d  e 
    # 21 22 23 24 25 
    
    mcols(gr)$Z
    #    x1  x2
    # v 311 321
    # w 312 322
    # x 313 323
    # y 314 324
    # z 315 325
    
    mcols(gr)$Z$x2
    # [1] 321 322 323 324 325
    
    mcols(gr)$T
    # integer-Rle of length 5 with 5 runs
    #   Lengths:  1  1  1  1  1
    #   Values : 41 42 43 44 45

    Again the original columns are intact. I don't think there was ever any expectation that names(gr) would be set on them.

Hope this makes sense,

H.

kevinrue commented 1 year ago

Thanks Hervé.

It does make sense. I could sense myself that dealing with nesting would open a can of worms, as per your "where does it end" point.

I had to enquire here, as any "fix" to the feature request (made in the iSEEde repository) would need to be implemented in SummarizedExperiment, not any downstream package.

Just to clarify or rephrase. I think we are talking about the same thing in different ways, but just to be clear, as I see how the title of the issue can be misleading:

The feature request is not so much about "touching" the individual columns themselves, but rather setting the rownames of the object created when they are accessed (either to be displayed on the console or assigned to a new object).

My link to setReplaceMethod("dimnames", c("SummarizedExperiment", "list") was actually a red herring.

What the feature request is really about is the getter method showMethods("rowData", classes = "SummarizedExperiment", includeDefs = TRUE).

Following the calls, rowData(se) ultimately invokes showMethods("elementMetadata", classes = "Vector", includeDefs = TRUE)

# Function: elementMetadata (package S4Vectors)
# x="Vector"
# function (x, use.names = TRUE, ...) 
# {
#     if (!isTRUEorFALSE(use.names)) 
#         stop("'use.names' must be TRUE or FALSE")
#     ans <- updateObject(x@elementMetadata, check = FALSE)
#     if (use.names && !is.null(ans)) 
#         rownames(ans) <- names(x)
#     ans
# }

My focus here is the line rownames(ans) <- names(x), which in the case of SummarizedExperiment fetches the slot@NAMES and assigns it to the rownames of the rowData.

Now, logically, doing that on the overall DataFrame does not go recursively into the individual columns, does not check whether they are nested dataframe-like structures, and does not set their own rownames, which goes back to our agreed point "where does it end".

The feature request arises from the fact that, once rowData(se) has returned that overall DataFrame, and the [[ method kicks in to extract a DataFrame nested in a column of rowData, the rownames of that nested DataFrame are returned as they were set in the original DataFrame nested in that column, while the rownames of that nested DataFrame really should be set to match the rownames of the overall rowData DataFrame, for consistency, especially if the rownames(se) were updated after that nested DataFrame was added.

See my use case, where I nest a DataFrame of differential expression results within a column of rowData, here: https://github.com/iSEE/iSEEde/blob/76e835f4c5756e057d04337c2528eddcc7514028/R/utils-SummarizedExperiment.R#L20

In iSEEde, I initially synchronise the rownames of the nested DataFrame with those of the overall object. However, if users update the rownames of the overall object, the change only applies to the rownames of the overall rowData, but not the rownames of the nested DataFrame of differential expression results. That last bit is the feature request I received, and that makes sense to me.

I understand that we're still facing the "where does it end" question, but I hope that is a bit clearer.

Best, Kevin

hpages commented 1 year ago

I see how the title of the issue can be misleading

It's not only misleading, it's incorrect. It's an important SE feature that rownames(se) and rownames(rowData(se)) are kept in sync all the time, and if you find a situation where this is not the case, then it's a bug. But that's not what this feature request is about. IIUC it's about keeping rownames(se) and the rownames of all the nested DataFrames in rowData(se) in sync. Am I correct?

The feature request arises from the fact that, once rowData(se) has returned that overall DataFrame, and the [[ method kicks in to extract a DataFrame nested in a column of rowData, the rownames of that nested DataFrame are returned as they were set in the original DataFrame nested in that column, while the rownames of that nested DataFrame really should be set to match the rownames of the overall rowData DataFrame, for consistency.

That is indeed my understanding of your request. But again, I don't think it's a good idea, and I explained why.

But maybe you are suggestting that just because a column in the rowData(se) DataFrame happens to be a DataFrame itself, then we should treat it differently than the non-DataFrame columns? If this is the case, should this exception also apply to other columns that are data-frame-like objects like data.frames, data.tables, tibbles, etc...? Should we also extend this exceptional treatment of data-frame-like columns to the DataFrames returned by colData(se) or mcols(x) (x being a Vector derivative)? Also, what do you suggest to do if there's more than one level of nesting?

These are important considerations that I raised already. Please let me know what's your take on them.

Thanks, H.

kevinrue commented 1 year ago

I've replied in the linked iSEEde issue and in summary: I agree entirely with you. It took a bit longer for my common sense to kick in than I would have liked, but I got there in the end :)

IIUC it's about keeping rownames(se) and the rownames of all the nested DataFrames in rowData(se) in sync. Am I correct?

100% correct

These are important considerations that I raised already. Please let me know what's your take on them.

Yup. Like I said, it took me one more night to sleep on it, but I did get to the same conclusion myself while you wrote your latest message. Happy to close the issue without any fix needed. Nested data.frame-like structures are the responsibility of the packages that create/manage them.