Closed kevinrue closed 1 year ago
Hi Kevin,
The title of the issue:
Feature request: keep rownames(se) and rownames(rowData(se)) in sync
But isn't it what happens in all the examples that you show above?
To be clear, it does apply to the overall rowData() but not nested DataFrames within it.
I'm not convinced that the rowData()
getter has any business in touching the individual columns of the returned DataFrame. The user can have all kinds of things in these columns, a DataFrame or data.frame being just one possibility amongst many others:
X <- setNames(11:15, LETTERS[1:5])
Y <- setNames(21:25, letters[1:5])
Z <- data.frame(x1=311:315, x2=321:325, row.names=letters[22:26])
T <- Rle(41:45)
row_data <- DataFrame(X=X, Y=Y, Z=I(Z), T=T, row.names=paste0("feature", 1:5))
row_data
# DataFrame with 5 rows and 4 columns
# X Y Z T
# <integer> <integer> <data.frame> <Rle>
# feature1 11 21 311:321 41
# feature2 12 22 312:322 42
# feature3 13 23 313:323 43
# feature4 14 24 314:324 44
# feature5 15 25 315:325 45
se <- SummarizedExperiment(assays=list(count=matrix(rpois(15, lambda=9), nrow=5)), rowData=row_data)
rowData(se)$X
# A B C D E
# 11 12 13 14 15
rowData(se)$Y
# a b c d e
# 21 22 23 24 25
rowData(se)$Z
# x1 x2
# v 311 321
# w 312 322
# x 313 323
# y 314 324
# z 315 325
rowData(se)$Z$x2
# [1] 321 322 323 324 325
rowData(se)$T
# integer-Rle of length 5 with 5 runs
# Lengths: 1 1 1 1 1
# Values : 41 42 43 44 45
The original columns are intact and IMO that's how it should be.
Let's keep in mind that:
The individual columns can be anything and trying to set the names on any of them to the rownames of the SummarizedExperiment object is not even guaranteed to work:
names(rowData(se)$T) <- rownames(se)
# Error in names(rowData(se)$T) <- rownames(se) :
# invalid to use names()<- on an S4 object of class 'Rle'
The nesting could go deep in the rowData DataFrame, with possibly some data.frames, tibbles, data.tables involved at various levels of the nesting tree. If we were to start the business of keeping the rownames of the data-frame-like objects nested within rowData(se)
, where should we stop? This could quickly become very inefficient.
Finally note that there would be no reason to not apply the same logic to colData(se)
or to the metadata columns of a Vector derivative:
gr <- GRanges("chr1", IRanges(start=101:105, width=10, names=rownames(se)))
mcols(gr) <- row_data
gr
# GRanges object with 5 ranges and 4 metadata columns:
# seqnames ranges strand | X Y Z T
# <Rle> <IRanges> <Rle> | <integer> <integer> <data.frame> <Rle>
# feature1 chr1 101-110 * | 11 21 311:321 41
# feature2 chr1 102-111 * | 12 22 312:322 42
# feature3 chr1 103-112 * | 13 23 313:323 43
# feature4 chr1 104-113 * | 14 24 314:324 44
# feature5 chr1 105-114 * | 15 25 315:325 45
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths
mcols(gr)$X
# A B C D E
# 11 12 13 14 15
mcols(gr))$Y
# a b c d e
# 21 22 23 24 25
mcols(gr)$Z
# x1 x2
# v 311 321
# w 312 322
# x 313 323
# y 314 324
# z 315 325
mcols(gr)$Z$x2
# [1] 321 322 323 324 325
mcols(gr)$T
# integer-Rle of length 5 with 5 runs
# Lengths: 1 1 1 1 1
# Values : 41 42 43 44 45
Again the original columns are intact. I don't think there was ever any expectation that names(gr)
would be set on them.
Hope this makes sense,
H.
Thanks Hervé.
It does make sense. I could sense myself that dealing with nesting would open a can of worms, as per your "where does it end" point.
I had to enquire here, as any "fix" to the feature request (made in the iSEEde
repository) would need to be implemented in SummarizedExperiment
, not any downstream package.
Just to clarify or rephrase. I think we are talking about the same thing in different ways, but just to be clear, as I see how the title of the issue can be misleading:
The feature request is not so much about "touching" the individual columns themselves, but rather setting the rownames of the object created when they are accessed (either to be displayed on the console or assigned to a new object).
My link to setReplaceMethod("dimnames", c("SummarizedExperiment", "list")
was actually a red herring.
What the feature request is really about is the getter method showMethods("rowData", classes = "SummarizedExperiment", includeDefs = TRUE)
.
Following the calls, rowData(se)
ultimately invokes showMethods("elementMetadata", classes = "Vector", includeDefs = TRUE)
# Function: elementMetadata (package S4Vectors)
# x="Vector"
# function (x, use.names = TRUE, ...)
# {
# if (!isTRUEorFALSE(use.names))
# stop("'use.names' must be TRUE or FALSE")
# ans <- updateObject(x@elementMetadata, check = FALSE)
# if (use.names && !is.null(ans))
# rownames(ans) <- names(x)
# ans
# }
My focus here is the line rownames(ans) <- names(x)
, which in the case of SummarizedExperiment
fetches the slot@NAMES
and assigns it to the rownames
of the rowData
.
Now, logically, doing that on the overall DataFrame
does not go recursively into the individual columns, does not check whether they are nested dataframe-like structures, and does not set their own rownames
, which goes back to our agreed point "where does it end".
The feature request arises from the fact that, once rowData(se)
has returned that overall DataFrame
, and the [[
method kicks in to extract a DataFrame
nested in a column of rowData
, the rownames of that nested DataFrame
are returned as they were set in the original DataFrame nested in that column, while the rownames of that nested DataFrame really should be set to match the rownames of the overall rowData
DataFrame, for consistency, especially if the rownames(se)
were updated after that nested DataFrame was added.
See my use case, where I nest a DataFrame
of differential expression results within a column of rowData
, here: https://github.com/iSEE/iSEEde/blob/76e835f4c5756e057d04337c2528eddcc7514028/R/utils-SummarizedExperiment.R#L20
In iSEEde, I initially synchronise the rownames of the nested DataFrame with those of the overall object. However, if users update the rownames of the overall object, the change only applies to the rownames of the overall rowData, but not the rownames of the nested DataFrame of differential expression results. That last bit is the feature request I received, and that makes sense to me.
I understand that we're still facing the "where does it end" question, but I hope that is a bit clearer.
Best, Kevin
I see how the title of the issue can be misleading
It's not only misleading, it's incorrect. It's an important SE feature that rownames(se)
and rownames(rowData(se))
are kept in sync all the time, and if you find a situation where this is not the case, then it's a bug. But that's not what this feature request is about. IIUC it's about keeping rownames(se)
and the rownames of all the nested DataFrames in rowData(se)
in sync. Am I correct?
The feature request arises from the fact that, once
rowData(se)
has returned that overall DataFrame, and the[[
method kicks in to extract a DataFrame nested in a column of rowData, the rownames of that nested DataFrame are returned as they were set in the original DataFrame nested in that column, while the rownames of that nested DataFrame really should be set to match the rownames of the overall rowData DataFrame, for consistency.
That is indeed my understanding of your request. But again, I don't think it's a good idea, and I explained why.
But maybe you are suggestting that just because a column in the rowData(se)
DataFrame happens to be a DataFrame itself, then we should treat it differently than the non-DataFrame columns? If this is the case, should this exception also apply to other columns that are data-frame-like objects like data.frames, data.tables, tibbles, etc...? Should we also extend this exceptional treatment of data-frame-like columns to the DataFrames returned by colData(se)
or mcols(x)
(x
being a Vector derivative)? Also, what do you suggest to do if there's more than one level of nesting?
These are important considerations that I raised already. Please let me know what's your take on them.
Thanks, H.
I've replied in the linked iSEEde
issue and in summary: I agree entirely with you. It took a bit longer for my common sense to kick in than I would have liked, but I got there in the end :)
IIUC it's about keeping rownames(se) and the rownames of all the nested DataFrames in rowData(se) in sync. Am I correct?
100% correct
These are important considerations that I raised already. Please let me know what's your take on them.
Yup. Like I said, it took me one more night to sleep on it, but I did get to the same conclusion myself while you wrote your latest message. Happy to close the issue without any fix needed. Nested data.frame-like structures are the responsibility of the packages that create/manage them.
See https://github.com/iSEE/iSEEde/issues/19
Briefly, the updated
@NAMES
slot is not used to setrownames
when accessing nested DataFrames inrowData
/mcols()
/elementMetadata()
.To be clear, it does apply to the overall
rowData()
but not nested DataFrames within it.Thoughts?
Thanks in advance
Created on 2023-03-23 with reprex v2.0.2