Open lmweber opened 2 years ago
If DropletUtils::read10xCounts()
already handle this, we could extend the arguments of read10xVisium()
with a ...
and pass it the to the read10xCounts
function that we already use for loading the counts.
Hi,
I'm not a fan of pasting numbers since you might want to re-order the samples later and thus "sample 1" might become "sample 10". This happened to us in one project (Lukas: Visium IF AD). Hence they key
part.
Initially the key
we use in spatialLIBD
was for enabling client-side highlighting as done at https://github.com/LieberInstitute/spatialLIBD/blob/master/R/app_server.R#L519-L520. Now we use it also for cluster_import()
https://github.com/LieberInstitute/spatialLIBD/blob/master/R/cluster_import.R#L49 and cluster_export()
, and any time we need to match info.
Best, Leo
@lcolladotor I don't think I have sufficient context to understand your concern, but isn't the spe$sample_id
serving as a re-nameable key?
Well, I guess that you could already re-run the code that pastes numbers based on the sample ids to fix the issue I have. Here's a full reprex
of what I mean.
## Some example data, imagine we read sample_id C, then B, then A
df <- data.frame(
sample_id = rep(LETTERS[3:1], each = 5),
barcode = rep(c("AAA", "AAC", "ABC", "ACD", "ADC"), 3)
)
## Code that appends the number based on the unique sample_id
add_key <- function(df) {
with(df, paste0(barcode, "-", as.integer(factor(sample_id, levels = unique(sample_id)))))
}
df$key_initial <- add_key(df)
## Code that adds the actual sample_id instead of a number
df$key_sample_id <- with(df, paste0(barcode, "_", sample_id))
## Note how the first sample_id is C, so it gets a -1 at the end
## then B, so it gets a -2, then A so it gets a -3 at the end.
unique(df$sample_id)
#> [1] "C" "B" "A"
with(df, table(sample_id, meaning = gsub(".*-", "", key_initial)))
#> meaning
#> sample_id 1 2 3
#> A 0 0 5
#> B 0 5 0
#> C 5 0 0
## Now we see what it looks like with sample_id
with(df, table(sample_id, meaning = gsub(".*_", "", key_sample_id)))
#> meaning
#> sample_id A B C
#> A 5 0 0
#> B 0 5 0
#> C 0 0 5
## Re-order data after having created it. Maybe we want to have sample_id's
## listed as A, then B, then C now.
df <- df[c(11:15, 6:10, 1:5), ]
## After having re-ordered the data, now the first sample_id gets a -3 which
## can be confusing. Since the -1 lost the meaning of being the first
## sample_id.
unique(df$sample_id)
#> [1] "A" "B" "C"
with(df, table(sample_id, meaning = gsub(".*-", "", key_initial)))
#> meaning
#> sample_id 1 2 3
#> A 0 0 5
#> B 0 5 0
#> C 5 0 0
## Re-running the code that adds the -Number at the end restores the meaning
## of -1 being the first sample.
df$key_new <- add_key(df)
with(df, table(sample_id, meaning = gsub(".*-", "", key_new)))
#> meaning
#> sample_id 1 2 3
#> A 5 0 0
#> B 0 5 0
#> C 0 0 5
## This hasn't changed in meaning even after re-ordering the data
with(df, table(sample_id, meaning = gsub(".*_", "", key_sample_id)))
#> meaning
#> sample_id A B C
#> A 5 0 0
#> B 0 5 0
#> C 0 0 5
Created on 2022-02-22 by the reprex package (v2.0.1)
That's why I prefer to use the sample_id
from the beginning instead of numbers whose meaning might change, in the process of creating those unique spot/column identifiers in the spe
object.
FYI After my initial comment https://github.com/drighelli/SpatialExperiment/issues/100#issuecomment-1047845022 I ended up making spatialLIBD::add_key()
to reduce code duplication and add the overwrite
parameter https://github.com/LieberInstitute/spatialLIBD/blob/master/R/add_key.R.
I don't much care if it's the sample_id
or an autogenerated number that is used as a prefix to ensure unique colnames.
Getting a little off-topic but related to @drighelli's comment:
If DropletUtils::read10xCounts() already handle this, we could extend the arguments of read10xVisium() with a ... and pass it the to the read10xCounts function that we already use for loading the counts.
As a user familiar with DropletUtils::read10xCounts()
, I was sort of expecting SpatialExperiment::read10xVisium()
to return much the same thing as DropletUtils::read10Counts()
but with an 'upgrade' to a SPE containing the necessary image-related data.
The slight differences between the 2 function's arguments and what they return creates slight friction for the user.
There are perhaps minor issues (and some are probably too late to change) but I note a few other things that struck me in my first use of SpatialExperiment
suppressPackageStartupMessages(library(SpatialExperiment))
suppressPackageStartupMessages(library(DropletUtils))
dir <- system.file(
file.path("extdata", "10xVisium"),
package = "SpatialExperiment")
sample_ids <- c("section1", "section2")
samples <- file.path(dir, sample_ids)
# SPE uses `sample_id` whereas SCE uses `sample.names`.
spe <- read10xVisium(
samples = samples,
sample_id = sample_ids,
type = "sparse",
data = "raw",
images = "lowres",
load = FALSE)
sce <- read10xCounts(
samples = file.path(samples, "raw_feature_bc_matrix"),
sample.names = sample_ids,
# NOTE: col.names = FALSE is the default
col.names = TRUE,
type = "sparse")
# SPE has duplicated colnames, SCE doesn't
anyDuplicated(colnames(spe))
#> [1] 51
anyDuplicated(colnames(sce))
#> [1] 0
# SCE colnames are prefixed by a number (non-default option) but SPE aren't
head(colnames(spe))
#> [1] "AAACAACGAATAGTTC-1" "AAACAAGTATCTCCCA-1" "AAACAATCTACTAGCA-1"
#> [4] "AAACACCAATAACTGC-1" "AAACAGAGCGACTCCT-1" "AAACAGCTTTCAGAAG-1"
head(colnames(sce))
#> [1] "1_AAACAACGAATAGTTC-1" "1_AAACAAGTATCTCCCA-1" "1_AAACAATCTACTAGCA-1"
#> [4] "1_AAACACCAATAACTGC-1" "1_AAACAGAGCGACTCCT-1" "1_AAACAGCTTTCAGAAG-1"
tail(colnames(spe))
#> [1] "AAAGGTAAGCTGTACC-1" "AAAGGTCAACGACATG-1" "AAAGTAGCATTGCTCA-1"
#> [4] "AAAGTCACTGATGTAA-1" "AAAGTCGACCCTCAGT-1" "AAAGTGCCATCAATTA-1"
tail(colnames(sce))
#> [1] "2_AAAGGTAAGCTGTACC-1" "2_AAAGGTCAACGACATG-1" "2_AAAGTAGCATTGCTCA-1"
#> [4] "2_AAAGTCACTGATGTAA-1" "2_AAAGTCGACCCTCAGT-1" "2_AAAGTGCCATCAATTA-1"
# coldata differs (in both colnames and columns)
colData(spe)
#> DataFrame with 99 rows and 1 column
#> sample_id
#> <character>
#> AAACAACGAATAGTTC-1 section1
#> AAACAAGTATCTCCCA-1 section1
#> AAACAATCTACTAGCA-1 section1
#> AAACACCAATAACTGC-1 section1
#> AAACAGAGCGACTCCT-1 section1
#> ... ...
#> AAAGGTCAACGACATG-1 section2
#> AAAGTAGCATTGCTCA-1 section2
#> AAAGTCACTGATGTAA-1 section2
#> AAAGTCGACCCTCAGT-1 section2
#> AAAGTGCCATCAATTA-1 section2
colData(sce)
#> DataFrame with 100 rows and 2 columns
#> Sample Barcode
#> <character> <character>
#> 1_AAACAACGAATAGTTC-1 section1 AAACAACGAATAGTTC-1
#> 1_AAACAAGTATCTCCCA-1 section1 AAACAAGTATCTCCCA-1
#> 1_AAACAATCTACTAGCA-1 section1 AAACAATCTACTAGCA-1
#> 1_AAACACCAATAACTGC-1 section1 AAACACCAATAACTGC-1
#> 1_AAACAGAGCGACTCCT-1 section1 AAACAGAGCGACTCCT-1
#> ... ... ...
#> 2_AAAGGTCAACGACATG-1 section2 AAAGGTCAACGACATG-1
#> 2_AAAGTAGCATTGCTCA-1 section2 AAAGTAGCATTGCTCA-1
#> 2_AAAGTCACTGATGTAA-1 section2 AAAGTCACTGATGTAA-1
#> 2_AAAGTCGACCCTCAGT-1 section2 AAAGTCGACCCTCAGT-1
#> 2_AAAGTGCCATCAATTA-1 section2 AAAGTGCCATCAATTA-1
# rowdata differs (in both colnames and columns)
rowData(spe)
#> DataFrame with 50 rows and 1 column
#> symbol
#> <character>
#> ENSMUSG00000051951 Xkr4
#> ENSMUSG00000089699 Gm1992
#> ENSMUSG00000102343 Gm37381
#> ENSMUSG00000025900 Rp1
#> ENSMUSG00000025902 Sox17
#> ... ...
#> ENSMUSG00000025938 Slco5a1
#> ENSMUSG00000099498 Gm29283
#> ENSMUSG00000042414 Prdm14
#> ENSMUSG00000005886 Ncoa2
#> ENSMUSG00000101476 Gm29570
rowData(sce)
#> DataFrame with 50 rows and 3 columns
#> ID Symbol Type
#> <character> <character> <character>
#> ENSMUSG00000051951 ENSMUSG00000051951 Xkr4 Gene Expression
#> ENSMUSG00000089699 ENSMUSG00000089699 Gm1992 Gene Expression
#> ENSMUSG00000102343 ENSMUSG00000102343 Gm37381 Gene Expression
#> ENSMUSG00000025900 ENSMUSG00000025900 Rp1 Gene Expression
#> ENSMUSG00000025902 ENSMUSG00000025902 Sox17 Gene Expression
#> ... ... ... ...
#> ENSMUSG00000025938 ENSMUSG00000025938 Slco5a1 Gene Expression
#> ENSMUSG00000099498 ENSMUSG00000099498 Gm29283 Gene Expression
#> ENSMUSG00000042414 ENSMUSG00000042414 Prdm14 Gene Expression
#> ENSMUSG00000005886 ENSMUSG00000005886 Ncoa2 Gene Expression
#> ENSMUSG00000101476 ENSMUSG00000101476 Gm29570 Gene Expression
Created on 2022-02-23 by the reprex package (v2.0.1)
One other minor thing I noticed with SPE is that it's tab-completion of the $
-method is different from SCE (SPE seems to tab-complete over rows whereas SCE tab-completes over columns of colData)
Re
$
-autocompletion @PeteHaitch I noticed this as well! My only guess is that this is caused by our over-writing thecolData
replacement method, but I don't understand why this would have such an effect :/ The$
accession still works, but I find it very annoying to have genes pop up versus the available metadata... Not sure how to fix this at the moment.Re duplicated column names Sorry, I got a bit lost. Do we have a consensus? I think my preference is for using
sample_id
s as a prefix vs. appending numbers. The only downside of this is if someone renames them (i.e., replaces thesample_id
s in thecolData
). Then again, the same discrepancy would arise withread10xCounts
I believe.
Maybe worth discussing as well: Should we always do this, even if there's just one sample? The reasoning would be that if someone were to create two independent SPEs andcbind
them, barcodes would again be duplicated. Alternatively, this could be handled by thecbind
method, which also assures sample IDs are unique. E.g., if two SPEs both containsample01
(default),cbind
would rename one (and could also relabel the barcodes).
Thanks @PeteHaitch for a so exhaustive overview of this issue.
In my personal opinion, I like the idea to have a consistency between the read10xVisium
and the read10xCounts
, in particular thinking about the bigger Bioconductor project where all the objects should behave in a similar way.
About the rowData
, we can surely remove the ID
column once we set the rowData
rownames
, it's redundant.
But this doesn't mean that if we have additional information, such as Type
, we shouldn't keep them...
Also, I don't know if the community has "standards" for the colnames
naming, in terms of upper/lowercase, but it would be nice to discuss it.
About the $-autocompletion, I don't know why R-studio acts like that, but when using the R console it works fine to me!
About the $-autocompletion, I don't know why R-studio acts like that, but when using the R console it works fine to me!
Interesting, I'd meant to check that and I now see that in command-line R it works as you observe, so it's an RStudio-specific thing but would still be good to resolve.
spe
not having spe$Barcode
I would benefit from having it listed, despite the duplication in information. So I +1 his request about it. I would then change colnames(spe)
to spe$Barcode
at https://github.com/LieberInstitute/spatialLIBD/blob/master/R/add_key.R#L45.
spe$sample_id
vs spe$Sample
If you change the spe$sample_id
to say spe$Sample
, I think that it would be nice that the updateObject()
did this since it'd be easy to miss this change. Though like Pete hinted at, maybe it's too late for that one.
Sounds like you also like using the sample_id
instead of a number. Though it would be different from read10xCounts()
, but I vote for living with that difference.
I think that you can let people decide how to rename things instead of doing it for them with cbind()
though. So I wouldn't necessarily try to do too much with cbind()
, though that might be me thinking that it might be going too far to help users who at that point might have changed things in their objects after using read10xVisium()
. On the other side, it's like re-using the spe$Barcode
(if it's added) and spe$sample_id
(or spe$Sample
) info which would be ok if users know that will happen and colnames(spe)
will be overwritten by cbind()
. Hm.....
I actually decided to overwrite what's included, so I personally don't care. I do so at https://github.com/LieberInstitute/spatialLIBD/blob/master/R/read10xVisiumWrapper.R#L86-L113 where I keep even more info than what read10xVisium()
currently keeps. When I mentioned this to Lukas, the sentiment was that such a change was better left of outside SpatialExperiment
. Though well, if you wanted to include it, I could do a PR about it. Note that I locate the GTF automatically if there's a web_summary.html
file for the first sample https://github.com/LieberInstitute/spatialLIBD/blob/master/R/read10xVisiumWrapper.R#L59-L65.
That'd be nice, but well, I don't know what would need to change or hard it would be.
Thanks @PeteHaitch for the detailed examples.
I agree that some of these differences vs. the defaults from DropletUtils::read10xCounts()
could be improved, although some of them are probably also too locked in by now.
Specifically, I think we could think about changing the following in SpatialExperiment::read10xVisium()
:
sample_id
. This is slightly different to the SCE default (prefixing with 1
or 2
etc), but it ensures internal consistency with our sample_id
, and also ensures that the IDs are in the correct order in datasets with more than 10 samplesbarcode_id
as a default column in colData
(comparable to Barcode
in SCE)gene_id
as a default column in rowData
(comparable to ID
in SCE, and then this is also consistent with having the extra barcode_id
column in colData
). I never really use the other column (Type
), although if we are going for consistency and including everything from Space Ranger, then I think it would also make sense to include this one by default.On the question of whether to prefix colnames with sample_id
even in a dataset with only a single sample, I think I lean towards simplicity and only prefixing it by default if there are multiple samples, and let people rebuild their objects (instead of cbind
ing) if this is a problem later. I'm not completely sure about this though.
Thanks, @lmweber.
I'd favour using the exact same colnames
in colData
/rowData
as DropletUtils::read10xCounts()
where possible.
The Type
column is useful when the data contains multiple modalities (e.g., Gene Expression
and Antibody Capture
) to enable easy use of splitAltExps()
to split the count matrix by modality.
There's not much multimodal spatial data around yet, but it is coming.
On the question of whether to prefix colnames with sample_id even in a dataset with only a single sample, I think I lean towards simplicity and only prefixing it by default if there are multiple samples
I agree, and it's consistent with what DropletUtils::read10xCounts()
does.
This was raised by @PeteHaitch on Slack.
Currently when we create a SPE object containing multiple Visium samples with
read10xVisium()
, the column names (barcode IDs) are repeated, since 10x Genomics uses the same set of 4992 barcode IDs for each capture area.We could think about disambiguating this in
read10xVisium()
, e.g. using something likekey_id <- paste(sample_id, barcode_id, sep = "_")
for the column names, which is how we have done it in ourspatialLIBD
Shiny apps with @lcolladotor (where column names need to be unique).There is also the following (slightly different) precedent from
DropletUtils::read10xCounts()
from single-cell, also mentioned by @PeteHaitch on Slack:"If col.names=TRUE and length(sample)==1, each column is named by the cell barcode. For multiple samples, the index of each sample in samples is concatenated to the cell barcode to form the column name. This avoids problems with multiple instances of the same cell barcodes in different samples."
Note that in our case Space Ranger already appends a-1
to all barcode IDs for Visium data.