Ambiguous column names in SPE objects with multiple samples

lmweber commented 2 years ago

This was raised by @PeteHaitch on Slack.

Currently when we create a SPE object containing multiple Visium samples with read10xVisium(), the column names (barcode IDs) are repeated, since 10x Genomics uses the same set of 4992 barcode IDs for each capture area.

We could think about disambiguating this in read10xVisium(), e.g. using something like key_id <- paste(sample_id, barcode_id, sep = "_") for the column names, which is how we have done it in our spatialLIBD Shiny apps with @lcolladotor (where column names need to be unique).

There is also the following (slightly different) precedent from DropletUtils::read10xCounts() from single-cell, also mentioned by @PeteHaitch on Slack: "If col.names=TRUE and length(sample)==1, each column is named by the cell barcode. For multiple samples, the index of each sample in samples is concatenated to the cell barcode to form the column name. This avoids problems with multiple instances of the same cell barcodes in different samples." Note that in our case Space Ranger already appends a -1 to all barcode IDs for Visium data.

drighelli commented 2 years ago

If DropletUtils::read10xCounts() already handle this, we could extend the arguments of read10xVisium() with a ... and pass it the to the read10xCounts function that we already use for loading the counts.

lcolladotor commented 2 years ago

Hi,

I'm not a fan of pasting numbers since you might want to re-order the samples later and thus "sample 1" might become "sample 10". This happened to us in one project (Lukas: Visium IF AD). Hence they key part.

Initially the key we use in spatialLIBD was for enabling client-side highlighting as done at https://github.com/LieberInstitute/spatialLIBD/blob/master/R/app_server.R#L519-L520. Now we use it also for cluster_import() https://github.com/LieberInstitute/spatialLIBD/blob/master/R/cluster_import.R#L49 and cluster_export(), and any time we need to match info.

Best, Leo

PeteHaitch commented 2 years ago

@lcolladotor I don't think I have sufficient context to understand your concern, but isn't the spe$sample_id serving as a re-nameable key?

lcolladotor commented 2 years ago

Well, I guess that you could already re-run the code that pastes numbers based on the sample ids to fix the issue I have. Here's a full reprex of what I mean.

## Some example data, imagine we read sample_id C, then B, then A
df <- data.frame(
    sample_id = rep(LETTERS[3:1], each = 5),
    barcode = rep(c("AAA", "AAC", "ABC", "ACD", "ADC"), 3)
)

## Code that appends the number based on the unique sample_id
add_key <- function(df) {
    with(df, paste0(barcode, "-", as.integer(factor(sample_id, levels = unique(sample_id)))))
}
df$key_initial <- add_key(df)

## Code that adds the actual sample_id instead of a number
df$key_sample_id <- with(df, paste0(barcode, "_", sample_id))

## Note how the first sample_id is C, so it gets a -1 at the end
## then B, so it gets a -2, then A so it gets a -3 at the end.
unique(df$sample_id)
#> [1] "C" "B" "A"
with(df, table(sample_id, meaning = gsub(".*-", "", key_initial)))
#>          meaning
#> sample_id 1 2 3
#>         A 0 0 5
#>         B 0 5 0
#>         C 5 0 0

## Now we see what it looks like with sample_id
with(df, table(sample_id, meaning = gsub(".*_", "", key_sample_id)))
#>          meaning
#> sample_id A B C
#>         A 5 0 0
#>         B 0 5 0
#>         C 0 0 5

## Re-order data after having created it. Maybe we want to have sample_id's
## listed as A, then B, then C now.
df <- df[c(11:15, 6:10, 1:5), ]

## After having re-ordered the data, now the first sample_id gets a -3 which
## can be confusing. Since the -1 lost the meaning of being the first
## sample_id.
unique(df$sample_id)
#> [1] "A" "B" "C"
with(df, table(sample_id, meaning = gsub(".*-", "", key_initial)))
#>          meaning
#> sample_id 1 2 3
#>         A 0 0 5
#>         B 0 5 0
#>         C 5 0 0

## Re-running the code that adds the -Number at the end restores the meaning
## of -1 being the first sample.
df$key_new <- add_key(df)
with(df, table(sample_id, meaning = gsub(".*-", "", key_new)))
#>          meaning
#> sample_id 1 2 3
#>         A 5 0 0
#>         B 0 5 0
#>         C 0 0 5

## This hasn't changed in meaning even after re-ordering the data
with(df, table(sample_id, meaning = gsub(".*_", "", key_sample_id)))
#>          meaning
#> sample_id A B C
#>         A 5 0 0
#>         B 0 5 0
#>         C 0 0 5

^{Created on 2022-02-22 by the reprex package (v2.0.1)}

lcolladotor commented 2 years ago

That's why I prefer to use the sample_id from the beginning instead of numbers whose meaning might change, in the process of creating those unique spot/column identifiers in the spe object.

lcolladotor commented 2 years ago

FYI After my initial comment https://github.com/drighelli/SpatialExperiment/issues/100#issuecomment-1047845022 I ended up making spatialLIBD::add_key() to reduce code duplication and add the overwrite parameter https://github.com/LieberInstitute/spatialLIBD/blob/master/R/add_key.R.

PeteHaitch commented 2 years ago

I don't much care if it's the sample_id or an autogenerated number that is used as a prefix to ensure unique colnames.

PeteHaitch commented 2 years ago

Getting a little off-topic but related to @drighelli's comment:

If DropletUtils::read10xCounts() already handle this, we could extend the arguments of read10xVisium() with a ... and pass it the to the read10xCounts function that we already use for loading the counts.

As a user familiar with DropletUtils::read10xCounts(), I was sort of expecting SpatialExperiment::read10xVisium() to return much the same thing as DropletUtils::read10Counts() but with an 'upgrade' to a SPE containing the necessary image-related data. The slight differences between the 2 function's arguments and what they return creates slight friction for the user. There are perhaps minor issues (and some are probably too late to change) but I note a few other things that struck me in my first use of SpatialExperiment

suppressPackageStartupMessages(library(SpatialExperiment))
suppressPackageStartupMessages(library(DropletUtils))

dir <- system.file(
  file.path("extdata", "10xVisium"),
  package = "SpatialExperiment")
sample_ids <- c("section1", "section2")
samples <- file.path(dir, sample_ids)

# SPE uses `sample_id` whereas SCE uses `sample.names`.
spe <- read10xVisium(
  samples = samples,
  sample_id = sample_ids,
  type = "sparse",
  data = "raw",
  images = "lowres",
  load = FALSE)
sce <- read10xCounts(
  samples = file.path(samples, "raw_feature_bc_matrix"),
  sample.names = sample_ids,
  # NOTE: col.names = FALSE is the default
  col.names = TRUE,
  type = "sparse")

# SPE has duplicated colnames, SCE doesn't
anyDuplicated(colnames(spe))
#> [1] 51
anyDuplicated(colnames(sce))
#> [1] 0

# SCE colnames are prefixed by a number (non-default option) but SPE aren't
head(colnames(spe))
#> [1] "AAACAACGAATAGTTC-1" "AAACAAGTATCTCCCA-1" "AAACAATCTACTAGCA-1"
#> [4] "AAACACCAATAACTGC-1" "AAACAGAGCGACTCCT-1" "AAACAGCTTTCAGAAG-1"
head(colnames(sce))
#> [1] "1_AAACAACGAATAGTTC-1" "1_AAACAAGTATCTCCCA-1" "1_AAACAATCTACTAGCA-1"
#> [4] "1_AAACACCAATAACTGC-1" "1_AAACAGAGCGACTCCT-1" "1_AAACAGCTTTCAGAAG-1"
tail(colnames(spe))
#> [1] "AAAGGTAAGCTGTACC-1" "AAAGGTCAACGACATG-1" "AAAGTAGCATTGCTCA-1"
#> [4] "AAAGTCACTGATGTAA-1" "AAAGTCGACCCTCAGT-1" "AAAGTGCCATCAATTA-1"
tail(colnames(sce))
#> [1] "2_AAAGGTAAGCTGTACC-1" "2_AAAGGTCAACGACATG-1" "2_AAAGTAGCATTGCTCA-1"
#> [4] "2_AAAGTCACTGATGTAA-1" "2_AAAGTCGACCCTCAGT-1" "2_AAAGTGCCATCAATTA-1"

# coldata differs (in both colnames and columns)
colData(spe)
#> DataFrame with 99 rows and 1 column
#>                      sample_id
#>                    <character>
#> AAACAACGAATAGTTC-1    section1
#> AAACAAGTATCTCCCA-1    section1
#> AAACAATCTACTAGCA-1    section1
#> AAACACCAATAACTGC-1    section1
#> AAACAGAGCGACTCCT-1    section1
#> ...                        ...
#> AAAGGTCAACGACATG-1    section2
#> AAAGTAGCATTGCTCA-1    section2
#> AAAGTCACTGATGTAA-1    section2
#> AAAGTCGACCCTCAGT-1    section2
#> AAAGTGCCATCAATTA-1    section2
colData(sce)
#> DataFrame with 100 rows and 2 columns
#>                           Sample            Barcode
#>                      <character>        <character>
#> 1_AAACAACGAATAGTTC-1    section1 AAACAACGAATAGTTC-1
#> 1_AAACAAGTATCTCCCA-1    section1 AAACAAGTATCTCCCA-1
#> 1_AAACAATCTACTAGCA-1    section1 AAACAATCTACTAGCA-1
#> 1_AAACACCAATAACTGC-1    section1 AAACACCAATAACTGC-1
#> 1_AAACAGAGCGACTCCT-1    section1 AAACAGAGCGACTCCT-1
#> ...                          ...                ...
#> 2_AAAGGTCAACGACATG-1    section2 AAAGGTCAACGACATG-1
#> 2_AAAGTAGCATTGCTCA-1    section2 AAAGTAGCATTGCTCA-1
#> 2_AAAGTCACTGATGTAA-1    section2 AAAGTCACTGATGTAA-1
#> 2_AAAGTCGACCCTCAGT-1    section2 AAAGTCGACCCTCAGT-1
#> 2_AAAGTGCCATCAATTA-1    section2 AAAGTGCCATCAATTA-1

# rowdata differs (in both colnames and columns)
rowData(spe)
#> DataFrame with 50 rows and 1 column
#>                         symbol
#>                    <character>
#> ENSMUSG00000051951        Xkr4
#> ENSMUSG00000089699      Gm1992
#> ENSMUSG00000102343     Gm37381
#> ENSMUSG00000025900         Rp1
#> ENSMUSG00000025902       Sox17
#> ...                        ...
#> ENSMUSG00000025938     Slco5a1
#> ENSMUSG00000099498     Gm29283
#> ENSMUSG00000042414      Prdm14
#> ENSMUSG00000005886       Ncoa2
#> ENSMUSG00000101476     Gm29570
rowData(sce)
#> DataFrame with 50 rows and 3 columns
#>                                    ID      Symbol            Type
#>                           <character> <character>     <character>
#> ENSMUSG00000051951 ENSMUSG00000051951        Xkr4 Gene Expression
#> ENSMUSG00000089699 ENSMUSG00000089699      Gm1992 Gene Expression
#> ENSMUSG00000102343 ENSMUSG00000102343     Gm37381 Gene Expression
#> ENSMUSG00000025900 ENSMUSG00000025900         Rp1 Gene Expression
#> ENSMUSG00000025902 ENSMUSG00000025902       Sox17 Gene Expression
#> ...                               ...         ...             ...
#> ENSMUSG00000025938 ENSMUSG00000025938     Slco5a1 Gene Expression
#> ENSMUSG00000099498 ENSMUSG00000099498     Gm29283 Gene Expression
#> ENSMUSG00000042414 ENSMUSG00000042414      Prdm14 Gene Expression
#> ENSMUSG00000005886 ENSMUSG00000005886       Ncoa2 Gene Expression
#> ENSMUSG00000101476 ENSMUSG00000101476     Gm29570 Gene Expression

^{Created on 2022-02-23 by the reprex package (v2.0.1)}

Session info

``` r sessioninfo::session_info() #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 4.1.2 (2021-11-01) #> os Ubuntu 20.04.4 LTS #> system x86_64, linux-gnu #> ui X11 #> language en_AU:en #> collate en_AU.UTF-8 #> ctype en_AU.UTF-8 #> tz Australia/Melbourne #> date 2022-02-23 #> pandoc 2.17.1.1 @ /usr/lib/rstudio/bin/quarto/bin/ (via rmarkdown) #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> package * version date (UTC) lib source #> beachmat 2.10.0 2021-10-26 [1] RSPM (R 4.1.2) #> Biobase * 2.54.0 2021-10-26 [1] RSPM (R 4.1.2) #> BiocGenerics * 0.40.0 2021-10-26 [1] RSPM (R 4.1.2) #> BiocParallel 1.28.3 2021-12-09 [1] RSPM (R 4.1.2) #> bitops 1.0-7 2021-04-24 [1] RSPM (R 4.1.0) #> cli 3.2.0 2022-02-14 [1] RSPM (R 4.1.2) #> DelayedArray 0.20.0 2021-10-26 [1] RSPM (R 4.1.2) #> DelayedMatrixStats 1.16.0 2021-10-26 [1] RSPM (R 4.1.2) #> digest 0.6.29 2021-12-01 [1] RSPM (R 4.1.2) #> dqrng 0.3.0 2021-05-01 [1] RSPM (R 4.1.0) #> DropletUtils * 1.14.2 2022-01-09 [1] RSPM (R 4.1.2) #> edgeR 3.36.0 2021-10-26 [1] RSPM (R 4.1.2) #> evaluate 0.15 2022-02-18 [1] RSPM (R 4.1.0) #> fastmap 1.1.0 2021-01-25 [1] RSPM (R 4.1.0) #> fs 1.5.2 2021-12-08 [1] RSPM (R 4.1.2) #> GenomeInfoDb * 1.30.1 2022-01-30 [1] RSPM (R 4.1.2) #> GenomeInfoDbData 1.2.7 2021-10-28 [1] RSPM (R 4.1.1) #> GenomicRanges * 1.46.1 2021-11-18 [1] RSPM (R 4.1.2) #> glue 1.6.1 2022-01-22 [1] RSPM (R 4.1.2) #> HDF5Array 1.22.1 2021-11-14 [1] RSPM (R 4.1.2) #> highr 0.9 2021-04-16 [1] RSPM (R 4.1.0) #> htmltools 0.5.2 2021-08-25 [1] RSPM (R 4.1.0) #> IRanges * 2.28.0 2021-10-26 [1] RSPM (R 4.1.2) #> knitr 1.37 2021-12-16 [1] RSPM (R 4.1.0) #> lattice 0.20-45 2021-09-22 [4] CRAN (R 4.1.1) #> limma 3.50.1 2022-02-17 [1] Bioconductor #> locfit 1.5-9.4 2020-03-25 [1] RSPM (R 4.1.0) #> magick 2.7.3 2021-08-18 [1] CRAN (R 4.1.2) #> magrittr 2.0.2 2022-01-26 [1] RSPM (R 4.1.2) #> Matrix 1.4-0 2021-12-08 [4] CRAN (R 4.1.2) #> MatrixGenerics * 1.6.0 2021-10-26 [1] RSPM (R 4.1.2) #> matrixStats * 0.61.0 2021-09-17 [1] RSPM (R 4.1.1) #> R.methodsS3 1.8.1 2020-08-26 [1] RSPM (R 4.1.0) #> R.oo 1.24.0 2020-08-26 [1] RSPM (R 4.1.0) #> R.utils 2.11.0 2021-09-26 [1] RSPM (R 4.1.0) #> Rcpp 1.0.8 2022-01-13 [1] RSPM (R 4.1.2) #> RCurl 1.98-1.6 2022-02-08 [1] RSPM (R 4.1.2) #> reprex 2.0.1 2021-08-05 [1] RSPM (R 4.1.0) #> rhdf5 2.38.0 2021-10-26 [1] RSPM (R 4.1.2) #> rhdf5filters 1.6.0 2021-10-26 [1] RSPM (R 4.1.2) #> Rhdf5lib 1.16.0 2021-10-26 [1] RSPM (R 4.1.2) #> rjson 0.2.21 2022-01-09 [1] RSPM (R 4.1.0) #> rlang 1.0.1 2022-02-03 [1] RSPM (R 4.1.2) #> rmarkdown 2.11 2021-09-14 [1] RSPM (R 4.1.1) #> rstudioapi 0.13 2020-11-12 [1] RSPM (R 4.1.0) #> S4Vectors * 0.32.3 2021-11-21 [1] RSPM (R 4.1.2) #> scuttle 1.4.0 2021-10-26 [1] RSPM (R 4.1.2) #> sessioninfo 1.2.2 2021-12-06 [1] RSPM (R 4.1.2) #> SingleCellExperiment * 1.16.0 2021-10-26 [1] RSPM (R 4.1.2) #> sparseMatrixStats 1.6.0 2021-10-26 [1] RSPM (R 4.1.2) #> SpatialExperiment * 1.4.0 2021-10-26 [1] Bioconductor #> stringi 1.7.6 2021-11-29 [1] RSPM (R 4.1.2) #> stringr 1.4.0 2019-02-10 [1] RSPM (R 4.1.0) #> SummarizedExperiment * 1.24.0 2021-10-26 [1] RSPM (R 4.1.2) #> withr 2.4.3 2021-11-30 [1] RSPM (R 4.1.2) #> xfun 0.29 2021-12-14 [1] RSPM (R 4.1.0) #> XVector 0.34.0 2021-10-26 [1] RSPM (R 4.1.2) #> yaml 2.3.5 2022-02-21 [1] RSPM (R 4.1.0) #> zlibbioc 1.40.0 2021-10-26 [1] RSPM (R 4.1.2) #> #> [1] /home/peter/R/x86_64-pc-linux-gnu-library/4.1 #> [2] /usr/local/lib/R/site-library #> [3] /usr/lib/R/site-library #> [4] /usr/lib/R/library #> #> ────────────────────────────────────────────────────────────────────────────── ```

One other minor thing I noticed with SPE is that it's tab-completion of the $-method is different from SCE (SPE seems to tab-complete over rows whereas SCE tab-completes over columns of colData)

HelenaLC commented 2 years ago

Re $-autocompletion @PeteHaitch I noticed this as well! My only guess is that this is caused by our over-writing the colData replacement method, but I don't understand why this would have such an effect :/ The $ accession still works, but I find it very annoying to have genes pop up versus the available metadata... Not sure how to fix this at the moment.

Re duplicated column names Sorry, I got a bit lost. Do we have a consensus? I think my preference is for using sample_ids as a prefix vs. appending numbers. The only downside of this is if someone renames them (i.e., replaces the sample_ids in the colData). Then again, the same discrepancy would arise with read10xCounts I believe.
Maybe worth discussing as well: Should we always do this, even if there's just one sample? The reasoning would be that if someone were to create two independent SPEs and cbind them, barcodes would again be duplicated. Alternatively, this could be handled by the cbind method, which also assures sample IDs are unique. E.g., if two SPEs both contain sample01 (default), cbind would rename one (and could also relabel the barcodes).

drighelli commented 2 years ago

Thanks @PeteHaitch for a so exhaustive overview of this issue.

In my personal opinion, I like the idea to have a consistency between the read10xVisium and the read10xCounts, in particular thinking about the bigger Bioconductor project where all the objects should behave in a similar way.

About the rowData, we can surely remove the ID column once we set the rowData rownames, it's redundant. But this doesn't mean that if we have additional information, such as Type, we shouldn't keep them...

Also, I don't know if the community has "standards" for the colnames naming, in terms of upper/lowercase, but it would be nice to discuss it.

About the $-autocompletion, I don't know why R-studio acts like that, but when using the R console it works fine to me! Screen Shot 2022-02-23 at 11 24 17

PeteHaitch commented 2 years ago

About the $-autocompletion, I don't know why R-studio acts like that, but when using the R console it works fine to me!

Interesting, I'd meant to check that and I now see that in command-line R it works as you observe, so it's an RStudio-specific thing but would still be good to resolve.

lcolladotor commented 2 years ago

Re @PeteHaitch's comment on `spe` not having `spe$Barcode`

I would benefit from having it listed, despite the duplication in information. So I +1 his request about it. I would then change colnames(spe) to spe$Barcode at https://github.com/LieberInstitute/spatialLIBD/blob/master/R/add_key.R#L45.

`spe$sample_id` vs `spe$Sample`

If you change the spe$sample_id to say spe$Sample, I think that it would be nice that the updateObject() did this since it'd be easy to miss this change. Though like Pete hinted at, maybe it's too late for that one.

Re @HelenaLC duplicated column names

Sounds like you also like using the sample_id instead of a number. Though it would be different from read10xCounts(), but I vote for living with that difference.

I think that you can let people decide how to rename things instead of doing it for them with cbind() though. So I wouldn't necessarily try to do too much with cbind(), though that might be me thinking that it might be going too far to help users who at that point might have changed things in their objects after using read10xVisium(). On the other side, it's like re-using the spe$Barcode (if it's added) and spe$sample_id (or spe$Sample) info which would be ok if users know that will happen and colnames(spe) will be overwritten by cbind(). Hm.....

Regarding the gene info,

I actually decided to overwrite what's included, so I personally don't care. I do so at https://github.com/LieberInstitute/spatialLIBD/blob/master/R/read10xVisiumWrapper.R#L86-L113 where I keep even more info than what read10xVisium() currently keeps. When I mentioned this to Lukas, the sentiment was that such a change was better left of outside SpatialExperiment. Though well, if you wanted to include it, I could do a PR about it. Note that I locate the GTF automatically if there's a web_summary.html file for the first sample https://github.com/LieberInstitute/spatialLIBD/blob/master/R/read10xVisiumWrapper.R#L59-L65.

Regarding auto-completion in RStudio

That'd be nice, but well, I don't know what would need to change or hard it would be.

lmweber commented 2 years ago

Thanks @PeteHaitch for the detailed examples.

I agree that some of these differences vs. the defaults from DropletUtils::read10xCounts() could be improved, although some of them are probably also too locked in by now.

Specifically, I think we could think about changing the following in SpatialExperiment::read10xVisium():

definitely fix the issue with duplicated column names. I think my preference would be to prefix with sample_id. This is slightly different to the SCE default (prefixing with 1 or 2 etc), but it ensures internal consistency with our sample_id, and also ensures that the IDs are in the correct order in datasets with more than 10 samples
include barcode_id as a default column in colData (comparable to Barcode in SCE)
include gene_id as a default column in rowData (comparable to ID in SCE, and then this is also consistent with having the extra barcode_id column in colData). I never really use the other column (Type), although if we are going for consistency and including everything from Space Ranger, then I think it would also make sense to include this one by default.

On the question of whether to prefix colnames with sample_id even in a dataset with only a single sample, I think I lean towards simplicity and only prefixing it by default if there are multiple samples, and let people rebuild their objects (instead of cbinding) if this is a problem later. I'm not completely sure about this though.

PeteHaitch commented 2 years ago

Thanks, @lmweber.

I'd favour using the exact same colnames in colData/rowData as DropletUtils::read10xCounts() where possible.

The Type column is useful when the data contains multiple modalities (e.g., Gene Expression and Antibody Capture) to enable easy use of splitAltExps() to split the count matrix by modality. There's not much multimodal spatial data around yet, but it is coming.

On the question of whether to prefix colnames with sample_id even in a dataset with only a single sample, I think I lean towards simplicity and only prefixing it by default if there are multiple samples

I agree, and it's consistent with what DropletUtils::read10xCounts() does.

drighelli / SpatialExperiment