HelenaLC / CATALYST

Cytometry dATa anALYsis Tools
64 stars 30 forks source link

CyTOF data preprocessing error #155

Open xie186 opened 3 years ago

xie186 commented 3 years ago

Thank you for the nice work flow published on F1000: https://f1000research.com/articles/9-1263/v1

Question 1:

I am able to successfully run the R code upto the code below:

# split SCE by sample
fs <- sce2fcs(sub,
              assay = "compexprs",
              split_by = "bc_id",
              keep_cd = TRUE)

The code above give me an error: Error: Argument 'exprs' must be numeric matrix with colnames attribute set

I checked sub object.

> colnames(sce@assays@data$counts)
NULL

The other two judgement were all TRUE. What's your suggestion to

Question 2: My understanding is the values in compexprs is compensated and normalized expr and if I want to do clustering and tSNE/UMAP, I should use the values compexprs after gating. Please let me know whether my understanding is correct? Thanks.

Here is my session info:

> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default

Random number generation:
 RNG:     Mersenne-Twister 
 Normal:  Inversion 
 Sample:  Rounding 

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] patchwork_1.0.1             reshape2_1.4.4              ggcyto_1.16.0               ncdfFlow_2.34.0            
 [5] BH_1.72.0-3                 RcppArmadillo_0.10.1.0.0    ggplot2_3.3.2               openCyto_2.0.0             
 [9] mvtnorm_1.1-1               flowWorkspace_4.0.6         flowCore_2.0.0              dplyr_1.0.2                
[13] CATALYST_1.12.2             SingleCellExperiment_1.10.1 SummarizedExperiment_1.18.1 DelayedArray_0.14.0        
[17] matrixStats_0.57.0          Biobase_2.48.0              GenomicRanges_1.40.0        GenomeInfoDb_1.24.0        
[21] IRanges_2.22.1              S4Vectors_0.26.0            BiocGenerics_0.34.0        

loaded via a namespace (and not attached):
  [1] readxl_1.3.1                circlize_0.4.10             drc_3.0-1                   plyr_1.8.6                 
  [5] igraph_1.2.6                ConsensusClusterPlus_1.52.0 splines_4.0.3               BiocParallel_1.22.0        
  [9] fda_5.1.5.1                 scater_1.16.0               TH.data_1.0-10              digest_0.6.26              
 [13] viridis_0.5.1               magrittr_1.5                CytoML_2.0.0                cluster_2.1.0              
 [17] ks_1.11.7                   openxlsx_4.2.2              ComplexHeatmap_2.4.2        RcppParallel_5.0.2         
 [21] R.utils_2.10.1              sandwich_3.0-0              cytolib_2.0.2               jpeg_0.1-8.1               
 [25] colorspace_1.4-1            rrcov_1.5-5                 ggrepel_0.8.2               haven_2.3.1                
 [29] xfun_0.18                   crayon_1.3.4                RCurl_1.98-1.2              jsonlite_1.7.1             
 [33] hexbin_1.28.1               graph_1.66.0                survival_3.2-7              zoo_1.8-8                  
 [37] glue_1.4.2                  flowClust_3.26.0            gtable_0.3.0                nnls_1.4                   
 [41] zlibbioc_1.34.0             XVector_0.28.0              GetoptLong_1.0.4            car_3.0-10                 
 [45] BiocSingular_1.4.0          IDPmisc_1.1.20              Rgraphviz_2.32.0            shape_1.4.5                
 [49] DEoptimR_1.0-8              abind_1.4-5                 scales_1.1.1                Rcpp_1.0.4.6               
 [53] plotrix_3.7-8               viridisLite_0.3.0           clue_0.3-57                 foreign_0.8-80             
 [57] rsvd_1.0.3                  mclust_5.4.6                FlowSOM_1.20.0              tsne_0.1-3                 
 [61] RColorBrewer_1.1-2          ellipsis_0.3.1              farver_2.0.3                pkgconfig_2.0.3            
 [65] XML_3.99-0.3                R.methodsS3_1.8.1           flowViz_1.52.0              labeling_0.4.2             
 [69] flowStats_4.0.0             tidyselect_1.1.0            rlang_0.4.7                 munsell_0.5.0              
 [73] cellranger_1.1.0            tools_4.0.3                 generics_0.0.2              ggridges_0.5.2             
 [77] stringr_1.4.0               yaml_2.2.1                  zip_2.1.1                   robustbase_0.93-6          
 [81] purrr_0.3.4                 RBGL_1.64.0                 R.oo_1.24.0                 compiler_4.0.3             
 [85] rstudioapi_0.11             beeswarm_0.2.3              curl_4.3                    png_0.1-7                  
 [89] tibble_3.0.4                pcaPP_1.9-73                stringi_1.5.3               forcats_0.5.0              
 [93] lattice_0.20-41             Matrix_1.2-18               vctrs_0.3.4                 pillar_1.4.6               
 [97] lifecycle_0.2.0             BiocManager_1.30.10         GlobalOptions_0.1.2         BiocNeighbors_1.6.0        
[101] data.table_1.13.0           cowplot_1.1.0               bitops_1.0-6                irlba_2.3.3                
[105] corpcor_1.6.9               R6_2.4.1                    latticeExtra_0.6-29         KernSmooth_2.23-17         
[109] gridExtra_2.3               RProtoBufLib_2.0.0          rio_0.5.16                  vipor_0.4.5                
[113] codetools_0.2-17            MASS_7.3-53                 gtools_3.8.2                rjson_0.2.20               
[117] withr_2.3.0                 mnormt_1.5-7                multcomp_1.4-14             GenomeInfoDbData_1.2.3     
[121] hms_0.5.3                   grid_4.0.3                  DelayedMatrixStats_1.10.0   carData_3.0-4              
[125] Rtsne_0.15                  base64enc_0.1-3             ellipse_0.4.2               tinytex_0.26               
[129] ggbeeswarm_0.6.0 
HelenaLC commented 3 years ago

Thank you for the kind words!

  1. You said you checked the sub object, but your line above has sce in it? Could you maybe print bout sce and sub here so I can have a look what these looks like?

  2. Yes, compexprs contain compensated normalized expression values - assuming that you ran compCytof(..., assay = "normcounts") (i.e., running the compensation on the normalized data, not the default assay = "counts").

xie186 commented 3 years ago

Thank you for your response. This is the print output:

> print(sce)
class: SingleCellExperiment 
dim: 63 337525 
metadata(4): experiment_info bc_key sep_cutoffs mhl_cutoff
assays(7): counts exprs ... compcounts compexprs
rownames(63): 75As CD15 ... 208Pb CD45
rowData names(6): channel_name marker_name ... bead_ch is_bc
colnames: NULL
colData names(6): file_id remove ... delta mhl_dist
reducedDimNames(0):
altExpNames(0):
> print(sub)
class: SingleCellExperiment 
dim: 3 337525 
metadata(4): experiment_info bc_key sep_cutoffs mhl_cutoff
assays(7): counts exprs ... compcounts compexprs
rownames(3): DNA1 DNA2 dead
rowData names(6): channel_name marker_name ... bead_ch is_bc
colnames: NULL
colData names(2): bc_id i
reducedDimNames(0):
altExpNames(0):

I also pasted here my code. Basically the code is from your paper:

#' ## Load library
#' 

library(CATALYST)
library(dplyr)
library(flowCore)
library(flowWorkspace)

library(mvtnorm)
library(openCyto)
library(ggcyto)
library(ggplot2)

qc_theme <- list(
  theme_bw(base_size = 8), theme(
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank(),
    plot.title = element_text(face = "bold"),
    axis.text = element_text(color = "black"),
    axis.text.x = element_text(angle = 45, hjust = 1)))

#' ## Read `FCS` files
#' 
#' A SCE can be constructed using CATALYST’s `prepData()` function, which accepts 
#' a path to a directory with one or many FCS files, a character vector of FCS 
#' filenames, a single or list of flowFrame(s), or a
#'  https://bioconductor.org/packages/3.11/bioc/html/flowCore.html (flowCore package24). 
#'  By default (transform = TRUE), an arcsinh-transformation with cofactor = 5 is applied to 
#'  the input (count) data, and the resulting expression matrix is stored in the exprs assay 
#'  slot of the output SCE:

# construct ’SingleCellExperiment’
data_path = "/depot/bioinf/data/Projects/CyTOF/test_data/"
setwd("/depot/bioinf/data/Projects/CyTOF/test_data/")
fcs <- list.files(data_path, "acquisition", full.names = TRUE)
fcs
(sce <- prepData(fcs, transform = TRUE, cofactor = 5))

## The cofactor used for transformation 
(int_metadata(sce)$cofactor)

## dimmencial of counts
class(sce@assays@data$counts)
dim(sce@assays@data$counts)

## cofactor-5 arcsinh-transformed counts 
class(sce@assays@data$exprs)
dim(sce@assays@data$exprs)

## How many cells in each file
head(colData(sce))
table(colData(sce))
#cell_num <- data.frame(
#  file_id = levels(sce$file_id),
#  n_cells = tabulate(sce$file_id))
#cell_num

## we rename the cell metadata variable sample_id to file_id to avoid ambiguity:
i <- match("sample_id", names(colData(sce)))
names(colData(sce))[i] <- "file_id"

# store character vector or channels names
chs <- channels(sce)
# store DNA & live channel indices
dna <- grep("^Ir", chs)
live <- grep("191|194", chs)

#' ## Filtering for stable signal
# plot channels of interest vs. time
coi <- chs[c(dna[1], which(rowData(sce)$use_channel))]
coi
plotScatter(sce, chs = c("Time", coi), label = "both") +
  labs(y = "expression") +
  scale_x_continuous(
    expression("Time ("*10^6~"ms)"),
    labels = function(u) u/1e6) +
  theme_bw(base_size = 8) + theme(
    aspect.ratio = 2/3,
    panel.grid = element_blank(),
    axis.text = element_text(color = "black"),
    strip.background = element_rect(fill = NA))

#' ## Data filtering

## Let's assume we want to remove 0 to 1 10^6 ms
# construct ’GatingSet’
ff <- sce2fcs(sce[dna, ], assay = "exprs")
gs <- GatingSet(flowSet(ff))

# apply rectangular gate to exclude unstable signal
min_t <- 0
max_t <- 1000000
gs_add_gating_method(
  gs, alias = "stable",
  pop = "-", parent = "root",
  dims = paste0("Time,", chs[dna[1]]),
  gating_method = "boundary",
  gating_args = sprintf("min=c(%s,0),max=c(%s,10)", min_t, max_t))

# plot scatter of DNA vs. Time
ggcyto(gs,
       aes_string("Time", chs[dna[1]])) +
  geom_hex(bins = 128) +
  geom_gate("stable") +
  facet_null() + theme_bw() +
  ggtitle(NULL) + theme(
    legend.position = "none",
    panel.grid = element_blank())

# subset selected events
sce_fil <- sce[, gh_pop_get_indices(gs, "stable")]
plotScatter(sce_fil, chs = c("Time", coi), label = "both") +
  labs(y = "expression") +
  scale_x_continuous(
    expression("Time ("*10^6~"ms)"),
    labels = function(u) u/1e6) +
  theme_bw(base_size = 8) + theme(
    aspect.ratio = 2/3,
    panel.grid = element_blank(),
    axis.text = element_text(color = "black"),
    strip.background = element_rect(fill = NA))

#' ## Normalization

# specify path to reference beads
ref_beads <- file.path(data_path, "normalization_beads.fcs")
# apply bead-based normalization
system.time(res <- normCytof(sce,
                             beads = "dvs", norm_to = ref_beads,
                             remove_beads = TRUE, overwrite = FALSE))

#' When remove_beads = TRUE (the default), normCytof() will return a list of three SCEs containing filtered,
#'  `bead` and `remove` events, respectively, as well as two ggplot objects:
names(res)
table(res$data$is_bead)
table(res$data$remove)
table(res$data$file_id)

# view no. of remaining, bead & removed events
ns <- sapply(res[1:3], ncol)
ps <- sprintf("%1.2f", ns/ncol(sce)*100)
stat_cell_num <- data.frame(t(cbind("# events" = ns, "% of total" = ps)))
stat_cell_num

#' As a first quality control plot, `res$scatter` renders scatter plots of bead channels (x-axis) 
#' versus DNA (y-axis), where events identified as beads as well as their expression range are 
#' highlighted in color; bead events should have low DNA intensity (since they are not cells) and 
#' high intensities across all bead channels.

res$scatter

res$lines

# compute mean bead channel counts for current run
(rowData(res$beads))
is_bead <- rowData(res$beads)$bead_ch   # get bead channels
bead_cs <- counts(res$beads)[is_bead, ] # subset counts
rownames(bead_cs) <- chs[is_bead]       # use channels as names
(bead_ms <- rowMeans(bead_cs))           # compute means

# read in reference mean bead channel counts
ref <- read.csv(file.path(data_path, "ref_bead_counts.csv"))

# join into single tidy data.frame
df <- bind_rows(ref, bead_ms, .id = "group")
library(reshape2)
df <- melt(df, id.var = "group")

# boxplot of reference vs. current run’s mean bead channel counts
ggplot(df, aes(variable, value)) +
  geom_boxplot(data = df[df$group == 1, ]) +
  geom_point(data = df[df$group == 2, ],
             col = "red", pch = 4, stroke = 1) +
  labs(x = "bead channel", y = "mean count") +
  qc_theme + ggtitle(
    "QC on bead channel counts",
    "[-] = reference | x = current run")

#' After normalization, we overwrite the input dataset with the filtered subset 
#' that no longer includes bead events, or bead-bead and bead-cell doublets:
sce <- res$data

#' ## Debarcoding

# read in debarcoding scheme
fn <- file.path(data_path, "debarcoding_scheme.csv")
bc_key <- read.csv(fn, row.names = 1, check.names = FALSE)

# all barcodes are positive for exactly 3 barcoding channels
all(rowSums(bc_key) == 3)

# remove empty barcodes from debarcoding scheme
is_empty <- grepl("empty", rownames(bc_key))
bc_key <- bc_key[!is_empty, ]
bc_ids <- rownames(bc_key)

# do preliminary barcode assignments
system.time(sce <- assignPrelim(sce, bc_key, assay = "normexprs"))

# view barcode channels
channels(sce)[rowData(sce)$is_bc]
table(rowData(sce)$is_bc)

# tabulate number of (un)assigned events
tab_bc_id = table(sce$bc_id == 0)
tab_bc_id[2]/(tab_bc_id[1]+tab_bc_id[2])

#' ## Estimation of separation cutoffs

plotYields(sce, which = "0")

sce <- estCutoffs(sce)
metadata(sce)$sep_cutoffs

plotYields(sce, which = "PBMC_R1")

# store preliminary barcode IDs
bc_ids0 <- sce$bc_id

# apply global & sample-specific separation cutoff(s)
sce_glob <- applyCutoffs(sce, sep_cutoffs = 0.15, mhl_cutoff = 30)
sce_spec <- applyCutoffs(sce, mhl_cutoff = 30)

# compare cell yields for both cutoff strategies
c(global = mean(sce_glob$bc_id == 0),
  specific = mean(sce_spec$bc_id == 0))

# proceed with sample-specific filtering
sce <- sce_spec

# compute number of events per population
# before vs. after applying separation cutoffs
barplot(rbind(table(bc_ids0), table(sce$bc_id)),
        beside = TRUE, ylab = "cell count",
        las = 2, cex.axis = 0.5, cex.names = 0.5)
legend("topright", fill = c("black", "grey"),
       legend = c("before filtering", "after filtering"))

# event plots for unassigned events
# & barcode population D1
plotEvents(sce, which = c(0, "Tumor_S1", "Tumor_S2"), n = 25)

# plotMahal(sce, which = "0")

#' ## Compensation

# read in pre-computed spillover matrix
sm <- file.path(data_path, "spillover_matrix.csv")
sm <- read.csv(sm, row.names = 1)
# apply NNLS compensation
system.time(
  sce <- compCytof(sce, sm, method = "nnls",
                   assay = "normcounts", overwrite = FALSE))

#' To visually inspect how compensation affects signal intensities, we can generate scatter plots 
#' pre- and post-compensation; an exemplary pair of channels is shown in the figure. In such a plot, 
#' we can observe a slight positive association between the signals of spill-affected channels, 
#' which should be removed upon compensation.

i <- grep("173|174", chs, value = TRUE)
p1 <- plotScatter(sce,
                  chs = i,
                  label = "channel",
                  assay = "normexprs") +
  ggtitle("Uncompensated")
p2 <- plotScatter(sce,
                  chs = i,
                  label = "channel",
                  assay = "compexprs") +
  ggtitle("Compensated") +
  ylab(NULL)
library(patchwork)
wrap_plots(p1, p2)

#' ## Gating

#' 
.scatter <- function(gs, chs, gate_id = NULL,
                     subset = ifelse(is.null(gate_id), "root", "_parent_")) {
  p <- ggcyto(gs, max_nrow_to_plot = 1e5,
              aes_string(chs[1], chs[2]), subset) +
    geom_hex(bins = 100) + facet_wrap(~ name, ncol = 5) +
    (if (is.null(gate_id)) list() else geom_gate(gate_id)) +
    ggtitle(NULL) + theme_bw(base_size = 8) + theme(
      aspect.ratio = 1,
      legend.position = "none",
      panel.grid.minor = element_blank(),
      strip.background = element_rect(fill = NA),
      axis.text = element_text(color = "black"),
      axis.text.x = element_text(angle = 45, hjust = 1))
  suppressMessages(p + coord_equal(expand = FALSE,
                                   xlim = c(-1, 11), ylim = c(-1, 11)))
}

#' ### Gating on cells

# subset DNA & live channels
sub <- sce[union(dna, live), ]

# add metadata variable ’i’ to track cell indices
colData(sub) <- DataFrame(
  bc_id = sub$bc_id,
  i = seq_len(ncol(sce)))

## split SCE by sample
fs <- sce2fcs(sub,
              assay = "compexprs",
              split_by = "bc_id",
              keep_cd = TRUE)