CyTOF data preprocessing error

Thank you for the nice work flow published on F1000: https://f1000research.com/articles/9-1263/v1

Question 1:

I am able to successfully run the R code upto the code below:

# split SCE by sample
fs <- sce2fcs(sub,
              assay = "compexprs",
              split_by = "bc_id",
              keep_cd = TRUE)

The code above give me an error: Error: Argument 'exprs' must be numeric matrix with colnames attribute set

I checked sub object.

> colnames(sce@assays@data$counts)
NULL

The other two judgement were all TRUE. What's your suggestion to

Question 2: My understanding is the values in compexprs is compensated and normalized expr and if I want to do clustering and tSNE/UMAP, I should use the values compexprs after gating. Please let me know whether my understanding is correct? Thanks.

Here is my session info:

> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default

Random number generation:
 RNG:     Mersenne-Twister 
 Normal:  Inversion 
 Sample:  Rounding 

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] patchwork_1.0.1             reshape2_1.4.4              ggcyto_1.16.0               ncdfFlow_2.34.0            
 [5] BH_1.72.0-3                 RcppArmadillo_0.10.1.0.0    ggplot2_3.3.2               openCyto_2.0.0             
 [9] mvtnorm_1.1-1               flowWorkspace_4.0.6         flowCore_2.0.0              dplyr_1.0.2                
[13] CATALYST_1.12.2             SingleCellExperiment_1.10.1 SummarizedExperiment_1.18.1 DelayedArray_0.14.0        
[17] matrixStats_0.57.0          Biobase_2.48.0              GenomicRanges_1.40.0        GenomeInfoDb_1.24.0        
[21] IRanges_2.22.1              S4Vectors_0.26.0            BiocGenerics_0.34.0        

loaded via a namespace (and not attached):
  [1] readxl_1.3.1                circlize_0.4.10             drc_3.0-1                   plyr_1.8.6                 
  [5] igraph_1.2.6                ConsensusClusterPlus_1.52.0 splines_4.0.3               BiocParallel_1.22.0        
  [9] fda_5.1.5.1                 scater_1.16.0               TH.data_1.0-10              digest_0.6.26              
 [13] viridis_0.5.1               magrittr_1.5                CytoML_2.0.0                cluster_2.1.0              
 [17] ks_1.11.7                   openxlsx_4.2.2              ComplexHeatmap_2.4.2        RcppParallel_5.0.2         
 [21] R.utils_2.10.1              sandwich_3.0-0              cytolib_2.0.2               jpeg_0.1-8.1               
 [25] colorspace_1.4-1            rrcov_1.5-5                 ggrepel_0.8.2               haven_2.3.1                
 [29] xfun_0.18                   crayon_1.3.4                RCurl_1.98-1.2              jsonlite_1.7.1             
 [33] hexbin_1.28.1               graph_1.66.0                survival_3.2-7              zoo_1.8-8                  
 [37] glue_1.4.2                  flowClust_3.26.0            gtable_0.3.0                nnls_1.4                   
 [41] zlibbioc_1.34.0             XVector_0.28.0              GetoptLong_1.0.4            car_3.0-10                 
 [45] BiocSingular_1.4.0          IDPmisc_1.1.20              Rgraphviz_2.32.0            shape_1.4.5                
 [49] DEoptimR_1.0-8              abind_1.4-5                 scales_1.1.1                Rcpp_1.0.4.6               
 [53] plotrix_3.7-8               viridisLite_0.3.0           clue_0.3-57                 foreign_0.8-80             
 [57] rsvd_1.0.3                  mclust_5.4.6                FlowSOM_1.20.0              tsne_0.1-3                 
 [61] RColorBrewer_1.1-2          ellipsis_0.3.1              farver_2.0.3                pkgconfig_2.0.3            
 [65] XML_3.99-0.3                R.methodsS3_1.8.1           flowViz_1.52.0              labeling_0.4.2             
 [69] flowStats_4.0.0             tidyselect_1.1.0            rlang_0.4.7                 munsell_0.5.0              
 [73] cellranger_1.1.0            tools_4.0.3                 generics_0.0.2              ggridges_0.5.2             
 [77] stringr_1.4.0               yaml_2.2.1                  zip_2.1.1                   robustbase_0.93-6          
 [81] purrr_0.3.4                 RBGL_1.64.0                 R.oo_1.24.0                 compiler_4.0.3             
 [85] rstudioapi_0.11             beeswarm_0.2.3              curl_4.3                    png_0.1-7                  
 [89] tibble_3.0.4                pcaPP_1.9-73                stringi_1.5.3               forcats_0.5.0              
 [93] lattice_0.20-41             Matrix_1.2-18               vctrs_0.3.4                 pillar_1.4.6               
 [97] lifecycle_0.2.0             BiocManager_1.30.10         GlobalOptions_0.1.2         BiocNeighbors_1.6.0        
[101] data.table_1.13.0           cowplot_1.1.0               bitops_1.0-6                irlba_2.3.3                
[105] corpcor_1.6.9               R6_2.4.1                    latticeExtra_0.6-29         KernSmooth_2.23-17         
[109] gridExtra_2.3               RProtoBufLib_2.0.0          rio_0.5.16                  vipor_0.4.5                
[113] codetools_0.2-17            MASS_7.3-53                 gtools_3.8.2                rjson_0.2.20               
[117] withr_2.3.0                 mnormt_1.5-7                multcomp_1.4-14             GenomeInfoDbData_1.2.3     
[121] hms_0.5.3                   grid_4.0.3                  DelayedMatrixStats_1.10.0   carData_3.0-4              
[125] Rtsne_0.15                  base64enc_0.1-3             ellipse_0.4.2               tinytex_0.26               
[129] ggbeeswarm_0.6.0

Thank you for your response. This is the print output:

> print(sce)
class: SingleCellExperiment 
dim: 63 337525 
metadata(4): experiment_info bc_key sep_cutoffs mhl_cutoff
assays(7): counts exprs ... compcounts compexprs
rownames(63): 75As CD15 ... 208Pb CD45
rowData names(6): channel_name marker_name ... bead_ch is_bc
colnames: NULL
colData names(6): file_id remove ... delta mhl_dist
reducedDimNames(0):
altExpNames(0):
> print(sub)
class: SingleCellExperiment 
dim: 3 337525 
metadata(4): experiment_info bc_key sep_cutoffs mhl_cutoff
assays(7): counts exprs ... compcounts compexprs
rownames(3): DNA1 DNA2 dead
rowData names(6): channel_name marker_name ... bead_ch is_bc
colnames: NULL
colData names(2): bc_id i
reducedDimNames(0):
altExpNames(0):

I also pasted here my code. Basically the code is from your paper:

#' ## Load library
#' 

library(CATALYST)
library(dplyr)
library(flowCore)
library(flowWorkspace)

library(mvtnorm)
library(openCyto)
library(ggcyto)
library(ggplot2)

qc_theme <- list(
  theme_bw(base_size = 8), theme(
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank(),
    plot.title = element_text(face = "bold"),
    axis.text = element_text(color = "black"),
    axis.text.x = element_text(angle = 45, hjust = 1)))

#' ## Read `FCS` files
#' 
#' A SCE can be constructed using CATALYST’s `prepData()` function, which accepts 
#' a path to a directory with one or many FCS files, a character vector of FCS 
#' filenames, a single or list of flowFrame(s), or a
#'  https://bioconductor.org/packages/3.11/bioc/html/flowCore.html (flowCore package24). 
#'  By default (transform = TRUE), an arcsinh-transformation with cofactor = 5 is applied to 
#'  the input (count) data, and the resulting expression matrix is stored in the exprs assay 
#'  slot of the output SCE:

# construct ’SingleCellExperiment’
data_path = "/depot/bioinf/data/Projects/CyTOF/test_data/"
setwd("/depot/bioinf/data/Projects/CyTOF/test_data/")
fcs <- list.files(data_path, "acquisition", full.names = TRUE)
fcs
(sce <- prepData(fcs, transform = TRUE, cofactor = 5))

## The cofactor used for transformation 
(int_metadata(sce)$cofactor)

## dimmencial of counts
class(sce@assays@data$counts)
dim(sce@assays@data$counts)

## cofactor-5 arcsinh-transformed counts 
class(sce@assays@data$exprs)
dim(sce@assays@data$exprs)

## How many cells in each file
head(colData(sce))
table(colData(sce))
#cell_num <- data.frame(
#  file_id = levels(sce$file_id),
#  n_cells = tabulate(sce$file_id))
#cell_num

## we rename the cell metadata variable sample_id to file_id to avoid ambiguity:
i <- match("sample_id", names(colData(sce)))
names(colData(sce))[i] <- "file_id"

# store character vector or channels names
chs <- channels(sce)
# store DNA & live channel indices
dna <- grep("^Ir", chs)
live <- grep("191|194", chs)

#' ## Filtering for stable signal
# plot channels of interest vs. time
coi <- chs[c(dna[1], which(rowData(sce)$use_channel))]
coi
plotScatter(sce, chs = c("Time", coi), label = "both") +
  labs(y = "expression") +
  scale_x_continuous(
    expression("Time ("*10^6~"ms)"),
    labels = function(u) u/1e6) +
  theme_bw(base_size = 8) + theme(
    aspect.ratio = 2/3,
    panel.grid = element_blank(),
    axis.text = element_text(color = "black"),
    strip.background = element_rect(fill = NA))

#' ## Data filtering

## Let's assume we want to remove 0 to 1 10^6 ms
# construct ’GatingSet’
ff <- sce2fcs(sce[dna, ], assay = "exprs")
gs <- GatingSet(flowSet(ff))

# apply rectangular gate to exclude unstable signal
min_t <- 0
max_t <- 1000000
gs_add_gating_method(
  gs, alias = "stable",
  pop = "-", parent = "root",
  dims = paste0("Time,", chs[dna[1]]),
  gating_method = "boundary",
  gating_args = sprintf("min=c(%s,0),max=c(%s,10)", min_t, max_t))

# plot scatter of DNA vs. Time
ggcyto(gs,
       aes_string("Time", chs[dna[1]])) +
  geom_hex(bins = 128) +
  geom_gate("stable") +
  facet_null() + theme_bw() +
  ggtitle(NULL) + theme(
    legend.position = "none",
    panel.grid = element_blank())

# subset selected events
sce_fil <- sce[, gh_pop_get_indices(gs, "stable")]
plotScatter(sce_fil, chs = c("Time", coi), label = "both") +
  labs(y = "expression") +
  scale_x_continuous(
    expression("Time ("*10^6~"ms)"),
    labels = function(u) u/1e6) +
  theme_bw(base_size = 8) + theme(
    aspect.ratio = 2/3,
    panel.grid = element_blank(),
    axis.text = element_text(color = "black"),
    strip.background = element_rect(fill = NA))

#' ## Normalization

# specify path to reference beads
ref_beads <- file.path(data_path, "normalization_beads.fcs")
# apply bead-based normalization
system.time(res <- normCytof(sce,
                             beads = "dvs", norm_to = ref_beads,
                             remove_beads = TRUE, overwrite = FALSE))

#' When remove_beads = TRUE (the default), normCytof() will return a list of three SCEs containing filtered,
#'  `bead` and `remove` events, respectively, as well as two ggplot objects:
names(res)
table(res$data$is_bead)
table(res$data$remove)
table(res$data$file_id)

# view no. of remaining, bead & removed events
ns <- sapply(res[1:3], ncol)
ps <- sprintf("%1.2f", ns/ncol(sce)*100)
stat_cell_num <- data.frame(t(cbind("# events" = ns, "% of total" = ps)))
stat_cell_num

#' As a first quality control plot, `res$scatter` renders scatter plots of bead channels (x-axis) 
#' versus DNA (y-axis), where events identified as beads as well as their expression range are 
#' highlighted in color; bead events should have low DNA intensity (since they are not cells) and 
#' high intensities across all bead channels.

res$scatter

res$lines

# compute mean bead channel counts for current run
(rowData(res$beads))
is_bead <- rowData(res$beads)$bead_ch   # get bead channels
bead_cs <- counts(res$beads)[is_bead, ] # subset counts
rownames(bead_cs) <- chs[is_bead]       # use channels as names
(bead_ms <- rowMeans(bead_cs))           # compute means

# read in reference mean bead channel counts
ref <- read.csv(file.path(data_path, "ref_bead_counts.csv"))

# join into single tidy data.frame
df <- bind_rows(ref, bead_ms, .id = "group")
library(reshape2)
df <- melt(df, id.var = "group")

# boxplot of reference vs. current run’s mean bead channel counts
ggplot(df, aes(variable, value)) +
  geom_boxplot(data = df[df$group == 1, ]) +
  geom_point(data = df[df$group == 2, ],
             col = "red", pch = 4, stroke = 1) +
  labs(x = "bead channel", y = "mean count") +
  qc_theme + ggtitle(
    "QC on bead channel counts",
    "[-] = reference | x = current run")

#' After normalization, we overwrite the input dataset with the filtered subset 
#' that no longer includes bead events, or bead-bead and bead-cell doublets:
sce <- res$data

#' ## Debarcoding

# read in debarcoding scheme
fn <- file.path(data_path, "debarcoding_scheme.csv")
bc_key <- read.csv(fn, row.names = 1, check.names = FALSE)

# all barcodes are positive for exactly 3 barcoding channels
all(rowSums(bc_key) == 3)

# remove empty barcodes from debarcoding scheme
is_empty <- grepl("empty", rownames(bc_key))
bc_key <- bc_key[!is_empty, ]
bc_ids <- rownames(bc_key)

# do preliminary barcode assignments
system.time(sce <- assignPrelim(sce, bc_key, assay = "normexprs"))

# view barcode channels
channels(sce)[rowData(sce)$is_bc]
table(rowData(sce)$is_bc)

# tabulate number of (un)assigned events
tab_bc_id = table(sce$bc_id == 0)
tab_bc_id[2]/(tab_bc_id[1]+tab_bc_id[2])

#' ## Estimation of separation cutoffs

plotYields(sce, which = "0")

sce <- estCutoffs(sce)
metadata(sce)$sep_cutoffs

plotYields(sce, which = "PBMC_R1")

# store preliminary barcode IDs
bc_ids0 <- sce$bc_id

# apply global & sample-specific separation cutoff(s)
sce_glob <- applyCutoffs(sce, sep_cutoffs = 0.15, mhl_cutoff = 30)
sce_spec <- applyCutoffs(sce, mhl_cutoff = 30)

# compare cell yields for both cutoff strategies
c(global = mean(sce_glob$bc_id == 0),
  specific = mean(sce_spec$bc_id == 0))

# proceed with sample-specific filtering
sce <- sce_spec

# compute number of events per population
# before vs. after applying separation cutoffs
barplot(rbind(table(bc_ids0), table(sce$bc_id)),
        beside = TRUE, ylab = "cell count",
        las = 2, cex.axis = 0.5, cex.names = 0.5)
legend("topright", fill = c("black", "grey"),
       legend = c("before filtering", "after filtering"))

# event plots for unassigned events
# & barcode population D1
plotEvents(sce, which = c(0, "Tumor_S1", "Tumor_S2"), n = 25)

# plotMahal(sce, which = "0")

#' ## Compensation

# read in pre-computed spillover matrix
sm <- file.path(data_path, "spillover_matrix.csv")
sm <- read.csv(sm, row.names = 1)
# apply NNLS compensation
system.time(
  sce <- compCytof(sce, sm, method = "nnls",
                   assay = "normcounts", overwrite = FALSE))

#' To visually inspect how compensation affects signal intensities, we can generate scatter plots 
#' pre- and post-compensation; an exemplary pair of channels is shown in the figure. In such a plot, 
#' we can observe a slight positive association between the signals of spill-affected channels, 
#' which should be removed upon compensation.

i <- grep("173|174", chs, value = TRUE)
p1 <- plotScatter(sce,
                  chs = i,
                  label = "channel",
                  assay = "normexprs") +
  ggtitle("Uncompensated")
p2 <- plotScatter(sce,
                  chs = i,
                  label = "channel",
                  assay = "compexprs") +
  ggtitle("Compensated") +
  ylab(NULL)
library(patchwork)
wrap_plots(p1, p2)

#' ## Gating

#' 
.scatter <- function(gs, chs, gate_id = NULL,
                     subset = ifelse(is.null(gate_id), "root", "_parent_")) {
  p <- ggcyto(gs, max_nrow_to_plot = 1e5,
              aes_string(chs[1], chs[2]), subset) +
    geom_hex(bins = 100) + facet_wrap(~ name, ncol = 5) +
    (if (is.null(gate_id)) list() else geom_gate(gate_id)) +
    ggtitle(NULL) + theme_bw(base_size = 8) + theme(
      aspect.ratio = 1,
      legend.position = "none",
      panel.grid.minor = element_blank(),
      strip.background = element_rect(fill = NA),
      axis.text = element_text(color = "black"),
      axis.text.x = element_text(angle = 45, hjust = 1))
  suppressMessages(p + coord_equal(expand = FALSE,
                                   xlim = c(-1, 11), ylim = c(-1, 11)))
}

#' ### Gating on cells

# subset DNA & live channels
sub <- sce[union(dna, live), ]

# add metadata variable ’i’ to track cell indices
colData(sub) <- DataFrame(
  bc_id = sub$bc_id,
  i = seq_len(ncol(sce)))

## split SCE by sample
fs <- sce2fcs(sub,
              assay = "compexprs",
              split_by = "bc_id",
              keep_cd = TRUE)

HelenaLC / CATALYST

CyTOF data preprocessing error #155