Subset Cells with Gene Label

daacarri commented 5 years ago

Hello,

I'm having difficulty making use of the addGeneLabel functionality.

I was wondering if there is a way to subset cells based on a gene "signature", using the addGeneLabel and subset functions?

For example, finding out which cells express a certain proportion of a gene list (signature) and then subsetting those cells or adding them as a condition in colInfo?

For instance, if I had a gene set (signature) for "stressed", I would want to figure out how many cells possess a certain percentage of that signature (say, cells that possess some % of the signature are what I decide is "stressed"). I would then label those cells as "Stressed" and the other cells as "Not-Stressed" as a condition, so that I can subset them or do further analysis.

asenabouth commented 5 years ago

Hi @daacarri, apologies for the delayed response. It is possible to do this; here is how I would go about it.

# List of BRN-3 transcription factors
gene_markers <- c("POU4F1", "POU4F2", "POU4F3")

# Check the markers are present in this expression matrix (one TRUE/FALSE per gene)
gene_markers %in% rownames(em_set)

# Loop to add gene marker as a separate column in colInfo
for (gene in gene_markers){
  em_set <- addGeneLabel(em_set, gene = gene)
}

# Check information has been added correctly to the EMSet
colInfo(em_set)

This should give you something like this:

                         cell_barcode     batch    POU4F1    POU4F2    POU4F3
                          <character> <numeric> <logical> <logical> <logical>
AAACCTGAGCTGTTCA-1 AAACCTGAGCTGTTCA-1         1     FALSE     FALSE     FALSE
AAACCTGCAATTCCTT-1 AAACCTGCAATTCCTT-1         1     FALSE     FALSE     FALSE
AAACCTGGTCTACCTC-1 AAACCTGGTCTACCTC-1         1     FALSE     FALSE     FALSE
AAACCTGTCGGAGCAA-1 AAACCTGTCGGAGCAA-1         1     FALSE     FALSE     FALSE

So you can then use this to figure out the percentage of markers each cell is expressing, identify stressed cells and subset them.

# Extract col_info from EMSet
col_info <- colInfo(em_set)

# One-liner to compute the fraction of markers that are TRUE in each cell (0 to 1)
marker_pct <- rowSums(as.data.frame(col_info[, gene_markers]))/length(gene_markers)

# Convert to DataFrame so we can add this information back into col_info
col_info <- cbind(col_info, S4Vectors::DataFrame(marker_pct))

# Create a new column in col_info to mark if the cell is stressed or not
col_info$condition <- "Not stressed"

# Label cells that express at least 33% of the markers as "Stressed".
# With the three markers in this example, that means at least one marker.
col_info$condition[which(col_info$marker_pct >= 0.33)] <- "Stressed"

# You can review the number of stressed and not stressed cells with table()
table(col_info$condition)

# Update colInfo slot with col_info dataframe
colInfo(em_set) <- col_info

# Then subset as normal
stressed_cells <- subsetCondition(em_set, by = "condition", conditions = list(condition = "Stressed"))
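
The labelling logic above can also be tried without an EMSet. Below is a minimal sketch in base R, assuming a toy data frame stands in for the logical marker columns of colInfo; the barcodes and TRUE/FALSE values are made up for illustration only.

```r
# Toy stand-in for the logical marker columns in colInfo (made-up values)
gene_markers <- c("POU4F1", "POU4F2", "POU4F3")
col_info <- data.frame(
  cell_barcode = c("cell_1", "cell_2", "cell_3", "cell_4"),
  POU4F1 = c(TRUE, FALSE, TRUE, FALSE),
  POU4F2 = c(FALSE, FALSE, TRUE, FALSE),
  POU4F3 = c(FALSE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Fraction of signature markers detected in each cell
col_info$marker_pct <- rowSums(col_info[, gene_markers]) / length(gene_markers)

# Label cells at or above the cutoff; 1/3 means at least one of the three markers
cutoff <- 1 / 3
col_info$condition <- ifelse(col_info$marker_pct >= cutoff,
                             "Stressed", "Not stressed")

# Count cells per label
table(col_info$condition)
```

Raising the cutoff (for example to 2/3) makes the "Stressed" label stricter, so it is worth checking the table() counts at a few cutoffs before subsetting.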

Hope that helps - let me know if you are still stuck.

daacarri commented 5 years ago

Wow thanks @asenabouth!

This is actually more than I could have asked for.

I eventually figured out how to manipulate the em_set based off the tutorial but went another direction.

I actually went the route of using a package called AUCell that finds cells with my gene signature and then I labeled those cells in the em_set as per the vignette.

But I actually like your way better, and will probably use one approach as a sanity check for the other!

Thanks again!

asenabouth commented 5 years ago

Hi @daacarri, I think AUCell would be a better method to use to identify the cells, but you can easily incorporate the results into the colInfo dataframe for use with other ascend package functions. Glad you found it helpful though!
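
Incorporating labels from another tool could look something like the sketch below. It assumes the external method has already produced a character vector of selected cell barcodes; `selected_cells` and the toy table are hypothetical. With a real EMSet the table would come from colInfo(em_set) and be written back with colInfo(em_set) <- col_info, as in the earlier reply.

```r
# Toy stand-in for colInfo (made-up barcodes)
col_info <- data.frame(
  cell_barcode = c("cell_1", "cell_2", "cell_3"),
  batch = 1,
  stringsAsFactors = FALSE
)

# Hypothetical result from an external method: barcodes of cells it selected
selected_cells <- c("cell_2", "cell_3")

# Guard against barcodes that do not exist in this dataset
stopifnot(all(selected_cells %in% col_info$cell_barcode))

# Record the external call as a condition column
col_info$condition <- ifelse(col_info$cell_barcode %in% selected_cells,
                             "Stressed", "Not stressed")
```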

Anne

IMB-Computational-Genomics-Lab / ascend

Subset Cells with Gene Label #28