Pre-process Gawad data through downstream analyses

allyhawkins commented 1 year ago

Before tackling #170, we should start just by grabbing the Gawad data from the portal and running it through the most recent version of scpca-downstream-analyses. We will also want to prep the CITE-seq data to perform any cell type assignment including normalization of the CITE-seq counts. I'm filing this issue so that we can first do any of the "prep" before needing to do cell type assignments, up through our default clustering and then adding on prep of CITE-seq. Then once we have those objects we can begin by exploring assignment of cell types as discussed in #170.

sjspielman commented 1 year ago

Some thoughts on how to perform normalization for CITE-Seq data -

This section of OSCA discusses normalization for antibody-derived tag counts. To prepare for normalization, cell-wise size factors must be calculated. They recommend using the function scuttle::medianSizeFactors() for CITE-Seq data, and this is also the scuttle recommendation:

> One valid application of this method lies in the normalization of antibody-derived tag counts for quantifying surface proteins. These counts are usually large enough to avoid zeroes yet are also susceptible to strong composition biases that preclude the use of librarySizeFactors. In such cases, we would also set reference to some estimate of the ambient profile. This assumes that most proteins are not expressed in each cell; thus, counts for most tags for any given cell can be attributed to background contamination that should not be DE between cells.

However, it turns out we have a handful of zeroes in the size factors, as calculated by scuttle::medianSizeFactors(), in some of the Gawad libraries, and therefore our data is actually NOT "large enough to avoid zeroes" in spite of the omics modality. (So, one question could be about the reliability of this sequencing data in the first place?)
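To make the zero size factors concrete, here is a minimal Python sketch of the median-of-ratios idea. It is not scuttle's implementation: it assumes the reference profile is the per-tag mean across cells, it skips any rescaling of the final factors, and the function name and toy counts are made up for illustration.

```python
from statistics import median

def median_size_factors(counts, reference=None):
    """Median-of-ratios size factors (illustrative sketch only).

    counts: list of cells, each a list of per-tag ADT counts.
    reference: per-tag reference profile; assumed here to default to
               the mean count of each tag across all cells.
    """
    n_tags = len(counts[0])
    if reference is None:
        reference = [
            sum(cell[j] for cell in counts) / len(counts) for j in range(n_tags)
        ]
    factors = []
    for cell in counts:
        # Ratio of this cell's counts to the reference, skipping tags
        # whose reference value is zero.
        ratios = [cell[j] / reference[j] for j in range(n_tags) if reference[j] > 0]
        factors.append(median(ratios))
    return factors

# Toy data (hypothetical): three cells, five tags.
cells = [
    [12, 30, 8, 5, 20],  # well-covered cell
    [6, 15, 4, 2, 10],   # roughly half the coverage
    [0, 40, 0, 0, 3],    # zero counts for most tags
]
factors = median_size_factors(cells)
```

In this toy matrix the third cell has zero counts for more than half of its tags, so its median ratio, and therefore its size factor, is exactly 0. That is the situation a size factor of 0 would reflect under this scheme.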

I see a couple different possible ways to proceed here, very open to discussion:

1. We can try medianSizeFactors(), and if there are 0's in the output we can either...
   - Change 0's to ~0, like 1e-100. This will satisfy logNormCounts(), which does not allow size factors of 0, but technically does not preserve the data.
     - Note that I don't think we'll want to wholesale exclude these cells, since mismatched dimensions will be decidedly not fun later. That said, if we filter out those cells entirely at this stage of the pipeline, indexing will all play nicely.
   - Switch to using geometricSizeFactors(), which is documented here. The "Details" section describes that this method is more appropriate for antibody-derived data, which is more deeply sequenced and less likely to have 0's...but we have 0's, so...perhaps this justification is moot for us!
2. We can subset to only high-abundance genes before using medianSizeFactors(). Ideally, the cells coming out as 0 would no longer be 0 in this circumstance because lower-abundance genes would have been removed before calculations, but there's no guarantee that a gene that is high-abundance on average isn't low-abundance for a given cell! Also, without having a normalized logcounts assay yet, it's not immediately clear to me how to identify those genes.
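The options above can be compared side by side in a rough Python sketch. Everything here is illustrative: the helper names, the toy counts, and the "median" factors are hypothetical, the geometric version assumes a pseudo-count of 1 and centering to a mean of 1, and the log-normalization assumes the form log2(count / size factor + 1).

```python
import math

def floor_zero_factors(factors, floor=1e-100):
    """Option: replace zero size factors with a tiny positive value."""
    return [f if f > 0 else floor for f in factors]

def drop_zero_factor_cells(barcodes, counts, factors):
    """Option: remove zero-factor cells, keeping all three lists aligned."""
    keep = [i for i, f in enumerate(factors) if f > 0]
    return (
        [barcodes[i] for i in keep],
        [counts[i] for i in keep],
        [factors[i] for i in keep],
    )

def geometric_size_factors(counts, pseudo_count=1.0):
    """Option: geometric mean of (count + pseudo_count) per cell.

    The pseudo-count keeps zeros from collapsing the geometric mean to 0.
    Factors are centered so their mean is 1 (an assumption of this sketch).
    """
    geo_means = [
        math.exp(sum(math.log(c + pseudo_count) for c in cell) / len(cell))
        for cell in counts
    ]
    center = sum(geo_means) / len(geo_means)
    return [g / center for g in geo_means]

def log_norm(count, size_factor):
    """Assumed log-normalization: log2(count / size_factor + 1)."""
    return math.log2(count / size_factor + 1)

# Hypothetical toy inputs.
barcodes = ["cell_a", "cell_b", "cell_c"]
counts = [[12, 30, 8, 5, 20], [6, 15, 4, 2, 10], [0, 40, 0, 0, 3]]
median_factors = [2.0, 0.9, 0.0]  # pretend output containing one zero

floored = floor_zero_factors(median_factors)
inflated = log_norm(40, floored[2])  # non-zero count in the floored cell
kept_barcodes, kept_counts, kept_factors = drop_zero_factor_cells(
    barcodes, counts, median_factors
)
geo_factors = geometric_size_factors(counts)
```

One thing the sketch makes visible: flooring at 1e-100 divides any non-zero count in that cell by a vanishingly small number, so its normalized values become enormous, which is one way to read "does not preserve the data". Dropping cells keeps barcodes, counts, and factors in step as long as it happens in one place, and the geometric version never returns 0 because of the pseudo-count.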

Tagging @allyhawkins @jashapiro for thoughts!

allyhawkins commented 1 year ago

Based on this, my interpretation is that a medianSizeFactor of 0 means a specific cell is showing low to 0 coverage of that particular ADT. I think it might come from the fact that we need to do some filtering of cells in the ADT matrix based on ADT coverage. If a cell has low or close to 0 coverage of an ADT, then I think we would actually want to remove that cell. In the Quality Control section of the OSCA book on integrating with protein abundance, they do talk about looking at CITE-seq metrics. As part of scpca-nf we add per-cell QC metrics for the CITE-seq data into the colData, but I don't believe we do any filtering of cells with low coverage of an ADT. That might be the first step we need to take prior to size factor estimation and normalization.
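As a rough illustration of filtering on ADT coverage before size factor estimation, here is a Python sketch. The metric names, the helper functions, and the 0.5 threshold are all assumptions made for this example and are not what scpca-nf or OSCA prescribe.

```python
def adt_qc_metrics(cell_counts):
    """Per-cell ADT QC: total counts and number of tags with a non-zero count."""
    return {
        "sum": sum(cell_counts),
        "detected": sum(1 for c in cell_counts if c > 0),
    }

def filter_low_adt_coverage(barcodes, counts, min_detected_fraction=0.5):
    """Keep cells where at least `min_detected_fraction` of tags are detected.

    The threshold is a hypothetical default chosen for this sketch.
    """
    n_tags = len(counts[0])
    kept_barcodes, kept_counts = [], []
    for barcode, cell in zip(barcodes, counts):
        if adt_qc_metrics(cell)["detected"] >= min_detected_fraction * n_tags:
            kept_barcodes.append(barcode)
            kept_counts.append(cell)
    return kept_barcodes, kept_counts

# Hypothetical toy inputs: three cells, five tags.
barcodes = ["cell_a", "cell_b", "cell_c"]
counts = [[12, 30, 8, 5, 20], [6, 15, 4, 2, 10], [0, 40, 0, 0, 3]]

kept_barcodes, kept_counts = filter_low_adt_coverage(barcodes, counts)
```

The 0.5 cutoff is chosen because a median of ratios is 0 exactly when more than half of a cell's tags are 0 (assuming every tag has a positive reference value), so cells passing this filter cannot produce a zero median-based size factor in this sketch.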

sjspielman commented 1 year ago

I think for our data, this section is the way to go since we do not have protein controls (but maybe we should develop this flexibly enough to handle that? or can circle back. i'm on the fence there).

A misc thought I have organizationally, as I'm halfway now through writing a stand-alone 00e script, is whether we actually want to incorporate this pre-processing into scripts/utils/preprocess-sce.R. As I think more about that option I kind of like it! Edit - eh, back on the fence.

sjspielman commented 1 year ago

I'm going to break out the part of this issue that focuses on CITE-seq processing only, since there's a couple tasks here. See #185.

allyhawkins commented 1 year ago

> I think for our data, this section is the way to go since we do not have protein controls (but maybe we should develop this flexibly enough to handle that? or can circle back. i'm on the fence there).

I think we can generally assume that we don't have protein controls for now, and not worry about setting this up unless we have a dataset in which we were explicitly told that protein controls were used. I would agree that the section you linked is where to start.

jashapiro commented 1 year ago

> I don't believe we do any filtering of cells with low coverage of an ADT.

I meant to say this earlier, but yes, we explicitly do not filter for low ADT coverage, as we wanted to preserve all relevant data if somebody wanted to use the expression data alone.
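One way to keep all cells available for expression-only analyses while still excluding low-coverage cells from ADT normalization is to flag rather than remove. The Python sketch below is only an illustration of that idea: the column names (`adt_detected`, `adt_pass`), the helper, and the threshold are hypothetical and do not describe current scpca-nf behavior.

```python
def flag_adt_coverage(col_data, n_tags, min_detected_fraction=0.5, flag="adt_pass"):
    """Return a copy of per-cell metadata with a boolean ADT coverage flag.

    col_data: list of dicts, one per cell, each with an "adt_detected" count.
    No cells are removed; downstream steps decide whether to use the flag.
    """
    return [
        dict(row, **{flag: row["adt_detected"] >= min_detected_fraction * n_tags})
        for row in col_data
    ]

# Hypothetical per-cell metadata for three cells and a five-tag panel.
col_data = [
    {"barcode": "cell_a", "adt_detected": 5},
    {"barcode": "cell_b", "adt_detected": 5},
    {"barcode": "cell_c", "adt_detected": 2},
]

flagged = flag_adt_coverage(col_data, n_tags=5)

# RNA analyses can use every cell; ADT normalization only the passing ones.
rna_barcodes = [row["barcode"] for row in flagged]
adt_barcodes = [row["barcode"] for row in flagged if row["adt_pass"]]
```

With this layout the expression data keeps its full set of cells, and only the ADT steps subset on the flag.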

AlexsLemonade / sc-data-integration

Pre-process Gawad data through downstream analyses #181