AlexsLemonade / sc-data-integration

Pre-process Gawad data through downstream analyses #181

Closed allyhawkins closed 1 year ago

allyhawkins commented 1 year ago

Before tackling #170, we should start by grabbing the Gawad data from the portal and running it through the most recent version of scpca-downstream-analyses. We will also want to prep the CITE-seq data, including normalizing the CITE-seq counts, before performing any cell type assignment. I'm filing this issue so that we can first do all of the "prep" that is needed before cell type assignment: processing up through our default clustering, and then adding on prep of the CITE-seq data. Once we have those objects, we can begin exploring assignment of cell types as discussed in #170.

sjspielman commented 1 year ago

Some thoughts on how to perform normalization for CITE-Seq data -

This section of OSCA discusses normalization for antibody-derived tag counts. To prepare for normalization, cell-wise size factors must be calculated. They recommend using the function scuttle::medianSizeFactors() for CITE-Seq data, and this is also the scuttle recommendation:

One valid application of this method lies in the normalization of antibody-derived tag counts for quantifying surface proteins. These counts are usually large enough to avoid zeroes yet are also susceptible to strong composition biases that preclude the use of librarySizeFactors. In such cases, we would also set reference to some estimate of the ambient profile. This assumes that most proteins are not expressed in each cell; thus, counts for most tags for any given cell can be attributed to background contamination that should not be DE between cells.

However, it turns out we have a handful of zeroes in the size factors, as calculated by scuttle::medianSizeFactors(), in some of the Gawad libraries, and therefore our data is actually NOT "large enough to avoid zeroes" in spite of the omics modality. (So, one question could be about the reliability of this sequencing data in the first place?)
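To illustrate how a zero size factor can arise, here is a minimal plain-Python sketch of a median-ratio size factor. It is not scuttle's implementation (it skips, for example, any rescaling of the factors), and the function name `median_size_factors` and the toy counts are made up for illustration. The assumption is that each cell's factor is the median, across tags, of its counts divided by a reference profile (here the per-tag mean across cells), so a cell with zero counts for more than half of its tags gets a factor of exactly 0.

```python
from statistics import median


def median_size_factors(counts, reference=None):
    """Median-ratio size factors for a list of cells (hypothetical helper).

    counts: list of cells, each a list of per-tag ADT counts.
    reference: per-tag reference profile; defaults to the per-tag mean.
    """
    n_cells = len(counts)
    n_tags = len(counts[0])
    if reference is None:
        reference = [
            sum(cell[j] for cell in counts) / n_cells for j in range(n_tags)
        ]
    factors = []
    for cell in counts:
        # Ratio of each tag's count to the reference, skipping tags whose
        # reference value is zero to avoid dividing by zero.
        ratios = [c / r for c, r in zip(cell, reference) if r > 0]
        factors.append(median(ratios))
    return factors


# Toy data (made up): three cells by five tags.
adt_counts = [
    [120, 80, 15, 40, 10],
    [0, 0, 0, 35, 5],  # three of five tags have zero counts
    [90, 60, 12, 30, 8],
]

size_factors = median_size_factors(adt_counts)
zero_cells = [i for i, sf in enumerate(size_factors) if sf == 0]
print(size_factors)
print("cells with a zero size factor:", zero_cells)
```

In this toy example only the second cell ends up with a zero factor, because the median of its ratios falls on one of its zero-count tags.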

I see a couple different possible ways to proceed here, very open to discussion:

Tagging @allyhawkins @jashapiro for thoughts!

allyhawkins commented 1 year ago

Based on this, my interpretation is that a medianSizeFactor of 0 means a specific cell is showing low to 0 coverage of that particular ADT. I think it might come from the fact that we need to do some filtering of cells in the ADT matrix based on ADT coverage. If a cell has low or close to 0 coverage of an ADT, then I think we would actually want to remove that cell. In the Quality Control section of the OSCA book on integrating with protein abundance, they do talk about looking at CITE-seq metrics. As part of scpca-nf we add per-cell QC metrics for the CITE-seq data to the colData, but I don't believe we do any filtering of cells with low coverage of an ADT. That might be the first step we need to take prior to size factor estimation and normalization.
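A rough sketch of that filtering step, in plain Python rather than the R tooling used in scpca-nf: drop cells that have counts for too few tags before any size factors are estimated. The helper names, the barcodes, the toy counts, and the `min_detected_fraction` threshold of 0.5 are all hypothetical; the threshold is not a recommendation from OSCA or scpca-nf.

```python
def detected_fraction(cell_counts):
    """Fraction of tags with a non-zero count in one cell."""
    return sum(1 for c in cell_counts if c > 0) / len(cell_counts)


def filter_low_coverage_cells(counts, min_detected_fraction=0.5):
    """Keep cells whose fraction of detected tags meets the threshold.

    counts: dict mapping a cell barcode to its list of per-tag ADT counts.
    min_detected_fraction: hypothetical cutoff, chosen only for illustration.
    """
    return {
        barcode: cell
        for barcode, cell in counts.items()
        if detected_fraction(cell) >= min_detected_fraction
    }


# Toy data (made up): three cells by five tags.
adt_counts = {
    "cell_a": [120, 80, 15, 40, 10],
    "cell_b": [0, 0, 0, 35, 5],  # only two of five tags detected
    "cell_c": [90, 60, 0, 30, 8],
}

filtered = filter_low_coverage_cells(adt_counts)
print(sorted(filtered))
```

With these toy counts, `cell_b` is removed and the other two cells are kept. Filtering on the detected fraction is only one possible criterion; a cutoff on total ADT counts per cell would follow the same pattern.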

sjspielman commented 1 year ago

I think for our data, this section is the way to go since we do not have protein controls (but maybe we should develop this flexibly enough to handle that? or can circle back. i'm on the fence there).

A misc thought I have organizationally, as I'm halfway now through writing a stand-alone 00e script, is whether we actually want to incorporate this pre-processing into scripts/utils/preprocess-sce.R. As I think more about that option I kind of like it! Edit - eh, back on the fence.

sjspielman commented 1 year ago

I'm going to break out the part of this issue that focuses on CITE-seq processing only, since there's a couple tasks here. See #185.

allyhawkins commented 1 year ago

I think for our data, this section is the way to go since we do not have protein controls (but maybe we should develop this flexibly enough to handle that? or can circle back. i'm on the fence there).

I think we can generally assume that we don't have protein controls for now, and not worry about setting this up unless we have a dataset in which we were explicitly told that protein controls were used. I would agree that the section you linked is where to start.

jashapiro commented 1 year ago

I don't believe we do any filtering of cells with low coverage of an ADT.

I meant to say this earlier, but yes, we explicitly do not filter for low ADT coverage, as we wanted to preserve all relevant data if somebody wanted to use the expression data alone.