Explore post alevin-fry-unfiltered filtering strategies

allyhawkins commented 3 years ago

In the most recent benchmarking of alevin-fry using the --unfiltered-pl setting followed by filtering with DropletUtils::emptyDrops(), there were instances in which emptyDrops() was very liberal and allowed for very high numbers of cells (~ 20K) to be considered as true cells. These cells were those that had lower UMIs/cell and genes/cell and should likely be filtered out.

We would like to provide both the unfiltered and filtered counts matrices to users, so we should explore some other filtering options. Options that could be explored include:

emptyDrops() alone or followed by removal of cells not passing thresholds
miQC for filtering
cellRangerLikeEmptyDrops()

Some metrics to explore for each filtering strategy are the number of cells that are lost, the distribution of UMI/cell and genes/cell in cells that are filtered vs. not filtered, and the effect on genes (i.e. what types of genes are lost by removing those cells).
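As a rough illustration of the comparison described above, here is a minimal sketch in plain Python (not the R code used in the benchmarking). The barcodes, counts, strategy names, and the `summarize_filtering` helper are all made up for the example; it only shows the kind of per-strategy summary we could tabulate.

```python
from statistics import median

def summarize_filtering(cells, keep):
    """Summarize one filtering strategy.

    cells: dict mapping barcode -> (umi_count, genes_detected)
    keep:  set of barcodes the strategy retains
    """
    kept = [v for b, v in cells.items() if b in keep]
    lost = [v for b, v in cells.items() if b not in keep]

    def med(values, idx):
        # Median of one metric, or None if the group is empty
        return median(v[idx] for v in values) if values else None

    return {
        "n_kept": len(kept),
        "n_lost": len(lost),
        "median_umi_kept": med(kept, 0),
        "median_umi_lost": med(lost, 0),
        "median_genes_kept": med(kept, 1),
        "median_genes_lost": med(lost, 1),
    }

# Hypothetical toy data: barcode -> (UMIs/cell, genes/cell)
cells = {
    "AAAC": (5200, 1800),
    "AAAG": (4100, 1500),
    "AACT": (310, 190),
    "AAGT": (150, 95),
}

# Hypothetical outputs of two filtering strategies
strategies = {
    "strategy_a": {"AAAC", "AAAG", "AACT"},
    "strategy_b": {"AAAC", "AAAG"},
}

summaries = {name: summarize_filtering(cells, keep) for name, keep in strategies.items()}
for name, summary in summaries.items():
    print(name, summary)
```

The same table could be extended with gene-level summaries (for example, which genes lose all of their counts once the filtered cells are removed).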

jashapiro commented 3 years ago

I was looking at cellRangerLikeEmptyDrops(), and while I think it is something we should be exploring, I would suggest that we use the version that is in https://github.com/MarioniLab/DropletUtils/pull/66, which is likely to be integrated into the latest version of DropletUtils in the not too distant future. I think we can install and evaluate this branch using remotes::install_github("MarioniLab/DropletUtils#cellranger")

allyhawkins commented 3 years ago

Based on the promising results from @cbethell looking at miQC in AlexsLemonade/dana-single-cell#15, we are very likely going to be using miQC for filtering out poor quality cells after using either emptyDrops or cellRangerLikeEmptyDrops at least in single-cell samples.

Ideally we would like to be able to use miQC for both single-cell and single-nuclei samples, but are unsure of whether or not the lack of mitochondrial reads that should be observed in single-nuclei samples will break the model assumptions of miQC. @cgreene has your group had the chance to work with miQC with single-nuclei samples or have any advice for how to use miQC or a similar approach with single-nuclei samples?

allyhawkins commented 2 years ago

Since filing this issue, we have completed two analyses to explore filtering of the output from alevin-fry using the --unfiltered-pl option to obtain all possible cells detected in the experiment.

The first analysis compared emptyDrops to emptyDropsCellRanger for removing barcodes that were likely to be empty droplets from the counts matrix. Here, we also tested different thresholds for the UMI minimum used in creating the ambient profile, by using lower=200 and lower=500 with emptyDrops. The findings are summarized in this notebook and were filed in https://github.com/AlexsLemonade/alsf-scpca/pull/128. Overall, we found that increasing the lower threshold from the default lower=100 for emptyDrops resulted in the number of retained cells being more similar to Cell Ranger. In particular, lower=200 usually gives a cell number either equal to or slightly higher than Cell Ranger, while lower=500 results in numbers that are either equal to or slightly lower than Cell Ranger. We also tested emptyDropsCellRanger, which looked the most consistent with Cell Ranger but is not yet fully supported by DropletUtils and had installation issues at the time.
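To make the role of `lower` concrete, here is a simplified sketch in plain Python, not the DropletUtils implementation. As I understand the emptyDrops documentation, barcodes with a total UMI count at or below `lower` are assumed to be empty droplets and are pooled to estimate the ambient profile; the real function then does considerably more (it tests each remaining barcode against that profile). The `ambient_profile` helper and the toy counts are made up for illustration.

```python
def ambient_profile(counts, lower):
    """Pool barcodes assumed to be empty and return gene proportions.

    counts: dict mapping barcode -> dict of gene -> UMI count
    lower:  barcodes with total UMI count <= lower are treated as empty
    Returns (profile, n_ambient), where profile maps gene -> proportion.
    """
    pooled = {}
    n_ambient = 0
    for barcode, genes in counts.items():
        if sum(genes.values()) <= lower:
            n_ambient += 1
            for gene, n in genes.items():
                pooled[gene] = pooled.get(gene, 0) + n
    total = sum(pooled.values())
    profile = {g: n / total for g, n in pooled.items()} if total else {}
    return profile, n_ambient

# Hypothetical toy counts (total UMIs: 80, 150, 450, 4000)
toy_counts = {
    "bc1": {"geneA": 60, "geneB": 20},
    "bc2": {"geneA": 90, "geneB": 60},
    "bc3": {"geneA": 200, "geneB": 250},
    "bc4": {"geneA": 3000, "geneB": 1000},
}

for lower in (100, 200, 500):
    profile, n_ambient = ambient_profile(toy_counts, lower)
    print(f"lower={lower}: {n_ambient} ambient barcodes, profile={profile}")
```

Raising `lower` moves more low-count barcodes into the ambient pool, so fewer barcodes remain as candidates for being called cells, which fits with the smaller cell numbers we saw at lower=200 and lower=500.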

The second analysis looked at the output after removing empty drops to determine whether we could use miQC for both single-cell and single-nuclei samples to inform further filtering of low quality cells. The most current analysis is present in https://github.com/AlexsLemonade/alsf-scpca/pull/130 and a summary of the analysis can be found in this notebook. Here, we found that the default parameters for miQC were the most consistent and worked well for single-cell samples. For single-nuclei samples, the lower range in mito content resulted in cells with lower mito content being considered compromised, which is expected considering single-nuclei samples should have little to no mito reads. We also decided not to perform any further filtering, but rather to include a column with the posterior probability as computed by miQC, and an additional column with suggestions on which cells to filter, in the colData of the filtered sce object.
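The "annotate, don't filter" idea can be sketched as follows in plain Python (again, not the R code in the PR). The column names `prob_compromised` and `suggested_filter`, the `annotate_cells` helper, the toy probabilities, and the 0.75 cutoff are assumptions for illustration only; the point is that every cell stays in the object and users can apply their own cutoff.

```python
def annotate_cells(posteriors, cutoff=0.75):
    """Attach a filtering suggestion to every cell without removing any.

    posteriors: dict mapping barcode -> posterior probability of being compromised
    cutoff:     probability above which a cell is suggested for removal (assumed value)
    """
    annotated = {}
    for barcode, prob in posteriors.items():
        annotated[barcode] = {
            "prob_compromised": prob,
            "suggested_filter": "remove" if prob > cutoff else "keep",
        }
    return annotated

# Hypothetical posterior probabilities for three cells
posteriors = {"c1": 0.02, "c2": 0.91, "c3": 0.75}

annotated = annotate_cells(posteriors)
for barcode, row in annotated.items():
    print(barcode, row)
```

Any additional criteria for the suggestion column would slot into the same place as the cutoff comparison.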

After these analyses, the next steps to complete our exploration of filtering options include:

Making a decision on which lower threshold we would like to use for emptyDrops.
Determining what criteria we want to include for our "suggested filtering" column. This is briefly discussed in https://github.com/AlexsLemonade/alsf-scpca/issues/134. Now that we know we want to incorporate miQC, do we want to add any additional criteria (i.e. also including cells with low mito and high genes detected) that may have low probability compromised according to miQC? In determining this, how are we going to evaluate whether we are doing the "best filtering"?

A potential game plan to address point 2 is:

Perform various filtering (i.e. miQC filtering with or without additional criteria)
Look at the distribution of mito content and number of unique genes, colored by whether cells are retained or removed
Perform basic PCA and UMAP
Identify if any clusters appear that are composed of cells with low viability

allyhawkins commented 2 years ago

Since we have decided to use emptyDrops with lower=200 and to add the miQC probability to the colData for now, can we close this? The next steps would be to continue to make decisions on the ccdl_suggests column, which I believe is addressed in #134.

jaclyn-taroni commented 2 years ago

Sure

AlexsLemonade / alsf-scpca

Explore post alevin-fry-unfiltered filtering strategies #105