AlexsLemonade / alsf-scpca

Management and analysis tools for ALSF Single-cell Pediatric Cancer Atlas data.
BSD 3-Clause "New" or "Revised" License
0 stars 1 forks source link

Add exploration of miQC filtering #130

Closed allyhawkins closed 2 years ago

allyhawkins commented 3 years ago

This PR addresses the second points discussed in #105 of whether or not we can introduce usage of miQC as a secondary filtering step after emptyDrops. Here, I am comparing 2 of the single-cell samples and 2 of the single-nuclei samples that have been run through Alevin-fry with selective alignment to the splici index with the --unfiltered-pl filtering option and cr-like-em resolution (configurations chosen for processing). These samples have previously been filtered to remove any potential empty droplets using emptyDrops with lower=200 as described in #128. I am keeping this in a draft stage right now because as we decide the thresholds to use for emptyDrops, I think we will want to update the input to this notebook.

For the most part I chose to follow the miQC vignette and started by using the plotMetrics() function to inspect each of the samples and see if they fit the assumptions of the model. One of these assumptions is that they have cells with a range of mitochondrial contents, which when looking at the plotMetrics() output for the single-nuclei samples (SCPCR000118, SCPCR000119), it appears that this is not the case.

I went ahead and applied miQC filtering using different posterior cutoffs and different non-linear mixture models in addition to the default settings and observed relatively similar results, regardless of the model that is used for the single-nuclei samples. I then compared the filtering to use of pre-determined thresholds to see how many cells are being removed and how filtering affects the distribution of UMIs and genes detected/cell.

Some questions that I still have:

jaclyn-taroni commented 3 years ago

I have the same comment here as https://github.com/AlexsLemonade/alsf-scpca/pull/128#issuecomment-910245006.

arielah commented 3 years ago

Hi @jashapiro, you're right that there being high-count, low-mito cells being excluded is undesired behavior. In the devel version of the miQC package on bioconductor (should be released in November I think?), we've added a new parameter keep_all_below_cutoff=TRUE to the filtering functions make sure these cells are not excluded. You can either run the development version or do the cutoff you mentioned, which should accomplish roughly the same thing. Hope this is helpful!

allyhawkins commented 3 years ago

Thanks for the feedback @jashapiro. I went ahead and updated the plots to remove the print outs of the lists along with the plots, make sure all the plots are named, removed the extraneous boxplots, and added in some commentary per your suggestions.

It might be worth considering when we include our "ccdl_suggests" column that we integrate the posterior probability with also keeping cells that have above a high threshold of genes detected and low mito content as you suggested when we make those recommendations. So we don't lose those few cells that appear to get lost with the default settings of miQC that we agreed to use moving forward.

cgreene commented 3 years ago

For miQC, an assumption is that you can successfully fit the mixture model. If your data are not a mixture (i.e., bad "cells" have already been filtered, you don't really have mitochondria, or the data are so clean that few cells are compromised) or if the compromised fraction is so small that you can't accurately estimate, miQC is probably a bad idea.

jashapiro commented 2 years ago

I just want to add a note here for @arielah that SCPCR000126 in this case represents an interesting pathological case for the spline fit where the model intercept (which I think is used to determine which group to keep) happens to be lower in what should be the the compromised group, resulting in misclassification. I don't know how often people are using splines (they seem dangerous), but I thought it was worth a note. Not sure I have a good idea of a better solution... maybe using some kind of average of the lower portion of the fit rather than just the intercept point?

Screen Shot 2021-09-08 at 10 32 38 AM
allyhawkins commented 2 years ago

What are our goals with these comparisons? Looking at the cell counts seems to be a pretty low-level metric and it is not clear how that relates to whether we are doing a good or bad job at removing compromised cells.

What are the next steps after this? How should we think about these results when making recommendations on filtering?

I would agree that only looking at the cell count is not the most informative and just gives us an idea of how many cells we are throwing out using each method. The goal with this PR was to see if using miQC was a viable option for the single-cell AND single-nuclei samples. The main focus being does applying the miQC mixture model to our samples work as expected in suggesting filtering of compromised cells. There are a few conclusions I would make after looking at the results:

  1. Using the default parameters for miQC with the linear mixture model gives the most consistent and reliable results where the majority of cells with low mito and high unique genes are classified as cells to keep.
  2. Cells from single-nuclei samples do not show as high a range of mito content and therefore the mixture model suggests a cutoff that leads to cells with a lower mito content than in single-cell being classified as compromised. This however is slightly expected considering single-nuclei samples should have little to no mito reads.
  3. From this we decided that we would not actually perform any further filtering ourselves, but rather apply miQC to each sample using the linear mixture model to obtain the posterior probability of a cell being compromised. Then we would create a ccdl_suggests column to identify which cells we would suggest filtering.

The next steps to me are to identify the criteria that we need to recommend filtering. Now that we have settled on using miQC then I would say we need to identify if there are any other boundaries (such as combining a probability cutoff with a genes detected and mito cutoff) we should consider when creating the ccdl_suggests column.

I actually think if we want to evaluate the effects we may need to do something a bit more in depth where we look at PCA or UMAP results, and see if we are still getting clusters that seem to correspond to "bad cells". That is probably too much work at this stage, but I feel like it is something we might want to explore more, and at least point toward at the bottom of this document.

I think having additional PCA and UMAP comparisons would be helpful, and perhaps something to be added when we determine the criteria needed for suggested filtering in a future PR. I can add a few lines at the bottom of the PR to address these final conclusions and thoughts.

allyhawkins commented 2 years ago

@jashapiro I went ahead and fixed the mito cutoff and added in comparisons of two different mito cutoffs and using just a mito cutoff with no genes detected to look at the number of cells that get removed. I also adjusted the last plot looking at the number of cells to be a bar plot to make it easier to compare across the various filtering settings within each sample.

Additionally, I added some concluding remarks at the end of the file on some thoughts about what we learned here and next steps. Feel free to take a look at the updated rendered html file when you get a chance and let me know if you are okay with merging?

cgreene commented 2 years ago

Just working through my inbox in stochastic order and saw the splines: definitely wouldn't recommend using splines. Also, I think the version of miQC associated with the paper has a flag to avoid throwing out cells in certain pathologic cases (i.e., never throw out a cell when one with fewer genes or higher mitochondrial content is preserved). If you're going to run miQC over many samples and may not manually inspect the results, this may be a thing you want to use.