Closed allyhawkins closed 2 years ago
I have the same comment here as https://github.com/AlexsLemonade/alsf-scpca/pull/128#issuecomment-910245006.
Hi @jashapiro, you're right that there being high-count, low-mito cells being excluded is undesired behavior. In the devel version of the miQC package on bioconductor (should be released in November I think?), we've added a new parameter keep_all_below_cutoff=TRUE
to the filtering functions make sure these cells are not excluded. You can either run the development version or do the cutoff you mentioned, which should accomplish roughly the same thing. Hope this is helpful!
Thanks for the feedback @jashapiro. I went ahead and updated the plots to remove the print outs of the lists along with the plots, make sure all the plots are named, removed the extraneous boxplots, and added in some commentary per your suggestions.
It might be worth considering when we include our "ccdl_suggests" column that we integrate the posterior probability with also keeping cells that have above a high threshold of genes detected and low mito content as you suggested when we make those recommendations. So we don't lose those few cells that appear to get lost with the default settings of miQC that we agreed to use moving forward.
For miQC, an assumption is that you can successfully fit the mixture model. If your data are not a mixture (i.e., bad "cells" have already been filtered, you don't really have mitochondria, or the data are so clean that few cells are compromised) or if the compromised fraction is so small that you can't accurately estimate, miQC is probably a bad idea.
I just want to add a note here for @arielah that SCPCR000126 in this case represents an interesting pathological case for the spline fit where the model intercept (which I think is used to determine which group to keep) happens to be lower in what should be the the compromised group, resulting in misclassification. I don't know how often people are using splines (they seem dangerous), but I thought it was worth a note. Not sure I have a good idea of a better solution... maybe using some kind of average of the lower portion of the fit rather than just the intercept point?
What are our goals with these comparisons? Looking at the cell counts seems to be a pretty low-level metric and it is not clear how that relates to whether we are doing a good or bad job at removing compromised cells.
What are the next steps after this? How should we think about these results when making recommendations on filtering?
I would agree that only looking at the cell count is not the most informative and just gives us an idea of how many cells we are throwing out using each method. The goal with this PR was to see if using miQC was a viable option for the single-cell AND single-nuclei samples. The main focus being does applying the miQC mixture model to our samples work as expected in suggesting filtering of compromised cells. There are a few conclusions I would make after looking at the results:
ccdl_suggests
column to identify which cells we would suggest filtering. The next steps to me are to identify the criteria that we need to recommend filtering. Now that we have settled on using miQC then I would say we need to identify if there are any other boundaries (such as combining a probability cutoff with a genes detected and mito cutoff) we should consider when creating the ccdl_suggests
column.
I actually think if we want to evaluate the effects we may need to do something a bit more in depth where we look at PCA or UMAP results, and see if we are still getting clusters that seem to correspond to "bad cells". That is probably too much work at this stage, but I feel like it is something we might want to explore more, and at least point toward at the bottom of this document.
I think having additional PCA and UMAP comparisons would be helpful, and perhaps something to be added when we determine the criteria needed for suggested filtering in a future PR. I can add a few lines at the bottom of the PR to address these final conclusions and thoughts.
@jashapiro I went ahead and fixed the mito cutoff and added in comparisons of two different mito cutoffs and using just a mito cutoff with no genes detected to look at the number of cells that get removed. I also adjusted the last plot looking at the number of cells to be a bar plot to make it easier to compare across the various filtering settings within each sample.
Additionally, I added some concluding remarks at the end of the file on some thoughts about what we learned here and next steps. Feel free to take a look at the updated rendered html file when you get a chance and let me know if you are okay with merging?
Just working through my inbox in stochastic order and saw the splines: definitely wouldn't recommend using splines. Also, I think the version of miQC associated with the paper has a flag to avoid throwing out cells in certain pathologic cases (i.e., never throw out a cell when one with fewer genes or higher mitochondrial content is preserved). If you're going to run miQC over many samples and may not manually inspect the results, this may be a thing you want to use.
This PR addresses the second points discussed in #105 of whether or not we can introduce usage of miQC as a secondary filtering step after
emptyDrops
. Here, I am comparing 2 of the single-cell samples and 2 of the single-nuclei samples that have been run through Alevin-fry with selective alignment to thesplici
index with the--unfiltered-pl
filtering option andcr-like-em
resolution (configurations chosen for processing). These samples have previously been filtered to remove any potential empty droplets usingemptyDrops
withlower=200
as described in #128. I am keeping this in a draft stage right now because as we decide the thresholds to use foremptyDrops
, I think we will want to update the input to this notebook.For the most part I chose to follow the miQC vignette and started by using the
plotMetrics()
function to inspect each of the samples and see if they fit the assumptions of the model. One of these assumptions is that they have cells with a range of mitochondrial contents, which when looking at theplotMetrics()
output for the single-nuclei samples (SCPCR000118, SCPCR000119), it appears that this is not the case.I went ahead and applied miQC filtering using different posterior cutoffs and different non-linear mixture models in addition to the default settings and observed relatively similar results, regardless of the model that is used for the single-nuclei samples. I then compared the filtering to use of pre-determined thresholds to see how many cells are being removed and how filtering affects the distribution of UMIs and genes detected/cell.
Some questions that I still have: