AlexsLemonade / alsf-scpca

Management and analysis tools for ALSF Single-cell Pediatric Cancer Atlas data.
BSD 3-Clause "New" or "Revised" License

Comparison of emptyDrops to emptyDropsCellRanger on Alevin-fry unfiltered output #128

Closed allyhawkins closed 2 years ago

allyhawkins commented 3 years ago

In this PR, I am addressing the first point in #105: whether the DropletUtils::emptyDropsCellRanger() function provides an alternative for filtering empty droplets from the unfiltered output produced by Alevin-fry with the --unfiltered-pl option. Previously, we had found that DropletUtils::emptyDrops() retained many more cells than Cell Ranger reported on the same samples, and that cells kept by emptyDrops() but filtered out by Cell Ranger lowered the distributions and medians of UMI/cell and genes/cell. DropletUtils::emptyDropsCellRanger() aims to mimic the filtering performed by Cell Ranger and STARsolo, so here I tested it against emptyDrops() and then looked at the number of cells and the distributions of UMI/cell and genes/cell after filtering.

A few things to note for this notebook:

For reviewers, are there any other metrics that would be helpful to see? Or any other modifications of parameters for the filtering methods?

DongzeHE commented 3 years ago

Hi @allyhawkins,

I noticed that this PR mentions the emptyDropsCellRanger() function we are working on. To be honest, I would not expect its results to be very different from those of emptyDrops() for single-cell experiments.

The reason we implemented that function is that, for single-nucleus datasets, we found the knee-finding method of alevin-fry sometimes returns far fewer high-quality barcodes than STARsolo and Cell Ranger. This is because single-nucleus experiments usually capture fewer reads than single-cell experiments, which makes the UMI count rank plot very flat and therefore makes alevin-fry's knee finding difficult and conservative. The same thing happens when applying emptyDrops() to the unfiltered count matrix. However, this does not affect STARsolo and Cell Ranger much, because they apply some hard thresholds (the same as the arguments we use in emptyDropsCellRanger()). As there is no standard way to predict high-quality nuclei for single-nucleus datasets, we decided to implement this function to make sure that alevin-fry can at least get results similar to those of other tools.
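To illustrate why a flat barcode-rank curve makes knee finding hard, here is a toy Python sketch using a maximum-distance-to-line heuristic on simulated counts. This is an illustration only, not alevin-fry's actual knee-detection algorithm, and every name and parameter value here is made up:

```python
import numpy as np

def knee_point(counts):
    """Find the knee of a barcode-rank curve as the point of maximum
    perpendicular distance from the straight line joining the curve's
    endpoints in log-log space. Toy heuristic for illustration only."""
    counts = np.sort(np.asarray(counts))[::-1]
    counts = counts[counts > 0]
    x = np.log10(np.arange(1, counts.size + 1))
    y = np.log10(counts.astype(float))
    # Unit vector along the line from the first point to the last point.
    line = np.array([x[-1] - x[0], y[-1] - y[0]])
    line = line / np.linalg.norm(line)
    # Perpendicular distance of every point from that line.
    rel = np.column_stack([x - x[0], y - y[0]])
    proj = np.outer(rel @ line, line)
    dist = np.linalg.norm(rel - proj, axis=1)
    return int(np.argmax(dist)), float(dist.max())

rng = np.random.default_rng(0)
# Sharp, single-cell-like curve: clear separation between cells and ambient.
sharp = np.concatenate([rng.poisson(5000, 500), rng.poisson(20, 5000)])
# Flat, single-nucleus-like curve: much weaker separation.
flat = np.concatenate([rng.poisson(400, 500), rng.poisson(100, 5000)])
```

On the sharp curve the knee lands near the true 500-barcode boundary with a large margin; on the flat curve the margin (the maximum distance) is much smaller, so a little noise can move the knee a long way, which is the conservatism described above.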

The version of emptyDropsCellRanger() you used is not the finalized version, but I think the final version will come out soon.

Sincerely, Alevin-fry team

allyhawkins commented 3 years ago

Hi @DongzeHE, thanks so much for your reply and this insight! We had also noticed that the knee method in alevin-fry was filtering out many more cells from single-nucleus samples than Cell Ranger would have, which was part of our motivation to use the --unfiltered-pl approach rather than rely on the knee filtering. Here, we are testing this method on both single-cell and single-nucleus samples, and currently I'm seeing the same effect on both of them. We will be on the lookout for the updated version of emptyDropsCellRanger.

I'm curious whether you have found that different hard thresholds are needed for single-nucleus versus single-cell samples, since single nuclei are expected to have lower counts?

jashapiro commented 3 years ago

It looks from https://github.com/alexdobin/STAR/blob/master/docs/STARsolo.md#emptydrop-like-filtering like the Cell Ranger lower bound on UMI is 500 (umiMin), though it is a bit hard to tell from this exactly how it is applied. It is also hard to tell exactly how the various variables interact. In particular, I can imagine that umiMinFracMedian sets a minimum bound based on a fraction of the median UMI count, but which takes priority in the interaction between umiMin and umiMinFracMedian? Is it always the larger or the smaller? I can imagine arguments for either choice!

Either way, it seems to me that a higher threshold is a quite reasonable adjustment to make. I do not think there is likely to be much value in a 500 UMI cell or nucleus. I would be curious to know if EmptyDrops was eliminating most of those cells anyway with the lower (200) threshold.

DongzeHE commented 3 years ago

Hi @allyhawkins,

The emptyDropsCellRanger function in the cellranger branch is ready. We still need to improve the description, but the function itself does what it is expected to do.

As for your question, we think the answer is yes, you do need different hard thresholds for single-cell and single-nucleus samples. This is because the default values of those hard thresholds are just heuristics, chosen by 10X and then propagated to STARsolo, so there is no reason to believe they should apply equally well to all experiments. However, we don't know what the best thresholds are, as they may vary from experiment to experiment. If testing different values is a part of your analysis, we will look forward to hearing your conclusion!

As for how the parameters interact, here is the relevant part in STARsolo, and here is how we implement it in R.

Sincerely, Dongze

allyhawkins commented 3 years ago

> In particular, I can imagine that umiMinFracMedian sets a minimum bound based on a fraction of the median UMI count, but which takes priority in the interaction between umiMin and umiMinFracMedian? Is it always the larger or the smaller? I can imagine arguments for either choice!

Based on going through some of the R code, it looks like the priority is going to be the smaller of the two.

> Either way, it seems to me that a higher threshold is a quite reasonable adjustment to make. I do not think there is likely to be much value in a 500 UMI cell or nucleus. I would be curious to know if EmptyDrops was eliminating most of those cells anyway with the lower (200) threshold.

So from what I can tell, the alevin-fry paper used lower=500 for emptyDrops and umiMin=500 (the default) for the emptyDropsCellRanger function. We could try lower=500 as well, though I tend to agree that many of the cells near the lower bound will be thrown out regardless. I was trying to be consistent with Cell Ranger without being overly stringent.

jashapiro commented 3 years ago

> Based on going through some of the R code, it looks like the priority is going to be the smaller of the two.

Based on the code here, I drew the other conclusion; it is the larger of the specified minimum and the value calculated from the umi.min.frac.median (which uses the median of the top set of cells, not the whole population, which can lead to other complexity depending on how that top set gets defined!)
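For what it's worth, the "larger of the two" reading can be written down in a few lines. This is a hedged logic sketch of that reading, not the actual STARsolo or DropletUtils code; the function name `umi_floor` and the `n_top` stand-in for the "top set" of barcodes are invented for illustration (the real definition of the top set, via expected cells and a max percentile, is more involved):

```python
import statistics

def umi_floor(counts, umi_min=500, umi_min_frac_median=0.01, n_top=3000):
    """Sketch of one reading of the STARsolo-style UMI floor: the LARGER
    of the hard minimum (umiMin) and umiMinFracMedian times the median
    UMI count of the top barcodes. Illustrative names and defaults only."""
    top = sorted(counts, reverse=True)[:n_top]
    return max(umi_min, umi_min_frac_median * statistics.median(top))
```

Under this reading, umiMinFracMedian only matters for deeply sequenced samples (where 1% of the top-cell median exceeds 500), and the hard umiMin takes over for shallow ones.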

In any case, I would like to see the effect of the cutoff at 500, since that is what is used elsewhere. A higher cutoff may make the calculations a bit more efficient as well, since it will require fewer tests.

jashapiro commented 3 years ago

> I was trying to be consistent with Cell Ranger without being overly stringent.

I think we can err on the side of consistency with common parameters as long as it seems to make sense. As always, we have the out of telling people to go use the unfiltered data and do their own filtering.

That said, I would also like to see a "bad" example here: do we have some snRNA-seq data with a knee plot that looks wonky, so we can look at what happens in that case? I'd be curious what happens with low count data, and the examples here don't seem to cover that situation very well.

allyhawkins commented 3 years ago

> In any case, I would like to see the effect of the cutoff at 500, since that is what is used elsewhere. A higher cutoff may make the calculations a bit more efficient as well, since it will require fewer tests.

Sounds reasonable, I will go ahead and add in lower=500 to the comparisons.

> That said, I would also like to see a "bad" example here: do we have some snRNA-seq data with a knee plot that looks wonky, so we can look at what happens in that case? I'd be curious what happens with low count data, and the examples here don't seem to cover that situation very well.

We do have two examples of bad snRNA-seq data that I can add in here. I haven't yet processed them with alevin-fry's cr-like-em resolution (we switched to that along with the most recent version of alevin-fry), but I don't think that will be a confounder here, as the filtering was still done using --unfiltered-pl.

jaclyn-taroni commented 3 years ago

I just looked at the HTML file to get a quick idea of these results, and one thing that jumped out at me from the plots is that I don't know anything about the samples aside from their identifiers. I can guess which are single-cell and which are single-nucleus, but we don't want folks to have to guess if they happen to look at this!

allyhawkins commented 3 years ago

@jashapiro I went ahead and updated this PR with the addition of 2 "bad" single-nuclei samples (SCPCR000118 and SCPCR000119), added in testing of emptyDrops with lower=500, and also added a tag for either cell or nuclei to the sample names so that it is present in all of the plots.

One thing to note is that in updating this PR, I used the most recent version of emptyDropsCellRanger, which updates the way the ambient profile is calculated in this commit: https://github.com/MarioniLab/DropletUtils/commit/ebf214dfc60da314ba8324d18575e05a79e5b600. Because of this, the results have changed, and we see that emptyDropsCellRanger is now more similar to Cell Ranger and to emptyDrops with higher values of lower.
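As a reminder of what the ambient profile is conceptually: pool the counts of all barcodes at or below the lower threshold and normalize to gene proportions. A minimal Python sketch, assuming a dense genes-by-barcodes matrix (the actual DropletUtils calculation, including the change in the linked commit, is more involved than this):

```python
import numpy as np

def ambient_profile(counts, lower=100):
    """Simplified ambient profile: pool counts from all barcodes whose
    total UMIs are <= `lower`, then normalize to gene proportions.
    Conceptual illustration only, not the DropletUtils implementation."""
    counts = np.asarray(counts, dtype=float)   # genes x barcodes
    totals = counts.sum(axis=0)                # UMIs per barcode
    pooled = counts[:, totals <= lower].sum(axis=1)
    return pooled / pooled.sum()
```

The choice of lower therefore affects both which barcodes are tested and what the "ambient" expectation looks like, which is one reason results shift when the profile calculation changes.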

allyhawkins commented 2 years ago

> The one thing I think this leaves a bit unresolved is a recommended action plan. Based on my read of the results, I would probably go with lower=200, as I might rather be a bit more lenient than less, but I could probably be convinced to go the other way as well! If emptyDropsCellRanger were stable and released, I might go with that, but as it is not, I don't think it is the right choice at this time. We may want to revisit if that changes before we process everything.

I went back and looked at this again, and it looks like lower=200 usually gives a cell number either equal to Cell Ranger's or a little higher, while lower=500 is either equal or a little lower. So it really is a question of how lenient we want to be. My only pro for going with lower=500 is that it is what the alevin-fry paper uses, and it may be slightly more consistent with emptyDropsCellRanger (which I also agree I would pick in general if it were slightly more developed). I think I could really be convinced either way.

jashapiro commented 2 years ago

> The one thing I think this leaves a bit unresolved is a recommended action plan. Based on my read of the results, I would probably go with lower=200, as I might rather be a bit more lenient than less, but I could probably be convinced to go the other way as well! If emptyDropsCellRanger were stable and released, I might go with that, but as it is not, I don't think it is the right choice at this time. We may want to revisit if that changes before we process everything.
>
> I went back and looked at this again, and it looks like lower=200 usually gives a cell number either equal to Cell Ranger's or a little higher, while lower=500 is either equal or a little lower. So it really is a question of how lenient we want to be. My only pro for going with lower=500 is that it is what the alevin-fry paper uses, and it may be slightly more consistent with emptyDropsCellRanger (which I also agree I would pick in general if it were slightly more developed). I think I could really be convinced either way.

I agree. My point is mostly that it would be worth including these overall thoughts in the document, as that is what we will likely go back to first rather than the series of github comments on the PR!

allyhawkins commented 2 years ago

> I agree. My point is mostly that it would be worth including these overall thoughts in the document, as that is what we will likely go back to first rather than the series of github comments on the PR!

Good point! I went ahead and added a section with conclusions, summarizing the points in this discussion so we have it there for reference.