Open ljudevitluka opened 1 year ago
Hi @ljudevitluka,
Thanks for the question! What is your intended workflow? I think there may be a slight disconnect between the intended usage of expect-cells and your understanding.
In general there are several ways to generate a permit list with alevin-fry:
As you suggest, force cells forces a specific number of cells to be quantified (the top N if you pass the argument N, as long as there are at least N distinct barcodes in the input).
The purpose of expect cells is to provide a liberal quantification for a number of cells around the number you specify. The way it actually achieves this is motivated by what Cell Ranger does: it takes the barcode frequency distribution up to the number you specify, looks at the 99th percentile, and then includes any barcode whose frequency is at least 1/10th of that value. As you suggest, the idea here is that it is better to quantify more cells, under the assumption that they can later be filtered out during analysis if they are of low or dubious quality.
Now, I'd expected the knee method to be most in line with what you seem to suggest from the plot. The purpose of the knee method is to find the cutoff at the knee of the rank plot. This method doesn't take an argument; it builds the plot and attempts to find the knee itself. It usually works well, but it tends to be quite conservative in the number of cells it calls, i.e. it errs on the side of excluding cells from quantification rather than including them.
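To make the difference between the three strategies above concrete, here is a minimal Python sketch. It is not alevin-fry's implementation: the function names, the nearest-rank percentile, and the "farthest point above the chord of the log-log rank plot" knee heuristic are all assumptions chosen for illustration, and `counts` is a hypothetical mapping from barcode to read count.

```python
import math


def force_cells(counts, n):
    """Keep the n most frequent barcodes (assumes at least n distinct barcodes)."""
    ranked = sorted(counts, key=counts.get, reverse=True)
    return set(ranked[:n])


def expect_cells(counts, n_expected):
    """Cell Ranger-style heuristic (illustrative only).

    Take the top n_expected frequencies, find their 99th percentile
    (nearest-rank definition, an assumption), and keep every barcode
    whose frequency is at least one tenth of that value.
    """
    top = sorted(counts.values(), reverse=True)[:n_expected]
    ascending = sorted(top)
    rank = math.ceil(0.99 * len(ascending))  # nearest-rank percentile
    cutoff = ascending[rank - 1] / 10
    return {bc for bc, c in counts.items() if c >= cutoff}


def knee_cells(counts):
    """Generic knee heuristic (illustrative only, not alevin-fry's procedure).

    On the log10(rank) vs log10(count) plot, pick the point that lies
    farthest above the straight line joining the first and last points,
    and keep all barcodes up to and including that rank.
    Assumes at least two barcodes and strictly positive counts.
    """
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    xs = [math.log10(i + 1) for i in range(len(ranked))]
    ys = [math.log10(c) for _, c in ranked]
    dx, dy = xs[-1] - xs[0], ys[-1] - ys[0]
    # The cross product is proportional to the signed distance above the chord.
    knee = max(
        range(len(ranked)),
        key=lambda i: dx * (ys[i] - ys[0]) - dy * (xs[i] - xs[0]),
    )
    return {bc for bc, _ in ranked[: knee + 1]}


# Toy data: 5 clear cells, 2 borderline barcodes, 20 background barcodes.
toy = {f"cell{i}": 1000 - 10 * i for i in range(5)}
toy.update({f"mid{i}": 150 for i in range(2)})
toy.update({f"bg{i}": 3 for i in range(20)})

print(len(force_cells(toy, 3)), len(expect_cells(toy, 5)), len(knee_cells(toy)))
```

On this toy input the expect-cells sketch keeps the two borderline barcodes (7 barcodes in total), while the knee sketch stops at the 5 clear cells, which mirrors the liberal versus conservative behavior described above.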
However, since you are using 10x Chromium technology, what I'd actually recommend as a default pipeline is to use --unfiltered-pl. This allows you to pass the unfiltered permit list (e.g. the 10x v2 or 10x v3 "whitelist") to alevin-fry. It will then quantify all cells above a nominal count threshold (default 10 reads). This results in a count matrix with many cells, which can then be filtered in downstream analysis using a dedicated algorithm such as DropletUtils::emptyDrops. Together with filtering by e.g. mitochondrial gene content, this gives your downstream analysis pipeline the most information to distinguish between "high-quality" and "low-quality" cells. That label is highly correlated with read/UMI count, but it's not one-to-one.
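As a rough illustration of the unfiltered permit list idea, the following Python sketch keeps every observed barcode that appears in an external permit list and reaches a minimum read count. The function name and the data are hypothetical, the inclusive boundary at the threshold is an assumption, and any barcode correction that a real tool may perform is deliberately left out.

```python
def unfiltered_pl_cells(counts, permit_list, min_reads=10):
    """Keep barcodes that are on the permit list and have >= min_reads reads.

    counts:      hypothetical dict mapping barcode -> read count
    permit_list: set of valid barcodes (e.g. loaded from a 10x "whitelist" file)
    min_reads:   nominal threshold; 10 mirrors the default mentioned above,
                 and treating it as inclusive is an assumption
    """
    return {
        bc for bc, c in counts.items()
        if bc in permit_list and c >= min_reads
    }


# Toy example with made-up barcodes.
observed = {"AAAC": 5000, "AAAG": 12, "AAAT": 4, "CCCC": 900}
whitelist = {"AAAC", "AAAG", "AAAT"}
print(sorted(unfiltered_pl_cells(observed, whitelist)))
```

The result is intentionally permissive: low-count barcodes such as the second one survive this step, and deciding whether they are real cells is left to the downstream filter.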
Let me know if the above makes sense, or if you have any other questions. Also, looping in @DongzeHE as he may have thoughts as well.
Best, Rob
Hi all,
I totally agree with what @rob-p said. One thing to add: if you have an expected number of cells in mind, you can also try running DropletUtils::emptyDropsCellRanger with n.expected.cells set accordingly (e.g. n.expected.cells = 10000). This function mimics the behavior of CellRanger's cell-calling strategy. Usually, it will return more cells than DropletUtils::emptyDrops.
Best, Dongze
Dear @rob-p @DongzeHE,
thank you very much for your help and explanations!
At first, I actually tested all methods 1-4 to find the "best" way to get a list of valid cells out of my data. In the end, choosing --force-cells with a setting of 20,000 seemed reasonable, as it is more than what we loaded in the library prep and less than the ~70,000 calls that both --expect-cells and --knee-distance gave.
Based on your reply, a good approach would be to use --unfiltered-pl with the 10x v3 "whitelist" settings and proceed with DropletUtils to generate a list of high-quality cells.
We are comparing 4 states: healthy (control) vs. infected vs. vaccinated+infected vs. "placebo"-vaccinated+infected; all conditions in duplicates. My intended workflow is: Quality filtering per sample > Generate a high-quality cell list per sample > Merge high-quality cells from all samples together > Normalise > Cluster > Identify specific/enriched cell types and, if possible, do differential expression analysis. Hopefully, this makes sense.
All of these analyses are very new to me, so yes, I think my understanding was probably wrong. I was thinking that, after mapping, the list of cells generated with alevin-fry is almost complete, and that afterward only mtDNA content and rRNA content are used for filtering.
Best,
Luka
Discussed in https://github.com/COMBINE-lab/alevin-fry/discussions/101