COMBINE-lab / alevin-fry

🐟 πŸ”¬πŸ¦€ alevin-fry is an efficient and flexible tool for processing single-cell sequencing data, currently focused on single-cell transcriptomics and feature barcoding.
https://alevin-fry.readthedocs.io
BSD 3-Clause "New" or "Revised" License
156 stars 15 forks source link

How to make alevin-fry --expect cells to be better at guessing the number of cells? #102

Open ljudevitluka opened 1 year ago

ljudevitluka commented 1 year ago

Discussed in https://github.com/COMBINE-lab/alevin-fry/discussions/101

Originally posted by **ljudevitluka** January 31, 2023 Hi all, I am working on the scRNAseq dataset generated with the 10x Genomics platform. We aimed to collect 10, 000 cells per experiment, and we mapped the reads to the transcriptome (since no genomic reference is available at the moment). Everything seems to be working more or less ok with alevin-fry, but in my samples, I always get the overestimation of the cell number (as per the graph bellow). I am aware we could use --force-cells, but would be more comfortable if the knee is detected by the --expect-cells flag. I tried calling --expect-cells 5000, 8000, and 10000 cells but the results were always overestimated. My question is: Is there a way to make --expect-cells more sensitive? If I use --force-cells how will this affect my downstream analysis if some of the samples are carrying noise? If there is no automated way, is it better to take a bit of noise and hope it will get filtered out based on the further cell QC (mitoDNA genes content, ribosomal content etc.) or to be on the secure side and take a bit less cell with a risk of loss of the cells that have a lower number of genes expressed? Thank you very much for your help and opinions :) Hopefully, this question is not out of the scope of discussion. All the best, Luka ![image](https://user-images.githubusercontent.com/55898920/215842266-21e2c1b2-bbeb-4033-84d6-5c02f10018d9.png)
rob-p commented 1 year ago

Hi @ljudevitluka,

Thanks for the question! What is your intended workflow? I think there may be a slight disconnect between the intended usage of expect-cells and your understanding.

In general there are several ways to generate a permit list with alevin-fry:

  1. The knee method
  2. Expect cells
  3. Force cells
  4. Unfiltered permit list
  5. Valid bc permit list

As you suggest, force cells forces a specific number of cells to be quantified (the top N if you pass the argument N, as long as there are at least N distinct barcodes in the input).

The purpose of expect cells is to provide a liberal quantification for a number of cells around the number you specify. The way it actually achieves this is motivated by what Cell Ranger does β€” taking the frequency distribution up to the number you specify, looking at the 99th percentile, and then including anything that has up to 1/10th that number of barcodes. As you suggest, the idea here is that it is better to quantify more cells, under the assumption that they can later be filtered out if they are of low or dubious quality once you are doing your analysis.

Now, I'd expected the knee method to be most inline with what you seem to suggest from the plot. The purpose of the knee method is to find the cutoff at the knee of the rank plot. This method doesn't take an argument, and build the plot and attempts to find the knee itself. This method usually works well, however it tends to be quite conservative in the number of cells it calls β€” i.e. it errs on the side of excluding cells from quantification rather than including them.

However, since you are using 10x Chromium technology, what I'd actually recommend as a default pipeline is to use --unfiltered-pl. This allows you to pass in the unfiltered permit list (e.g. the 10x v2 or 10x v3 "whitelist") to alevin-fry. It will then quantify all cells above a nominal count threshold (default 10 reads). This will result in a count matrix with many cells, but it can then be intelligently filtered in downstream analysis using an intelligent algorithm such as DropletUtils::emptyDrops. In addition to filtering by e.g. mitochondrial gene content, this will give your downstream analysis pipeline the most information to distinguish between "high-quality" and "low-quality" cells. That label is highly correlated with read/UMI count, but it's not one-to-one.

Let me know if the above makes sense, or if you have any other questions. Also, looping in @DongzeHE as he may have thoughts as well.

Best, Rob

DongzeHE commented 1 year ago

Hi all,

I totally agree with what @rob-p said. One thing to add: If you have an expected number of cells in your mind, you can also try to run the DropletUtils::emptyDropsCellRanger by setting n.expected.cells = 10000. This function mimics the behavior of CellRanger's cell-calling strategy. Usually, this function will return more cells than DropletUtils::emptyDrops.

Best, Dongze

ljudevitluka commented 1 year ago

Dear @rob-p @DongzeHE,

thank you very much for your help and explanations!
At first, I actually tested all methods 1-4, to find the "best" way to get a list of valid cells out of my data. In the end, choosing --force-cells with setting <20 000> seemed reasonable, as it is more than what we placed in the library prep and less than what both --expect-cells and --knee-distance gave ~70000 calls.

Based on your reply, a good approach would be to use --unfiltered-pl with the 10x v3 "whitelist" settings and proceed with DropletUtils to generate a list of high-quality cells.

We are comparing 4 states: healthy (control) vs. infected vs. vaccinated+infected vs. "placebo"-vaccinated+infected; all conditions in duplicates. My intended workflow is to Quality filtering per sample > Generate a high-quality cell list per sample> Merge high-quality cells from all samples together > Normalise > Cluster > Identify specific/enriched cell types and if possible do differential expression analysis. Hopefully, this makes sense.

All of the analyses are very new to me, so yes I think my understanding was probably wrong. I was thinking that after mapping the list of cells that are generated with alevin-fry is almost complete, and afterward only mtDNA content and rRNA content are the values used for filtering.

Best,

Luka