Preliminary benchmarking of counts output for scRNA-seq pre-processing tools

AlexsLemonade / alsf-scpca

Management and analysis tools for ALSF Single-cell Pediatric Cancer Atlas data.

BSD 3-Clause "New" or "Revised" License

0 stars 1 forks source link

Preliminary benchmarking of counts output for scRNA-seq pre-processing tools #82

Closed allyhawkins closed 3 years ago

allyhawkins commented 3 years ago

Related to #63, this is a start to some comparisons of quantification output of 4 pre-processing tools, Cellranger 6.0, Alevin, Alevin-fry (with/without --sketch), and Kallisto. I used four 10Xv3 samples that have been processed using all 5 tools.

I updated the 01-import-quant-data.Rmd notebook to include importing output data from all of these tools and for all 4 of these samples. That notebook is used to create a single cell experiments for all samples which are the input for the pre-processing-benchmarking-metrics.Rmd.

I started with some basic comparisons including number of cells detected and shared between tools, number of UMI/cell, genes/cell, and correlation of these metrics for each tool and cellranger.

One thing that I think would be really nice that is missing from this is creation of a knee plot, but we would need to output information about every cell from each tool, not just those that each tool considers "true" cells. I'm not sure if that's entirely necessary, but I do think it would be nice to have.

Additionally, one of the samples, SCPCR000003, is a very poor sample and appears to be mostly dead, so I chose to remove that sample for the majority of the comparisons.

As we move forward more samples will need to be added, since we have samples from snRNA-seq and different 10X technologies.

Some questions to reviewers:

Are there metrics that you think are missing that you would like to see added to this comparison?
Do you have any suggestions for different ways to visualize these metrics that might help us better compare these tools?

allyhawkins commented 3 years ago

@jashapiro Thank you for all the great feedback! I went ahead and removed all of the summary data and now have the plots separated to show each sample across each tool and when applicable if there are different indices that are used.

We have quite a few samples that have been run at this point with various configurations that I have included into one updated notebook that looks at only the cell level metrics. I am going to split it up to look at gene level metrics in another notebook and PR (per your suggestion, and the notebook was getting to be very long and overwhelming).

For your reference, there are 4 scRNA-seq samples that have been used to this point: SCPCR000003, SCPCR000006, SCPCR000126, and SCPCR000127 All of the scRNA-seq samples have gone through the following tool/index combinations:

Cellranger mRNA only: labeled as cellranger_cdnapre_mRNA
Cellranger pre-mRNA: labeled as cellranger_cdna
Alevin with the ensembl generated index + full decoy: labeled as alevin
Kallisto with the ensembl generated index: labeled as kallisto
Alevin-fry with the ensembl generated index: labeled as alevin-fry
Alevin-fry with the ensembl generated index in --sketch mode: labeled as alevin-fry-sketch
Alevin-fry with the spliced index in --unfiltered mode: labeled as alevin-fry-unfiltered
Alevin-fry with the spliced index in --unfiltered mode and --sketch mode: labeled as alevin-fry-unfiltered-sketch

We also now have 2 snRNA-seq samples that have been added: SCPCR000118 and SCPCR000119. They have been run through the following tool/index combinations:

Cellranger mRNA only: labeled as cellranger_cdnapre_mRNA
Cellranger pre-mRNA: labeled as cellranger_cdna
Alevin with the spliced index + full decoy: labeled as alevin*
Kallisto with the spliced index: labeled as kallisto*
Alevin-fry with the spliced index: labeled as alevin-fry*
Alevin-fry with the spliced index in --sketch mode: labeled as alevin-fry-sketch*
Alevin-fry with the spliced index in --unfiltered mode: labeled as alevin-fry-unfiltered*
Alevin-fry with the spliced index in --unfiltered mode and --sketch mode: labeled as alevin-fry-unfiltered-sketch Those indicated with a were also run with the spliced_intron index.

In general, I am a bit concerned with the number of cells that alevin-fry (regardless of mode) is detecting. It doesn't quite seem feasible for that to be happening. I'm going to go back and look at the workflow to make sure there aren't any issues with how we are implementing the --unfiltered workflow, but seeing ~50-60,000 cells in a sample doesn't seem ideal.

The other striking finding is in the single nuclei samples. In general, the counts are much lower per cell, as expected, but interestingly it looks like kallisto and cellranger are consistently performing better than the other options. Also, using the pre-mRNA index definitely looks beneficial. Although kallisto does have an increase in memory usage with snRNA-seq.

Questions I have for reviewers:

Any ideas on why we could be seeing such a large increase in cell numbers deteced in alevin-fry --unfiltered?
Should we try other snRNA-seq samples to see if this same trend of low coverage is consistent?
Any thoughts on other metrics that would be good to include here?

jashapiro commented 3 years ago

Any ideas on why we could be seeing such a large increase in cell numbers deteced in alevin-fry --unfiltered?

Yes, I have an idea! Caveat: I haven't actually looked at anything.

I think alevin-fry is probably now behaving like kallisto does, and returning everything that matches a known barcode. --unfiltered is really unfiltered. (What I think I would like to see is them combine the barcode list and knee filtering, so that they don't create cells that don't match the canonical barcodes, but I can see why they might not have done that yet. I may request it...)

In the mean time, we can probably pull a bit of the code I used to preprocess kallisto and filter with DropletUtils::barcodeRanks or similar.

rob-p commented 3 years ago

@jashapiro --- you hit the nail directly on the head. The algorithm used in the unfiltered mode is described here. Basically, it's the following. Any barcode that appears in the provided 10x "whitelist" (I will henceforth use the term permit list instead) will constitute the "existence" or "presence" of a corresponding cell. Once all of the "present" 10x permit list barcodes (there are ~6.7M possible barcodes in the 10x v3 permit list) are found, the reads with barcodes that did not exactly match are checked against this list to see if they can be corrected with a single edit to any of the "present" barcodes. If they can be corrected uniquely to a single present barcode, the read is rescued, otherwise, if there is no "present" barcode within 1-edit (or if there are >= 2 barcodes within 1-edit), the read is discarded. By default, the only filtering applied is that a barcode must have 10 reads assigned to it in the first pass to be considered present (thought this, too, can be controlled by the --min-reads parameter, and could be set to 1 to be most permissive or higher if you really don't want to consider any barcodes with fewer reads).

To be clear, as is the case with other "unfiltered" output, this is not a prediction of all of the "high-quality" cells. This is a quantification of everything present (by means of intersection with the 10x permit list of possible barcodes). So, to determine the "high quality" cells --- i.e. the barcodes that are likely to have corresponded to live cells and not e.g. empty droplets, the quantified output in the --unfiltered mode should be combined with another method for estimating the true number of cells in the sample / experiment.

I'm happy to discuss further if you have any follow-up questions or suggestions here.

rob-p commented 3 years ago

@allyhawkins, there are also two other things that I think are likely worth considering here (after deciding how to handle the filtering of the --unfiltered output ... since this is certainly leading to the quantification of a large number of "present" barcodes that do no correspond to true cells, which is also likely to drive up the mito. reads you're seeing in that pipeline).

The first is that considering the number of detected UMIs / cell is a bit of a problematic metric when the tools are not doing alignment filtering in the same way. Specifically, 10x has noted that in the single-nucleus data, there tend to be a high number of detected reads aligning in the antisense orientation (upwards of 30%). These are being filtered out by orientation (at least under the current settings) all of the alevin tools and in cell ranger too I believe; so looking at UMIs/cell is a bit apples-to-oranges. If you want to see what the alevin-fry counts look like when including these reads, you can just pass --d both in the generate-permit-list step.

The second thought is a more general one, buy may be particularly poignant in the single-nucleus data and when looking at the number of UMIs/cell. That is, I think it's worth throwing the cr-like resolution method into the mix. Because of the simpler UMI deduplication approach, it will also yield more UMIs/cell, and seeing how that compares to the other results may provide a better perspective on the UMIs/cell metric.

jashapiro commented 3 years ago

A couple more quick comments on my first pass through:

It looks like there is only one sample where we have both the txome and the spliced txome files generated by us? Is that right? Either way, I think it would make sense to collapse all of the index-related categories to just cDNA (however generated) vs unspliced. Then all tools will be in the same panels, rather than spread out in ways that make it harder to compare.
I'm not sure how the spliced_intron results are being read in: are the spliced cDNA and intron counts being combined into single genes? Does cellranger keep them separate as well? If so, how is that affecting results?

allyhawkins commented 3 years ago

Thank you everyone for all the comments and feedback. I can't believe I didn't even think about the fact that we were now allowing any and every identified cell through which makes complete sense as to why we are seeing that increase in cell numbers. And also why we probably see a larger shift in UMI/cell and gene/cell distributions when looking at only shared cells in the alevin-fry workflow. I will go back and add in a pre-filtering step to this analysis, similar to what Josh had previously implemented for Kallisto and start there.

It looks like there is only one sample where we have both the txome and the spliced txome files generated by us? Is that right? Either way, I think it would make sense to collapse all of the index-related categories to just cDNA (however generated) vs unspliced. Then all tools will be in the same panels, rather than spread out in ways that make it harder to compare.

Yes, that's correct. I wasn't sure if we should separate them out or not based on spliced vs. txome files, but I will go ahead and do some more data cleaning and collapse them to be cDNA vs. pre-mRNA index (unspliced). I'm also going to remove the one run of txome from the sample that does have both txome and spliced txome (119) for now.

I'm not sure how the spliced_intron results are being read in: are the spliced cDNA and intron counts being combined into single genes? Does cellranger keep them separate as well? If so, how is that affecting results?

Let me go back and actually look at this, I am now remembering us talking about having to combine the counts for spliced and unspliced genes for Alevin and Kallisto based on our discussion here https://github.com/AlexsLemonade/alsf-scpca/pull/68#issuecomment-816267378. I will go through this to make sure this is accurate, because the counts do seem very low.

allyhawkins commented 3 years ago

I'm not sure how the spliced_intron results are being read in: are the spliced cDNA and intron counts being combined into single genes? Does cellranger keep them separate as well? If so, how is that affecting results?

I've now included a step that collapses the counts matrix by gene for the snRNA-seq counts for alevin, alevin-fry, and kallisto so that it is comparable to cellranger.

I also incorporated a cell filtering step for alevin-fry --unfiltered-pl which now filters out any barcodes associated with empty droplets. Using just DropletUtils::barcodeRanks still left samples with ~50,000 cells, so per the suggestion on the vignette for DropletUtils, I used DropletUtils::emptyDrops to identify the probability of drops being empty and filtered those out. After doing this, the alevin-fry --unfiltered-pl runs look much more comparable to the other tools.

jashapiro commented 3 years ago

Using just DropletUtils::barcodeRanks still left samples with ~50,000 cells, so per the suggestion on the vignette for DropletUtils, I used DropletUtils::emptyDrops to identify the probability of drops being empty and filtered those out.

How long did this take? I think I had avoided it because it seemed relatively slow, though I do like the idea behind it. Speed probably doesn't matter in the grand scheme of things, especially if we implement it as part of a workflow.

allyhawkins commented 3 years ago

How long did this take? I think I had avoided it because it seemed relatively slow, though I do like the idea behind it. Speed probably doesn't matter in the grand scheme of things, especially if we implement it as part of a workflow.

It is definitely longer to use emptyDrops than barcodeRanks. Doing a time test with one of the snRNA-seq counts matrix, the elapsed time on the first line is for barcodeRanks, while the second line is for emptyDrops. It is orders of magnitude greater, but will only add a little over 1 minute to processing time for each sample.

allyhawkins commented 3 years ago

The one thing I do think is worth a change is the fact that kallisto uses a different filter than alevin-fry-unfiltered. It would only seem fair to give it the same emptyDrops filtering that we are doing for alevin-fry.

This makes complete sense. I now changed it so that both alevin and kallisto are filtered using emptyDrops, which solves the problem of the one sample having ~ 40,000 cells in Kallisto.

I also went ahead and actually did get rid of some of the repetitive code, in case future me finds it beneficial. I added in a function that imports quants data from a data directory and a list of quant ids, collapses any counts matrices aligned to the spliced_intron fasta, and then outputs a sce object that can be used for cell filtering using emptyDrops depending on the tool used.

rob-p commented 3 years ago

Wow @jashapiro : what is up with that sample. Any idea what the unfiltered number of cells looks like, and what the overlap of the filtered set is? The unfiltered barcode collection algorithm is pretty simple, so I'd be curious if this may be something sneaking in during another step (of course, we'd be happy to take a look as well).

jashapiro commented 3 years ago

Wow @jashapiro : what is up with that sample. Any idea what the unfiltered number of cells looks like, and what the overlap of the filtered set is? The unfiltered barcode collection algorithm is pretty simple, so I'd be curious if this may be something sneaking in during another step (of course, we'd be happy to take a look as well).

No idea! And absolutely could be (likely?) something at our end as well! We will probably go back to investigate a bit more, but it is not today's problem. No correlation I could have understood as a random permutation, but a negative correlation? The pattern is strange too (plot from @allyhawkins).

The good part is we don't really care about cell ids at the moment, but we definitely will when we get to integrating some CITE-seq data.

rob-p commented 3 years ago

Looks like we're directly visualizing the X chromosome 😆 . But more seriously, if, after further investigation, it does look like something strange on our end, please do keep us posted.

allyhawkins commented 3 years ago

Any idea what the unfiltered number of cells looks like, and what the overlap of the filtered set is? The unfiltered barcode collection algorithm is pretty simple, so I'd be curious if this may be something sneaking in during another step (of course, we'd be happy to take a look as well).

@rob-p So before doing any filtering, we get around 69,000 cells identified for this sample. After filtering this sample using DropletUtils::emptyDrops we get ~2,000 cells, and 1,500 of those cell barcodes are also identified in this sample when running through cellranger. We will definitely dig around a bit more and see if we can come up with anything and keep you posted.