Benchmarking comparison of Alevin-fry configurations to cellranger

Closes #92. In this PR, I'm adding a notebook to compare the different configurations of alevin-fry to Cellranger with both single-cell and single-nuclei samples. Specifically, we were interested in answering the following questions about use of Alevin-fry:

Should we be using the splici index for single-cell samples rather than the cDNA index based on recent data showing a false increase in gene expression in pseudoaligners - https://combine-lab.github.io/alevin-fry-tutorials/2021/improving-txome-specificity/
Should we be using the cr-like or the full resolution - Use of the splici index with cr-like warrants use of the USA mode by alevin-fry which allows counts from only spliced reads and ambiguous reads to be counted and excludes and reads that are intronic.
Is use of pseudoalignment (sketch) still a viable option as we consider these new conditions?

I have tested all of these alevin-fry conditions on 3 single-cell samples and 4 single-nuclei samples. I then created three dataframes with combined colData from each of these analysis, combined rowData from each of these analysis, and associated information about the configurations for each run. These dataframes were created in #99 and used to generate the plots in this PR to look at the following metrics:

mitochondrial content
UMI/cell
genes detected/cell
correlation of UMI/cell
correlation of mean gene expression

The analysis done here is very similar to the previous benchmarking since we have added new conditions, but if there is interest in looking at additional metrics than we can add those as well. In general, alevin-fry performs very similarly to cellranger in all of the conditions, with the following observations:

Alevin-fry tends to detect more cells than cellranger, and in those additional cells the counts tend to be lower, this is probably a filtering problem and less of an Alevin-fry problem
the full resolution shows higher UMIs/cell and genes detected/cell than both the cr-like and cellranger - I'm curious if others have thoughts on choosing a resolution type
splici performs similarly to the cDNA index in single-cell samples, but it does come at a slightly higher cost than the cDNA

This PR is also stacked on #99 and that will need to be merged first.

Hi @allyhawkins,

These are great! I really do enjoy seeing the thorough analyses that you are doing across these samples. I will dig through this more deeply later and give any other thoughts, but one thing I just wanted to point out, since I wasn't sure of your index construction procedure itself, is that we find using the sparse index reduces the index size with little to no cost in mapping time. It has no bearing on the mappings themselves (dense and sparse indices give identical mapping results). To build the sparse index you simply pass the --sparse flag to salmon when constructing the index.

The other point worth raising is that I'm not certain what tool versions are used here (since we've been bumping alevin-fry quite quickly while working on the manuscript), but in the pre-print we used salmon v1.5.1 and alevin-fry v0.4.0. Compared to older versions these do have a few small tweaks that were designed to slightly improve accuracy by increasing the counts for a handful of genes (under the cr-like rule, if UMIs map to > 1 gene with the same read count, they are discarded); these new versions added a few heuristics to minimize those scenarios where feasible. Also, --cr-like-em will probabilistically allocate (rather than toss) these "ties"; though I'm not sure if you want to add yet another variable into the mix.

Finally, one question I had was how did you use the --full resolution with the splici index? Alevin-fry should check the resolution mode being passed, and should only allow either --cr-like or --cr-like-em when the splici index is being used (i.e. when a 3-column tsv is passed to the -m parameter).

Best, Rob

Hi @rob-p, thanks for your feedback and help as always! To answer your question about the version, this is with samples that were run using alevin-fry 0.3.0 and salmon 1.4.0, so as we move forward with alevin-fry we will update the workflows to use the newer versions.

Thanks for the tip about the index construction as ours were not generated using the --sparse construction. In general, the increase with splici is still much smaller than the memory and time usage to use cellranger though, especially with the single nuclei samples.

As for your question about using the splici index with the full resolution. I am referring to using the splici index as any configuration that used the index that was built with both the reads from spliced cDNA and the intronic reads included. We have generated both the 2 column and 3 column tx2gene.tsv file from this index, so in the case of using the full resolution, we are passing the 2 column file through. For the cr-like, we pass the 3 column one through. What that also means is that the outputs are slightly different and if we have used the full resolution with splici, then we grab all of the rows in the counts matrix corresponding to intronic reads and collapse those counts with the corresponding spliced cDNA from that gene to make sure the matrix is summarized to gene level. It's a similar approach to how you guys are doing USA mode where you collapse the reads for S + A reads. Here, we are either removing intronic reads from the counts matrix (in the case of single-cell) or combining reads from intronic reads with the spliced cDNA reads (in the case of single-nuclei).

Based on reading through the alevin-fry paper, it appears to me that there is a strong argument for using the cr-like or cr-like-em approach rather than the full resolution as we were doing. In our hands, it appears that the cr-like is producing more similar UMIs/cell and genes/cell as cellranger, which makes sense as it has the same approach of throwing out multi-mapped reads like cellranger. I'm curious if you tested whether or not use of the full resolution or any of the other resolution strategies had an effect on detection of the false gene expression, similar to what is identified in the absence of using the splici index? I know the full resolution uses EM to resolve multi-mapped reads, probably explaining the increased numbers, but that doesn't necessarily seem like a bad thing to me. Is there any evidence for why increased UMIs and genes with the other resolution strategies other than cr-like could be a bad thing? I'm just trying to get a better sense of the resolution strategies and why we might want to pick one over the other.

Hi @allyhawkins,

Thanks for your detailed response! My thoughts are below:

Hi @rob-p, thanks for your feedback and help as always! To answer your question about the version, this is with samples that were run using alevin-fry 0.3.0 and salmon 1.4.0, so as we move forward with alevin-fry we will update the workflows to use the newer versions.

Awesome! I realize that things on the alevin-fry side in particular have been changing quickly as we converged on getting the pre-print out. While I anticipate we will still continue to add/improve things, I think there should be some more stability now.

Thanks for the tip about the index construction as ours were not generated using the --sparse construction. In general, the increase with splici is still much smaller than the memory and time usage to use cellranger though, especially with the single nuclei samples.

Right; even with the dense index, alevin-fry should be pretty memory frugal. In all of our testing everything could be done comfortably for all of the datasets within 16G of RAM. However, with the sparse index, all data in the paper could be processed in <8G of RAM, and fraction of them even under 4GB. Interestingly, this is without any consistent slowdown in runtime. Of course, if the dense index serves your needs sufficiently, then this capability is just a bonus.

As for your question about using the splici index with the full resolution. I am referring to using the splici index as any configuration that used the index that was built with both the reads from spliced cDNA and the intronic reads included. We have generated both the 2 column and 3 column tx2gene.tsv file from this index, so in the case of using the full resolution, we are passing the 2 column file through. For the cr-like, we pass the 3 column one through. What that also means is that the outputs are slightly different and if we have used the full resolution with splici, then we grab all of the rows in the counts matrix corresponding to intronic reads and collapse those counts with the corresponding spliced cDNA from that gene to make sure the matrix is summarized to gene level. It's a similar approach to how you guys are doing USA mode where you collapse the reads for S + A reads. Here, we are either removing intronic reads from the counts matrix (in the case of single-cell) or combining reads from intronic reads with the spliced cDNA reads (in the case of single-nuclei).

Ahh, that makes sense; you are doing post-processing of the quantification matrix to extract the intended values. I think what you are doing here makes sense, though we haven't tested this type of "post-quantification" processing of the values derived under these different resolution strategies with the splici index. More on this below.

Based on reading through the alevin-fry paper, it appears to me that there is a strong argument for using the cr-like or cr-like-em approach rather than the full resolution as we were doing. In our hands, it appears that the cr-like is producing more similar UMIs/cell and genes/cell as cellranger, which makes sense as it has the same approach of throwing out multi-mapped reads like cellranger.

Right, so there are certainly technical differences between all of our resolution methods and exactly what cell ranger does (of course, we also have some points in the paper that suggest that the cell ranger resolution method may not always be ideal). The cr-like method will discard gene multimappers (if the fragment frequencies are tied), but the cr-like-em method will actually resolve these gene multimapping reads probabilistically. The main difference then between something like full and cr-like-em are largely in the exact procedure used to determine how UMIs should be allocated in the case when fragments with the same UMI map between genes or when fragments map among distinct transcripts within a gene.

I'm curious if you tested whether or not use of the full resolution or any of the other resolution strategies had an effect on detection of the false gene expression, similar to what is identified in the absence of using the splici index? I know the full resolution uses EM to resolve multi-mapped reads, probably explaining the increased numbers, but that doesn't necessarily seem like a bad thing to me. Is there any evidence for why increased UMIs and genes with the other resolution strategies other than cr-like could be a bad thing? I'm just trying to get a better sense of the resolution strategies and why we might want to pick one over the other.

So these are great questions. We are working on expanding the full suite of resolution methods to work natively with splici, so that if you want to use those methods you should be able to do that without having to do the post processing yourself. To see how much of the difference in observed UMI count is due to discarding multimappers, you can test against cr-like-em. This almost universally (as expected) produces slightly higher counts than cr-like, though we haven't extensively compared this to full on the splici index, as that is not natively implemented yet.

Finally, while I don't have a lot of data to offer on this, but I strongly doubt that the increased count you see in full is due to the types of false positives that result from not using the splici index. Generally, those false positives arise from spuriously attributing intron-derived reads to spliced transcripts (either of the same or of a different overlapping gene), and the UMI resolution method itself shouldn't move the needle too much on that. In this case the difference is likely due more to a combination of (a) retaining gene multi-mapping reads where cr-like and CellRanger are discarding them and (b) splitting some UMIs within a gene when they cannot be uniformly attributed to a consistent set of compatible isoforms. I think throwing cr-like-em in the mix may give some idea of the relative contribution of both of these causes.

Let me know if you have any other questions; I'm happy to chat more about any of this!

Best, Rob

This was really helpful, thanks for the clear explanation @rob-p!

In this case the difference is likely due more to a combination of (a) retaining gene multi-mapping reads where cr-like and CellRanger are discarding them and (b) splitting some UMIs within a gene when they cannot be uniformly attributed to a consistent set of compatible isoforms. I think throwing cr-like-em in the mix may give some idea of the relative contribution of both of these causes.

This makes complete sense to me, and my naive interpretation of our results would show me that we are increasing our read count/cell by allowing those multi-mapped reads to get resolved rather than throwing them out. I think the question we will have to answer on our end is whether or not we want to use a resolution that throws out those multi-mapped reads or not which is to be determined. Thank you again for your help on this!

Thanks for the feedback @jashapiro! I went ahead and simplified some things, but also modified the plots just slightly. I realized I had forgotten to address the differences between using the pseudoalignment and selective alignment. Although in the single-cell samples the difference seems minor, there does seem to be a difference in the single-nuclei samples so I included it in the visualization of each of the plots so we can see how each of the variables we are testing impacts the metrics.

I would agree that alevin-fry is the way to go and I think a quick update of the versions and test of the cr-like-em would be a good move. I'm curious about your thoughts on the alignment modes.

AlexsLemonade / alsf-scpca

Benchmarking comparison of Alevin-fry configurations to cellranger #101