Benchmarking Plan for ScPCA

allyhawkins commented 3 years ago

Closing #9 and #59, the previous issues on benchmarking preprocessing workflows, and opening a new issue to start discussing a concrete plan on building the tools we need for benchmarking.

In thinking about the benchmarking, it appears that the main goal is to identify a pre-processing workflow that is going to be computationally efficient and result in the most accurate alignment and assignment of UMI’s to cells. I think there are a variety of ways that we could test this to different extremes, but it looks like there are some basic QC metrics that we would want to output for each sample:

Knee plot (distribution of UMIs/cell)
Mitochondrial content of cells
UMIs/gene
genes/cell

It appears that one nice way of getting many of these key metrics is using addPerCellQC in the scater Bioconductor package. Additionally we could use miQC which is designed to run with this function to make informed decisions on how to filter poor quality cells based on both mitochondrial content and number of UMI’s per cell. This would be a good workflow to explore and then create uniform plots across the samples for easy comparisons. Some ways to make direct comparisons can be to look at the following:

Correlation of average gene expression/cell across different tools in the same sample
Plot # UMI on x-axis and #gene on y-axis for all cells and show cutoff of miQC in each of the pre-processing workflows (i.e. for each workflow how many cells are we getting that are considered “poor”quality? - see figure 2c)
Processing time and Memory usage vs. Number of reads

In addition to making these comparisons across tools, we are also interested in finding a tool that will be consistent regardless of confounding variables that are present in our dataset, i.e. library preparation, tissue type, or choice of technology. In order to address these questions, we should make sure we have some samples from each of these groups.

After going through the samples that are currently listed in the ScPCA-admin/sample_info/scpca-sample-metadata.tsv and corresponding library information in scpca-library-metadata.tsv, we have 131 runs that are from single-cell vs. 16 runs from single-nucleus RNA-seq. All 16 snRNA-seq runs are from neuroblastoma, so it appears that to address any changes between single cell vs. single nucleus we should compare NB single-cell vs. NB single-nucleus samples.

We also have a mix of technology, with 10Xv2 (6 samples), 10Xv3 (87 samples), 10Xv3.1 (20 samples), and CITE-seq (34 samples). For this comparison, I am not considering CITE-seq, but we can incorporate that in future steps. It would be good to have a mix of samples from each technology, since we will continue to be getting libraries prepped by any of these 10x kits.

The last comparison to consider would be across tissue types, in particular solid vs. liquid tumors because of differences in sample preparation prior to single-cell RNA-seq. It appears that the majority of our samples received thus far are from ALL, glioma/GBM, and NB. It would make sense to me to have a mix of those tissues to compare.

One potential plan for samples to perform benchmarking would be to use the following:

Neuroblastoma, single-nucleus, 10Xv3.1 (4)
Neuroblastoma, single-nucleus, 10Xv2 (2 - we only have 2 of these)
Neuroblastoma, single-cell, 10Xv2 (4)
T-ALL, single-cell, 10Xv3 (5)
T-ALL, single-cell, 10Xv3.1 (5)
GBM, single-cell, 10Xv3 (5)

Some follow up questions for thought on this:

Are there QC metrics that are missing? Or other ways of performing QC other than using miQC that I should be looking into?
None of these QC metrics directly look at differences in UMI assignment and UMI collapsing. It looks like all of these methods have slightly different ways of dealing with UMI collapsing. Are there things that would be useful to measure to definitively look at that or are these surface level metrics okay?
Is the distribution of samples chosen okay? This gives us 25 samples for benchmarking which is around ~15% of the samples we currently have. Should we do more/less? Is the diversity okay?

Some general things that need to be done before we can proceed with benchmarking:

add --include-introns option for cellranger for single-nuclei samples
- add premature mRNA reference for Kallisto and Alevin
- Add miQC option workflow

jaclyn-taroni commented 3 years ago

I am going to think on this more, but did want to add that we're going to need to do #52 as well if we proceed as outlined above.

cgreene commented 3 years ago

In the goals for this, it seems like this will be easy:

the main goal is to identify a pre-processing workflow that is going to be computationally efficient

and this will be quite difficult:

and result in the most accurate alignment and assignment of UMI’s to cells.

Of the bullet points that follow, I'd put the most weight on processing time and similarity to other methods. Since a key goal of the project is to provide a comparison set for everyone else's single-cell datasets (as well as bulk data), something that results in a gene expression matrix that is similar to other commonly used methods is probably going to be most valuable to the community.

I don't think a pipeline that lets low-quality cells through is bad per se, because they can be easily filtered (e.g., by miQC), which you might want to actually do before you post the dataset. If you do want an approach that has a gold standard, I think the CITE-seq data is going to be your best bet at a gold standard. That's why we designed it into the RFA.

jashapiro commented 3 years ago

Just coming back and getting restarted, but I wanted to note that I had started on some similar analysis over in https://github.com/AlexsLemonade/alsf-scpca/tree/main/analysis/quantifier-comparisons which you might look at for some inspiration/messy thoughts if you hadn't already. As far as accuracy goes (& @cgreene's comment) I think my main point there was correlation with cellranger. So my current biggest question is how much faster cellranger is, I guess. (I still have a some bias toward not cellranger for this, for all the openness reasons.)

I too will need to think more before I am really ready to weigh in strongly, but my general sense is that we probably don't need to do quite as many samples as you propose here. I would probably aim toward more like 10-12 samples, but I am happy to be overruled on that. Part of me wants to trust 10x that v3 and v3.1 are comparable for analysis, and start by treating them as such.

More thoughts later...

allyhawkins commented 3 years ago

All great points, especially the point about ease of comparison to other datasets. If the goal is to be able to have this data as a resource for others to use in their own research, then having something that is either analyzed in the same way that the majority of people are using, or easily comparable to what they are using, then high correlation with cellranger would be an important metric.

Also @jashapiro you are right, it looks like the differences are extremely minor between 10Xv3 and 10Xv3.1 so we could revise the comparisons to only consider 10Xv2 vs. 10Xv3/3.1. We could then remove the T-ALL samples that are 10Xv3 to get this list:

Neuroblastoma, single-nucleus, 10Xv3.1 (4)
Neuroblastoma, single-nucleus, 10Xv2 (2)
Neuroblastoma, single-cell, 10Xv2 (4)
T-ALL, single-cell, 10Xv3.1 (5)
GBM, single-cell, 10Xv3 (5)

Although we could easily get down to the 10-12 samples that Josh was suggesting by either eliminating the GBM samples and the 10Xv2 snRNA-seq samples, or lowering the number of samples per group across the board.

We can also add in some of the CITE-seq samples here. I do believe that the T-ALL samples have matching CITE-seq data, I just haven't put much thought into processing those and what extra steps we would need, if any, to changing the workflows.

jaclyn-taroni commented 3 years ago

I am going to echo (agree with!) some of what has been posted so far.

CellRanger is widely used and we want what we provide to be as comparable as possible to other datasets out there. So the most salient questions become:

How different are the other methods in terms of efficiency? (I.e., is Salmon Alevin still much more efficient than CellRanger 6.0.0?)
How does the output of other methods compare to CellRanger? As @jashapiro mentioned, initial analyses focused on correlation. One idea additional idea that we didn't look into much is about the detected features (genes) that are different between the methods (inspired by this preprint).

I agree that we can probably filter out low-quality cells after quantification and will probably want to do that before making the summarized data available.

More general comment - we can always add more samples, but we can't do fewer once we've already invested time into running them.

jaclyn-taroni commented 3 years ago

Wanted to comment specifically on the discussion of CITE-seq here.

If you do want an approach that has a gold standard, I think the CITE-seq data is going to be your best bet at a gold standard. That's why we designed it into the RFA.

We can also add in some of the CITE-seq samples here. I do believe that the T-ALL samples have matching CITE-seq data, I just haven't put much thought into processing those and what extra steps we would need, if any, to changing the workflows.

After spending a bit of time with Hao et al. & assorted background, I am less convinced CITE-seq is a gold standard and more on the side of it depends on the expected cell types and the markers that were used. (I don't think I am expressing disagreement necessarily!) This is to say that any benchmarking of this nature should account for the fact that there may be some cell populations that are better captured by/reflected in the transcriptomic data. As far as an action item, I think first thing would be to put together a workflow (#50).

allyhawkins commented 3 years ago

Based on this and some discussion in data science team meeting today, I wanted to update this to include the current plan of action.

We will start out by running the benchmarking on a few samples to start on the workflows that are ready to go for benchmarking:

Cellranger 6.0
Kallisto
Alevin

Alevin-fry with/without sketch mode (#60 and #64) will be added to the list as soon as the workflow is finalized and ready. For these runs we will make sure that we save the output report with nextflow in order to record RAM usage and time. The main metric that we will use is similarity of output with cellranger and compute time/cost - this will allow us to determine if these methods are comparable to cellranger with a smaller footprint and worthwhile pursuing.

To start, I was thinking of using a few of the single-cell 10Xv3/v3.1 (unless there are objections or other suggestions on this?):

T-ALL, single-cell, 10Xv3.1 (2)
GBM, single-cell, 10Xv3 (2)

In the meantime, the following issues will be addressed to incorporate single-nucleus RNA-seq, differing technologies, and CITE-seq.

pre-mRNA index - AlexsLemonade/alsf-scpca#54
incorporate flexibility for 10Xv2 - AlexsLemonade/alsf-scpca#52
add in CITE-seq workflow - AlexsLemonade/alsf-scpca#50
include-introns option for cellranger

We will implement miQC and any post pre-processing filtering after benchmarking is complete.

allyhawkins commented 3 years ago

To start, I was thinking of using a few of the single-cell 10Xv3/v3.1 (unless there are objections or other suggestions on this?):

T-ALL, single-cell, 10Xv3.1 (2)

GBM, single-cell, 10Xv3 (2)

Just want to follow up with the corresponding sample Id's so we have it for reference of which one of these will be used for the first benchmarking with cellranger v6, kallisto, alevin, and alevin-fry with/without --sketch mode.

rob-p commented 3 years ago

Hi all,

I'll just note that we've recently released v0.2.0 of alevin-fry, and it's been propagated through bioconda as well. In addition to general improvements to the runtime profile (vastly improved memory usage at very high thread count), this release also adds initial (beta) support for quantification via an "unfiltered" permit list using the -u option to the generate-permit-list command; this functionality was previously unsupported. This flag will allow one to properly pass e.g. the 10x permit list to the generate-permit-list so that one can then quantify all cells, rather than just those selected by the knee threshold. There is also --min-reads command that determines the minimum number of reads (not UMIs, as this is pre-quantification) a cell barcode needs to have to be considered present. The default is 10, but you can set it at 1 if you really want to quantify everything. Let me know if you need any help or support running alevin-fry on the testing data, or if you have any questions about the different resolution options (e.g. https://github.com/AlexsLemonade/alsf-scpca/pull/60#discussion_r611078240).

Best, Rob

jashapiro commented 3 years ago

Hi @rob-p, thanks for the update! The --unfiltered-pl option sounds useful, especially for comparisons with other tools that behave more like that.

allyhawkins commented 3 years ago

Thanks @rob-p for the helpful information!

Another item that recently came up in #79, is the ability of Alevin and Alevin-fry to work with 5' 10X data. On the docs for alevin, there is a statement that Alevin can be used with 3' data, but no mention of being able to use for 5'. But @jashapiro found this https://github.com/COMBINE-lab/salmon/issues/439#issuecomment-547208043 stating that there is an option to use 5' data with alevin, although it appears there may be an effect on the mapping? Could you provide any guidance on what you think would be the best approach to quantify 5' data with alevin and how that would effect quality of the counts?

rob-p commented 3 years ago

Hi @allyhawkins,

Great question. Alevin and alevin-fry can work with 5' data, by specifying the appropriate barcode geometry. The only current caveat is that if you are using a 5' protocol where both reads have alignable sequence, only the non-barcode read will be used for transcriptomic mapping (i.e. the non barcode read will be aligned and the non barcode/umi part of the 5' read will be discarded). This can potentially reduce specificity compared to if the read was aligned as a paired end read. This limitation is temporary however, mostly just that we haven't gotten around to implementing it yet, and we should have the ability to use both reads for alignment as paired end in supporting protocols soon. I'll let you know when that feature lands, but for now it should be ok just to specify the appropriate barcode geometry.

Best, Rob

AlexsLemonade / alsf-scpca

Benchmarking Plan for ScPCA #63