Preliminary benchmarking of Memory/Time footprint for pre-processing tools

allyhawkins commented 3 years ago

In reference to the discussion in #63, here is a first pass at benchmarking, focused specifically on comparing computation time and memory usage across the different pre-processing tools. To collect the metrics from each Nextflow run, I added the -with-trace option to nextflow run, which outputs a trace.txt file containing runtime, % CPU, and memory usage.
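As a rough sketch of how the trace metrics could be loaded for comparison, the snippet below parses a tab-separated trace excerpt into numeric values. It assumes the columns `name`, `status`, `realtime`, `%cpu`, and `peak_rss` are present, that durations look like `1h 2m 3s`, and that memory values look like `4.5 GB` with 1024-based units. The rows in `SAMPLE_TRACE` and the helper names are invented placeholders for illustration, not results or code from this PR.

```python
import csv
import io
import re

# Invented placeholder rows in the shape of a Nextflow trace file (NOT real results).
SAMPLE_TRACE = (
    "task_id\tname\tstatus\trealtime\t%cpu\tpeak_rss\n"
    "1\ttool_a (sample_1)\tCOMPLETED\t1h 2m 3s\t350.5%\t4.5 GB\n"
    "2\ttool_b (sample_1)\tCOMPLETED\t12m 30s\t790.0%\t31.2 GB\n"
    "3\ttool_b (sample_2)\tFAILED\t-\t-\t-\n"
)

TIME_UNITS = {"d": 86400.0, "h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}
MEM_UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def parse_duration(text):
    """Convert a string such as '1h 2m 3s' or '450ms' to seconds (None if absent)."""
    # 'ms' is listed before 'm' so that milliseconds are not read as minutes.
    parts = re.findall(r"([\d.]+)\s*(ms|d|h|m|s)", text)
    if not parts:
        return None
    return sum(float(value) * TIME_UNITS[unit] for value, unit in parts)


def parse_memory_gb(text):
    """Convert a string such as '4.5 GB' to gigabytes (None if absent)."""
    match = re.fullmatch(r"([\d.]+)\s*([KMGT]?B)", text.strip())
    if match is None:
        return None
    value, unit = match.groups()
    return float(value) * MEM_UNITS[unit] / MEM_UNITS["GB"]


def parse_percent(text):
    """Convert a string such as '350.5%' to a float (None if absent)."""
    stripped = text.strip().rstrip("%")
    return float(stripped) if stripped and stripped != "-" else None


def read_trace(handle):
    """Read a tab-separated trace file into a list of dicts with numeric metrics."""
    records = []
    for row in csv.DictReader(handle, delimiter="\t"):
        records.append(
            {
                "name": row["name"],
                "status": row["status"],
                "realtime_s": parse_duration(row["realtime"]),
                "cpu_pct": parse_percent(row["%cpu"]),
                "peak_rss_gb": parse_memory_gb(row["peak_rss"]),
            }
        )
    return records


records = read_trace(io.StringIO(SAMPLE_TRACE))
for record in records:
    print(record)
```

With a real run, `io.StringIO(SAMPLE_TRACE)` would be replaced by an open handle on the trace file; tasks that did not complete are kept with `None` metrics so they can be filtered explicitly.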

In this notebook, I ran 4 samples (SCPCR000003, SCPCR000006, SCPCR000126, and SCPCR000127) through all 4 tools. All 4 samples are 10Xv3 and from solid tumors. It is pretty apparent that alevin-fry uses much less memory than the others (at most 5 GB per sample), and Kallisto is the fastest. Cellranger still takes a lot of time and uses more memory. Additionally, cellranger appears to be the only tool whose runtime may increase with the number of reads per sample. It is also the tool with the largest variation in runtime across samples.
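One way to put a number on "variation in runtime across samples" is the coefficient of variation (sample standard deviation divided by the mean) per tool. The sketch below is only an illustration: the tool names and runtimes are made-up placeholders, not the values measured in this notebook.

```python
from statistics import mean, stdev

# Placeholder runtimes in minutes per tool across four samples (NOT measured values).
runtimes_min = {
    "tool_a": [10.0, 10.0, 10.0, 10.0],
    "tool_b": [20.0, 40.0, 60.0, 80.0],
}


def coefficient_of_variation(values):
    """Return the sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)


# Compute the CV for every tool and find the one whose runtime varies most.
cv_by_tool = {tool: coefficient_of_variation(v) for tool, v in runtimes_min.items()}
most_variable = max(cv_by_tool, key=cv_by_tool.get)

for tool, cv in sorted(cv_by_tool.items()):
    print(f"{tool}: CV = {cv:.2f}")
print("largest runtime variation:", most_variable)
```

Because the CV is normalized by the mean, it allows a fast tool and a slow tool to be compared on the same scale, which a raw standard deviation would not.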

Next, I will add some snRNA-seq samples for comparison, as they represent about 50% of our samples. I will also open another PR to address differences in quantification across the tools once that analysis is finished. For now, this seemed sufficient to get our discussion started.

rob-p commented 3 years ago

Hi @allyhawkins, I look forward to learning what insights can be drawn from your benchmarks. I'm curious if the speed measurements here are using alevin-fry in --sketch mode or with standard selective alignment, and with the cr-like (more simplistic) resolution, or the much more heavyweight full resolution. For the purposes of speed, --sketch + --cr-like should be quite fast, especially as you get to moderate thread counts (>=8 or so). Thanks!

allyhawkins commented 3 years ago

> I'm curious if the speed measurements here are using alevin-fry in --sketch mode or with standard selective alignment, and with the cr-like (more simplistic) resolution, or the much more heavyweight full resolution.

Thanks for the insight! Right now, we are testing with and without --sketch mode using the full resolution mode. We are definitely seeing speed improvements, even with the full resolution, compared to Alevin (and other tools) thus far, although this is only on a small subset of samples currently. As we dig in a little further, we will explore different strategies, like the cr-like mode that you suggested and the --unfiltered-pl option that you mentioned earlier today. Thanks again!

rob-p commented 3 years ago

Thanks for the explanations, @allyhawkins! Btw, I think that v0.2.0 of alevin-fry should eliminate the small dependence you previously saw of the memory usage on the number of reads. We are able to process the pbmc10k_v3 data using this version with a peak memory usage of ~2.5G (even in unfiltered mode).

AlexsLemonade / alsf-scpca

Preliminary benchmarking of Memory/Time footprint for pre-processing tools #80