Closed: @allyhawkins closed this issue 3 years ago.
Hi @allyhawkins, I look forward to learning what insights can be drawn from your benchmarks.
I'm curious if the speed measurements here are using alevin-fry in `--sketch` mode or with standard selective alignment, and with the `cr-like` (more simplistic) resolution or the much more heavyweight `full` resolution. For the purposes of speed, `--sketch` + `--cr-like` should be quite fast, especially as you get to moderate thread counts (>= 8 or so). Thanks!
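For reference, a `--sketch` + `--cr-like` run might look roughly like the following sketch. The index, file paths, barcode list, and thread count are all placeholders, and the exact flags may differ by alevin-fry version, so treat this as an outline rather than a definitive invocation:

```shell
# Map reads in sketch mode, producing a RAD file rather than per-cell quantifications.
salmon alevin -l ISR -i salmon_index \
  --chromiumV3 --sketch \
  -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz \
  -p 16 -o map_dir

# Generate the permit list (here using an unfiltered permit list of known barcodes).
alevin-fry generate-permit-list -d fw \
  -i map_dir -o quant_dir --unfiltered-pl 10x_v3_barcodes.txt

# Collate the RAD file by corrected barcode, then quantify with cr-like resolution.
alevin-fry collate -i quant_dir -r map_dir -t 16
alevin-fry quant -i quant_dir -m t2g.tsv --resolution cr-like -o quant_res -t 16
```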
> I'm curious if the speed measurements here are using alevin-fry in `--sketch` mode or with standard selective alignment, and with the `cr-like` (more simplistic) resolution or the much more heavyweight `full` resolution.
Thanks for the insight! Right now, we are testing with and without `--sketch` mode, using the `full` resolution mode. We are definitely seeing speed improvements compared to Alevin (and other tools) thus far, even with `full` resolution, although I will say this is only on a small subset of samples currently. As we dig in a little further, we will explore different strategies like the `cr-like` mode that you suggested and the `--unfiltered-pl` option that you mentioned earlier today. Thanks again!
Thanks for the explanations, @allyhawkins! Btw, I think that v0.2.0 of alevin-fry should eliminate the small dependence you previously saw of the memory usage on the number of reads. We are able to process the pbmc10k_v3 data using this version with a peak memory usage of ~2.5G (even in unfiltered mode).
In reference to the discussion in #63, here is an initial start to benchmarking, specifically focused on comparing computation time and memory usage across the different pre-processing tools. To access the metrics from each Nextflow run, I added the `-with-trace` option to `nextflow run`, which outputs a `trace.txt` file containing runtime, % CPU, and memory usage.

Included in this notebook, I have run 4 samples (SCPCR000003, SCPCR000006, SCPCR000126, and SCPCR000127) through all 4 tools. All 4 of these are 10Xv3 and from solid tumors. It is pretty apparent that alevin-fry uses much less memory than the others (at or below 5 GB/sample), and Kallisto is the fastest. Cell Ranger still takes a lot of time and uses more memory. Additionally, it looks like Cell Ranger may be the only tool whose runtime increases with the number of reads per sample; it is the tool with the largest variation in runtime across samples.
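As a rough illustration of how the trace metrics can be compared across tools, here is a small Python sketch that parses the duration and peak-memory columns of a `trace.txt`-style file. The column names and the excerpt below are hypothetical (the actual trace columns depend on the Nextflow configuration), and `parse_mem`/`parse_duration` are helper names I made up for this example:

```python
import csv
import io
import re

def parse_mem(s):
    """Convert a memory string like '2.5 GB' to gigabytes (float)."""
    units = {"KB": 1 / 1024**2, "MB": 1 / 1024, "GB": 1.0, "TB": 1024.0}
    value, unit = s.split()
    return float(value) * units[unit]

def parse_duration(s):
    """Convert a duration like '1h 23m 45s' to total seconds."""
    scale = {"ms": 0.001, "h": 3600, "m": 60, "s": 1}
    total = 0.0
    # Match 'ms' before 'm'/'s' so milliseconds are not mis-parsed.
    for value, unit in re.findall(r"([\d.]+)\s*(ms|h|m|s)", s):
        total += float(value) * scale[unit]
    return total

# Hypothetical excerpt of a tab-separated trace.txt, for illustration only.
trace = (
    "name\trealtime\t%cpu\tpeak_rss\n"
    "alevin_fry (SCPCR000003)\t42m 10s\t780.0%\t4.8 GB\n"
    "cellranger (SCPCR000003)\t5h 12m\t310.5%\t28.4 GB\n"
)

for row in csv.DictReader(io.StringIO(trace), delimiter="\t"):
    # Print seconds of wall time and peak memory in GB for each task.
    print(row["name"], parse_duration(row["realtime"]), parse_mem(row["peak_rss"]))
```

In practice you would point `csv.DictReader` at the real `trace.txt` and aggregate per tool, e.g. to plot runtime against the number of reads per sample.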
Next, I will add some snRNA-seq samples for comparison, as they represent about 50% of our samples. I will also add another PR to address differences in quantification across the tools when that is finished. For now, this seemed sufficient to get our discussion started.