10XGenomics / cellranger

10x Genomics Single Cell Analysis
https://www.10xgenomics.com/support/software/cell-ranger
Other
348 stars 92 forks source link

Filter first or count first? #85

Closed andreasg123 closed 3 years ago

andreasg123 commented 4 years ago

While looking through your company's open source offerings, I came across this statement: https://github.com/10XGenomics/cellranger/blob/master/mro/stages/counter/check_barcodes_compatibility/__init__.py#L119

sampled_bc_counter_in_wl[gem_group][lib] = {k : v for (k,v) in Counter(sampled_bc).iteritems() if k in unique_bc_in_wl}

It seemed odd to me that one would count first and then filter the result. However, when I ran a benchmark, that strategy was superior if the whitelist contained more than 15% of the unique keys:

import random
import timeit

test = [random.randrange(100) for i in range(1000)]
whitelist = set(range(15))

number = 100000
setup = 'from collections import Counter; from __main__ import test, whitelist'
print(timeit.timeit('{k: v for k, v in Counter(test).items() if k in whitelist}',
                    number=number, setup=setup))
print(timeit.timeit('Counter((x for x in test if x in whitelist))',
                    number=number, setup=setup))

Still, even in that situation, your code could be improved by calling Counter instead of set a few lines earlier. Both calls make the dataset unique and thus it should be done only once.

If you want to change it, I would be happy to submit a PR.

evolvedmicrobe commented 3 years ago

Thanks @andreasg123, with Cell Ranger 4.0 and later versions all this code has been re-written in a more performant compiled language (rust), and is significantly faster! Agreed, we had some optimizations available here! Thanks for the input.