biocore / biom-format

The Biological Observation Matrix (BIOM) Format Project
http://biom-format.org
Other
89 stars 95 forks source link

defaultdict(int) vs Counter() and memory usage #905

Closed peterjc closed 1 year ago

peterjc commented 1 year ago

While submitting #904, I was looking at this bit of the code in biom/parse.py function parse_uc (Create a Table object from a uclust/usearch/vsearch uc file):

https://github.com/biocore/biom-format/blob/2.1.14/biom/parse.py#L282

I think it would be more memory efficient to replace defaultdict(int) with Counter(), see https://github.com/PyCQA/flake8-bugbear/issues/323

However, that probably deserves a little benchmarking by someone familiar with the code to see if it actually matters for the memory overhead here?

wasade commented 1 year ago

That's a really interesting observation. I think we're safe here though as we are not testing for the presence of missing keys, but instead incrementing specific keys (see https://github.com/biocore/biom-format/blob/2.1.14/biom/parse.py#L338). In https://github.com/PyCQA/flake8-bugbear/issues/323, I think what's driving the memory bloat is the if threshold < counts[str(x)] check forcing the creation of a key : default value pair.

peterjc commented 1 year ago

Probably OK then - yes, the memory bloat is if you try to access the missing entries (because it then adds a 0 entry with the key), made worse if you use long strings as keys as I was in my own code.