Inefficient ProxyDatabase slicing

DominikStiller commented 5 days ago

Operations on a ProxyDatabase that loop over all records slow down/do not have constant iteration time (as measured by tqdm's it/s), which is especially noticeable for large databases. Take for example slicing: https://github.com/fzhu2e/cfr/blob/a99c13bfeae4643b64c63be339894b48abd7f5ef/cfr/proxy.py#L1730-L1746

The operation new += spobj calls .refresh() every time, which itself iterates over all proxies. Therefore, it is O(n^2), while it could easily be O(n) if they were added to a dictionary first, then converted to a ProxyDatabase at the end. Is there a reason this is not being done?

fzhu2e commented 5 days ago

Hi Dominik, thanks for pointing this out! I'm sure the performance of the codebase is not optimal and lots of improvements could be done. I developed most of the infrastructure from scratch in a relatively short time period, and the runtime speed was usually not the priority as long as it's not significantly slowing down the major tasks. Now that this project is open to the community, you are mostly welcome to implement those improvements and submit PRs. Once there is sufficient upgrade of the codebase in terms of features and performance, we may submit another manuscript to report the updates.

DominikStiller commented 5 days ago

Thanks for clarifying, I'll submit a PR for this issue!

fzhu2e / cfr

Inefficient ProxyDatabase slicing #20