bauman / python-idzip

Seekable, gzip compatible, compression format
MIT License
15 stars 12 forks source link

_select_member optimization #8

Open mobiusklein opened 6 years ago

mobiusklein commented 6 years ago

Profiling of randomly accessing a large number of positions from a large file shows that more time is spent linearly searching for _Member objects than actually performing complex computation on them. I propose to optimize IdzipReader._select_member to detect when the position requested is within the set of parsed members and use a binary search to select the correct member in O(logn) time rather than O(n) time.

Is it at any point reasonable to just load all _Members in a single pass?

mobiusklein commented 6 years ago

This appeared to be related to unfilled members caused by flushing the file.

bauman commented 6 years ago

reopening because this might be easy.

the self._members variable is parsed and loaded when you perform idzipfile.open is performed. The content of the member doesn't need to be parsed yet.

bauman commented 6 years ago

whoops, I read that as chunks, not members. D'oh.

one key aspect I abuse about this implementation is that it does very little actual disk IO until read is called. Hesitant to load a members up front because we might need to io through the file to snag all the headers.

thoughts on perhaps a special (non-default?) precache flag on open?

mobiusklein commented 6 years ago

I had been using a workaround in one utility script I wrote to get around the problem fixed in PR #7, which involved defensively flushing the writer prior to filling up a full member, so I had 100x the number of members I should have had. The natural transition point between linear and binary search should have come much later.

Lazily loading members is the right way to go. Now that I've removed the workaround, the file is packed into three members, which will be searched faster using a linear method still. If I really had a file with that many full members, I'd be better off loading them lazily and using a binary search to find the members who are absolutely known, and add some special case logic for the last member in the list.