biotite-dev / biotite

A comprehensive library for computational molecular biology
https://www.biotite-python.org
BSD 3-Clause "New" or "Revised" License

Support compressed download in `database.rcsb.fetch()` #532

Open padix-key opened 5 months ago

padix-key commented 5 months ago

The RCSB PDB also provides all files in gzipped format. Therefore, to improve download times in `database.rcsb.fetch()`, one could optionally download the gzipped files and unzip the HTTP response content via Python's `gzip` module before writing the structure file to disk.
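A minimal sketch of the idea, not Biotite's actual implementation: the helper names `decode_payload()` and `download_structure()` are hypothetical, and the URL pattern is an assumption about the RCSB file download service rather than something stated in this issue.

```python
import gzip
import urllib.request

# Assumed URL pattern for RCSB file downloads (not taken from this issue).
RCSB_URL = "https://files.rcsb.org/download/{pdb_id}.{suffix}"


def decode_payload(payload: bytes, compressed: bool) -> bytes:
    """Return the raw file content, gunzipping it first if requested."""
    return gzip.decompress(payload) if compressed else payload


def download_structure(pdb_id: str, path: str, compressed: bool = False) -> None:
    """Hypothetical helper: download a PDB file and write it to disk unzipped."""
    suffix = "pdb.gz" if compressed else "pdb"
    url = RCSB_URL.format(pdb_id=pdb_id, suffix=suffix)
    # Fetch the (optionally gzipped) HTTP response content.
    with urllib.request.urlopen(url) as response:
        payload = response.read()
    # The file on disk is always the uncompressed structure file.
    with open(path, "wb") as file:
        file.write(decode_payload(payload, compressed))
```

With this shape, a call such as `download_structure("1L2Y", "1l2y.pdb", compressed=True)` would transfer the gzipped file but still leave a plain `.pdb` file on disk, so downstream code would not need to change.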

Orpowell commented 5 months ago

Hi!

I'm really keen to contribute to Biotite, so I ran a few tests on this. It seems that the speed-up from downloading gzipped files is fairly negligible once you account for the time required to unzip the file. The results were generated using `repeat()` from `timeit` with 10 runs of 100 repetitions each (1000 repetitions in total) and are in the table below. You can find the test code here.

| download type | time (s) |
| --- | --- |
| pdb | 5.02787 |
| gzipped pdb | 5.00965 |
| difference | 0.01822 |
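A benchmark of this kind could be set up roughly as follows. This is only a sketch and not the linked test code: `benchmark()` is a hypothetical helper, and the callable passed to it stands in for the plain or gzipped download being measured.

```python
import timeit


def benchmark(func, runs=10, repetitions=100):
    """Time func() with timeit.repeat().

    Returns the list of total seconds per run (one entry per run, each
    covering `repetitions` calls) and the best mean time per call.
    """
    totals = timeit.repeat(func, repeat=runs, number=repetitions)
    best_per_call = min(totals) / repetitions
    return totals, best_per_call
```

Taking the minimum over the runs, as the `timeit` documentation suggests, reduces the influence of other processes on the measurement. For network downloads the variance between runs can still easily exceed a difference of a few hundredths of a second.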

There might be a way to eke out more performance, but I'm not sure how you'd do it. If you still think this is worth adding to the library, I'm happy to finish off the implementation. Let me know what you think!

Cheers,

Ollie

padix-key commented 5 months ago

Thanks for the benchmark. I created a modified version of your script (larger structure, omitted writing step) and found similar results: the differences are marginal, and it is not clear which one is faster.

Still, a compressed download probably makes sense when bandwidth is the limiting factor. I just would not use it as the default. So if you would still like to implement this feature, feel free to do so 👍.

Orpowell commented 4 months ago

Awesome, I'll start working on it 👍.