Implement recompression in repacking, and auto compression heuristics

giovannipizzi commented 1 year ago

Here we implement the possibility to change the compression of each pack when the pack is repacked. This makes it possible to revise the compression choice that was made when pack_all_loose was called. This fixes #98.

In addition, we also implement a new compression mode, CompressMode.AUTO, in which some relatively inexpensive heuristics decide, at the object level, whether compressing is worthwhile. This fixes #14. Note that basic tests show that the function does what it is supposed to do in simple cases; the heuristics can be improved (in accuracy or in speed) in the future without any backward-incompatible change.
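
To illustrate the idea behind a mode like CompressMode.AUTO, here is a minimal sketch of a sample-based heuristic. It is not the code in this PR: the function name `looks_compressible`, the `threshold` value and the use of zlib at level 1 are assumptions made for the example.

```python
import os
import zlib


def looks_compressible(data: bytes, sample_size: int = 1024, threshold: float = 0.95) -> bool:
    """Guess whether `data` is worth compressing (hypothetical helper).

    Only the first `sample_size` bytes are compressed, with a fast zlib
    level, and the result is compared with the size of the sample.
    """
    if not data:
        return False
    sample = data[:sample_size]
    compressed = zlib.compress(sample, level=1)
    # Compress only if the sample shrinks by at least (1 - threshold).
    return len(compressed) < threshold * len(sample)


# Repetitive data compresses well; random bytes do not.
print(looks_compressible(b"abc" * 10_000))
print(looks_compressible(os.urandom(10_000)))
```

For objects smaller than `sample_size` the whole object is compressed just to make the estimate, so the heuristic only saves work on larger objects.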

In doing so, the compress parameter of pack_all_loose now also accepts a CompressMode object, so the new CompressMode.AUTO mode can already be used there.

In contrast, the functions that write directly to packs only support compressing either all objects or none. The best approach is probably to not compress by default, and then repack at the end with the CompressMode.AUTO mode (this is recommended anyway, as space may be wasted if the same object is added multiple times to a pack directly, and repacking reclaims this space).
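
The reasoning above can be sketched with a toy model. This is not the disk-objectstore implementation: `append_to_pack`, `repack` and `min_gain` are hypothetical names, the "pack" is a plain Python list, and the whole object is compressed to make the decision instead of a sample.

```python
import hashlib
import os
import zlib


def append_to_pack(pack: list, content: bytes) -> str:
    """Append to a toy append-only pack; duplicates are simply stored again."""
    hashkey = hashlib.sha256(content).hexdigest()
    pack.append((hashkey, content))
    return hashkey


def repack(pack: list, min_gain: float = 0.05) -> dict:
    """Rewrite the pack: one copy per hash, compressed only when it pays off."""
    new_pack = {}
    for hashkey, content in pack:
        if hashkey in new_pack:
            continue  # duplicate entry: this is the wasted space being reclaimed
        compressed = zlib.compress(content)
        if len(compressed) < (1 - min_gain) * len(content):
            new_pack[hashkey] = (True, compressed)
        else:
            new_pack[hashkey] = (False, content)
    return new_pack


pack = []
text = b"some very repetitive text " * 200
noise = os.urandom(2048)
for content in (text, noise, text):  # `text` is written twice
    append_to_pack(pack, content)

repacked = repack(pack)
print(len(pack), "entries before,", len(repacked), "after repacking")
```

In this toy model the duplicate copy of `text` disappears on repacking, `text` ends up compressed and `noise` stays uncompressed.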

In this PR we also fix a couple of minor failing tests in GitHub Actions (in the benchmarks and periodic tests), including, among other things, support for Python 3.11.

codecov[bot] commented 1 year ago

Codecov Report

Patch coverage: 100.00% and project coverage change: -0.69 :warning:

Comparison is base (83c6957) 99.55% compared to head (54c7ecc) 98.86%.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #148      +/-   ##
==========================================
- Coverage   99.55%   98.86%   -0.69%
==========================================
  Files           8        8
  Lines        1784     1858      +74
==========================================
+ Hits         1776     1837      +61
- Misses          8       21      +13
```

| [Impacted Files](https://app.codecov.io/gh/aiidateam/disk-objectstore/pull/148?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=aiidateam) | Coverage Δ | |
|---|---|---|
| [disk\_objectstore/container.py](https://app.codecov.io/gh/aiidateam/disk-objectstore/pull/148?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=aiidateam#diff-ZGlza19vYmplY3RzdG9yZS9jb250YWluZXIucHk=) | `99.41% <100.00%> (-0.47%)` | :arrow_down: |
| [disk\_objectstore/utils.py](https://app.codecov.io/gh/aiidateam/disk-objectstore/pull/148?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=aiidateam#diff-ZGlza19vYmplY3RzdG9yZS91dGlscy5weQ==) | `97.44% <100.00%> (-1.35%)` | :arrow_down: |

... and [1 file with indirect coverage changes](https://app.codecov.io/gh/aiidateam/disk-objectstore/pull/148/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=aiidateam)


giovannipizzi commented 1 year ago

I'm still checking some additional things; I want to get some statistics from a real, large DB.

giovannipizzi commented 1 year ago

I've done a test on a real AiiDA repository with ~1 million nodes (~40 GB).

Here is the result:

sizes.sum()=40344256249.0, compressed_lengths.sum()=7564127548.0 (18.7%), auto_compressed_lengths.sum()=7564886295.0 (18.8%)
Size of auto-uncompressed: 10230341.0 (smallest = 85.0, largest = 8576.0)
Size of auto-compressed:   40334025908.0 (smallest = 136.0, largest = 6314959.0)
Total number of objects: 1055219
Number of objects with compressed size > uncompressed: 183 (max (in %): 107.05882352941177
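
As a quick cross-check of the numbers printed above, the following sketch redoes the arithmetic. The values are copied from the output; the variable names are only illustrative.

```python
# Values copied from the output above (in bytes).
total = 40_344_256_249
all_compressed = 7_564_127_548
auto_compressed = 7_564_886_295
auto_left_uncompressed = 10_230_341
auto_chose_compress = 40_334_025_908

# The two groups chosen by the heuristics add up to the full dataset.
assert auto_left_uncompressed + auto_chose_compress == total

# Extra space used by the automatic mode compared with compressing everything.
extra_bytes = auto_compressed - all_compressed

print(f"compress everything: {all_compressed / total:.1%} of the original size")
print(f"automatic mode:      {auto_compressed / total:.1%} of the original size")
print(f"extra bytes with the automatic mode: {extra_bytes}")
print(f"data left uncompressed: {auto_left_uncompressed / total:.3%} of the total")
```

The difference between the two strategies is 758747 bytes, i.e. below 1 MB on a ~40 GB repository.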

Essentially, for the typical data I have, compressing everything instead of letting the heuristics choose would have gained me less than 1 MB. The heuristics work quite well, but that's because all objects are small, so I'm essentially always compressing the whole file to make the estimate:

Figure_1

However, this is a bit biased: I only have small objects (so it's not very interesting, since the heuristics essentially compress everything twice and I don't gain much). I would need a more interesting dataset with large files.

Anyway, from this, I don't think it makes much sense to add special logic for the small files: the time I spend on them is, I think, a small percentage of the total.

giovannipizzi commented 1 year ago

As an additional comment: it took ~360 s (6 minutes) to decompress everything starting from the fully compressed data, and 322 s starting from the automatically compressed data; fully recompressing everything from the decompressed state took 325 s, while repacking with the evaluation of whether to compress or not took 523 s (less than twice as long, as the evaluation is more efficient for files > ~1 MB).
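
For reference, the ratios implied by these timings can be computed as follows (a small sketch; the variable names are illustrative and the values are the ones quoted above).

```python
# Timings quoted above, in seconds.
decompress_from_full_s = 360
decompress_from_auto_s = 322
recompress_all_s = 325
repack_auto_s = 523

# Overhead of evaluating the heuristics on top of compressing everything.
slowdown = repack_auto_s / recompress_all_s
# Time saved when decompressing data that was only partly compressed.
decompress_saving_s = decompress_from_full_s - decompress_from_auto_s

print(f"auto repack vs full recompression: {slowdown:.2f}x")
print(f"decompression time saved: {decompress_saving_s} s")
```

The automatic mode is therefore about 1.6 times slower than compressing everything, which is consistent with "less than twice".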

giovannipizzi commented 1 year ago

The examples above use a sample_size of 32 kB. However, as I comment in my last commit, I have reverted to 1 kB: I tested and this seems to give much more accurate results...

I think it's OK to re-review @zhubonan thanks!

zhubonan commented 1 year ago

Great, then we don't bother with the small file filter. I can upload one of my databases for you to test.

For the histogram - is the y axis the count of objects or the total size?

giovannipizzi commented 1 year ago

Thanks Bonan - if you have a way to transfer it to me, that would be great. Ideally we would still do it today: I won't have much time in the next few weeks and I'd like to wrap this up today if we can manage, as I also want to make a release and include it in AiiDA. Let's see if this is possible (I'll work in parallel on the remaining tasks).

giovannipizzi commented 1 year ago

> For the histogram - is the y axis the count of objects or the total size?

It's the number of objects

giovannipizzi commented 1 year ago

Final notes on performance are added in #156
