giovannipizzi closed this 1 year ago
Patch coverage: 100.00% and project coverage change: -0.69% :warning:
Comparison is base (83c6957) 99.55% compared to head (54c7ecc) 98.86%.
I'm still checking some additional things, I want to get some statistics from a real big DB
I've done a test on a real AiiDA repository with ~1million nodes, ~40GB.
Here is the result:
```
sizes.sum()=40344256249.0, compressed_lengths.sum()=7564127548.0 (18.7%), auto_compressed_lengths.sum()=7564886295.0 (18.8%)
Size of auto-uncompressed: 10230341.0 (smallest = 85.0, largest = 8576.0)
Size of auto-compressed: 40334025908.0 (smallest = 136.0, largest = 6314959.0)
Total number of objects: 1055219
Number of objects with compressed size > uncompressed: 183 (max (in %): 107.05882352941177)
```
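As a sanity check on the numbers above, the quoted percentages are simply the compressed totals divided by the uncompressed total:

```python
# Totals reported in the test run above (bytes)
sizes_sum = 40344256249.0            # total uncompressed size
compressed_sum = 7564127548.0        # everything compressed
auto_compressed_sum = 7564886295.0   # compressed with the AUTO heuristic

print(f"{compressed_sum / sizes_sum:.1%}")       # fully-compressed ratio
print(f"{auto_compressed_sum / sizes_sum:.1%}")  # AUTO-compressed ratio
```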
Essentially, for my typical data, compressing everything instead of only the automatically selected objects would have gained less than a MB. The heuristic works quite well, but that's because all objects are small, so the heuristic essentially ends up compressing all of each file just to produce its estimate.

However, this is a bit biased: I only have small objects (and then the result is not very interesting, since in the heuristic I'm essentially compressing everything twice without gaining much). I would need a more interesting dataset with large files.

Anyway, from this I don't think it makes much sense to add special logic for small files: the time spent on them is, I think, a small percentage of the total.
As an additional comment: decompressing everything from the fully compressed state took ~360 s (6 minutes), versus 322 s from the automatically compressed state; fully recompressing everything from the decompressed state took 325 s, and repacking while evaluating whether to compress each object took 523 s (less than twice as long, since the evaluation is more efficient for files larger than ~1 MB).
The examples above use a `sample_size` of 32 kB. However, as I noted in my last commit, I reverted to 1 kB; I tested it and this seems to give much more accurate results.
I think it's OK to re-review @zhubonan thanks!
Great, then we won't bother with the small-file filter. I can upload one of my databases for you to test.
For the histogram - is the y axis the count of objects or the total size?
Thanks Bonan - if you have a way to transfer it to me, that would be great. If we could still do it today, that would be ideal: I won't have much time in the next few weeks and I'd like to wrap this up today if we manage, as I also want to make a release and use it in AiiDA. Let's see if this is possible (I'll work in parallel on the remaining tasks).
> For the histogram - is the y axis the count of objects or the total size?

It's the number of objects.
Final notes on performance are added in #156
We implement here the feature to change the compression of each pack when the pack is repacked. This makes it possible to change the compression if a different choice was made when `pack_all_loose` was called. This fixes #98.

In addition, we also implement a new compression mode, `CompressMode.AUTO`, where some relatively inexpensive heuristics decide whether it's worth compressing or not, at the level of each object. This fixes #14. Note that basic tests show that the function does what it is supposed to do in simple cases; the heuristics can be improved (in accuracy, or in speed) in the future without any backward-incompatible change.

In doing so, the `compress` parameter of `pack_all_loose` now also accepts a `CompressMode` object, making it possible to already use the new `CompressMode.AUTO` mode. Instead, the functions that write directly to packs only support compressing either all objects or none. The best approach is probably not to compress by default, and then repack at the end with the `CompressMode.AUTO` mode (repacking is recommended anyway, as there might be wasted space if the same object is added multiple times directly to a pack, and repacking will clean up this space).

In this PR we also fix a couple of minor failing GitHub Actions tests (in benchmarks and periodic tests), including, among other things, support for Python 3.11.