Closed by MHLoppy 1 year ago
It's actually not so simple. I've got over 50,000 user-submitted compression results (31,993 if you count only version 3), and the compression ratio varies drastically depending on the program or game you compress. Something like Compactor can actually give you an estimate: it first runs a fast partial compression over the data (for example, if a file is 5GB, it will randomly sample 5KB of it) and uses the result to guess how well the whole thing will compress.
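To make the sampling idea concrete, here is a minimal sketch of a sample-based estimate. It is not how Compactor is implemented: `estimate_ratio`, its parameters, and the use of `zlib` as a stand-in compressor are all assumptions chosen for illustration.

```python
import os
import random
import zlib


def estimate_ratio(path, sample_size=4096, num_samples=64, seed=None):
    """Estimate compressed/original size from random samples of one file.

    Hypothetical helper: zlib stands in for whatever compressor would
    really be used, so the number is only a rough indicator.
    """
    rng = random.Random(seed)
    file_size = os.path.getsize(path)
    if file_size == 0:
        return 1.0  # nothing to compress

    raw_bytes = 0
    packed_bytes = 0
    with open(path, "rb") as f:
        for _ in range(num_samples):
            # Pick a random offset that leaves room for a full sample
            # (falls back to offset 0 for files smaller than one sample).
            offset = rng.randrange(max(1, file_size - sample_size + 1))
            f.seek(offset)
            chunk = f.read(sample_size)
            raw_bytes += len(chunk)
            packed_bytes += len(zlib.compress(chunk))
    return packed_bytes / raw_bytes
```

A ratio near 1.0 suggests the file is already compressed or encrypted, while a low ratio suggests there is something to gain. Small samples tend to understate what a real compressor achieves on the full file, so treat the output as a hint rather than a prediction.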
I was in the process of doing something similar last year before I got distracted and had to put this project aside.
I've used the utility quite a lot (thanks for making it!), so I'm familiar with the wildly different compression ratios depending on what's being compressed.
Maybe we were coming at it from different perspectives - I wasn't trying to answer "how much will this unknown folder compress", but rather "is it worth using one of the heavier compression algorithms on this folder". Plotting the AppIDs that have a result for more than one algorithm shows that the relative differences in compression efficiency between algorithms are similar irrespective of the absolute amount of compression (with only a few outliers).
The key point is that when one algorithm can compress only a little, the next-heaviest algorithm is extremely unlikely to gain substantially more. When an algorithm does quite well, the next-heaviest algorithm will have a higher average (absolute) gain, and so on.
Thus, the average values do provide actionable insight on whether an otherwise-unknown folder is likely to be worth compressing using a heavier algorithm.
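For anyone who wants to check this against the raw results, a rough sketch of the comparison follows. The row layout `(app_id, algorithm, savings_pct)`, the placeholder algorithm names in `ALGORITHMS`, and both function names are assumptions for illustration, not the spreadsheet's actual schema.

```python
from collections import defaultdict
from statistics import mean

# Placeholder names, ordered lightest to heaviest (assumed, not the real ones).
ALGORITHMS = ["light", "medium", "heavy", "heaviest"]


def stepwise_gains(results):
    """Return (lighter_savings, gain) for each adjacent algorithm pair.

    `results` is an iterable of (app_id, algorithm, savings_pct) rows.
    If an AppID has several rows for one algorithm, the last one wins.
    """
    by_app = defaultdict(dict)
    for app_id, algorithm, savings_pct in results:
        by_app[app_id][algorithm] = savings_pct

    pairs = []
    for savings in by_app.values():
        for lighter, heavier in zip(ALGORITHMS, ALGORITHMS[1:]):
            if lighter in savings and heavier in savings:
                base = savings[lighter]
                pairs.append((base, savings[heavier] - base))
    return pairs


def mean_gain_by_bucket(pairs, bucket_width=20):
    """Average the absolute gain, grouped by how well the lighter algorithm did."""
    buckets = defaultdict(list)
    for base, gain in pairs:
        bucket_start = int(base // bucket_width) * bucket_width
        buckets[bucket_start].append(gain)
    return {start: mean(gains) for start, gains in sorted(buckets.items())}
```

If the pattern described above holds, the mean gain should rise from the low-savings buckets to the high-savings ones.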
In any case, if you are planning to add a "pre-compression check" as you've described, then this would be much less important, and we could get real information to make a decision for the files in a specific unknown folder instead!
@MHLoppy Oh, I see what you're saying now! That's quite the impressive chart too
I think that including the average reported compression rates would be helpful for at-a-glance information to help make informed decisions about what algorithm is worth using if data for that specific game/program isn't available.
I calculated the average % savings reported in the spreadsheet as follows:
You might have more up-to-date data to calculate with, but I imagine the results would be similar overall.
Edit: Using the filtered data from the follow-up (only results that include data for all four algorithms) gives the following results instead, which are probably better at summarizing the relevant data (though also imperfect!):
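For reference, the filtering step can be sketched roughly as below. This is an illustration under assumptions, not the exact calculation used for the figures in this thread: the row layout `(app_id, algorithm, savings_pct)` and the function name `average_savings` are hypothetical, and repeated submissions for the same AppID and algorithm are averaged first so that popular titles do not dominate.

```python
from collections import defaultdict
from statistics import mean


def average_savings(results, algorithms, require_all=True):
    """Average % savings per algorithm across AppIDs.

    With require_all=True, only AppIDs that have at least one result
    for every algorithm in `algorithms` are counted.
    """
    by_app = defaultdict(lambda: defaultdict(list))
    for app_id, algorithm, savings_pct in results:
        by_app[app_id][algorithm].append(savings_pct)

    per_algorithm = defaultdict(list)
    for submitted in by_app.values():
        if require_all and not all(a in submitted for a in algorithms):
            continue  # skip AppIDs missing any algorithm
        for algorithm in algorithms:
            if algorithm in submitted:
                # One value per AppID: the mean of its submissions.
                per_algorithm[algorithm].append(mean(submitted[algorithm]))

    return {a: mean(per_algorithm[a]) for a in algorithms if per_algorithm[a]}
```

Calling it once with `require_all=False` and once with `require_all=True` gives the unfiltered and filtered averages side by side.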