EndPointCorp / end-point-blog

End Point Dev blog
https://www.endpointdev.com/blog/

Comments for Postgres WAL files: best compression methods #1294

Open phinjensen opened 6 years ago

phinjensen commented 6 years ago

Comments for https://www.endpointdev.com/blog/2017/03/postgres-wal-files-best-compression/ By Greg Sabino Mullane

To enter a comment:

  1. Log in to GitHub
  2. Leave a comment on this issue.
phinjensen commented 6 years ago
original author: Jörg Sonnenberger
date: 2017-03-28T16:17:41-04:00

Are you aware that plain xz(1) has been supporting threads since 5.2?

phinjensen commented 6 years ago
original author: Greg Sabino Mullane
date: 2017-03-28T17:31:59-04:00

I prefer pxz as it has been around a lot longer than xz -T. I also ruled out pixz because this is a single small file, so there is no benefit to indexing. But I've no particular objection to people using xz -T: they are roughly equivalent for the task at hand, I suspect.
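The two options discussed above can be sketched as follows. This is a minimal illustration, not a production archive_command: the file name and size are stand-ins, and the pxz line is shown as a comment since it may not be installed.

```shell
# Toy stand-in for a 16 MiB WAL segment; real segments live in $PGDATA/pg_wal.
dd if=/dev/zero of=/tmp/walseg bs=1M count=16 2>/dev/null

# Built-in threading (xz >= 5.2): -T0 autodetects the core count,
# -k keeps the input file, -f overwrites any previous output.
xz -2 -T0 -k -f /tmp/walseg

# pxz accepts the same compression-level flag; shown for comparison:
#   pxz -2 /tmp/walseg

ls -l /tmp/walseg.xz
```

Note that on a single small file, -T0 may still end up using one thread if the file is smaller than xz's block size, which is part of why the two approaches come out roughly equivalent here.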

phinjensen commented 6 years ago
original author: Unknown
date: 2017-03-30T04:37:57-04:00

We were doing a similar comparison on our side to see if we'd have some gains.

We used gzip until a couple of years ago, but switched to bzip2 -9 in 2014 to save some space (at the cost of increased CPU usage, but new CPUs were cheaper than new storage). With the advent of xz it was time to do some testing again. xz is flag-compatible with gzip and bzip2, but the compression levels -1..-9 have a huge impact on memory consumption, CPU usage, and overall duration.

| Alg | Duration | Size |
|----------|----------|---------|
| bzip2 -9 | 2.1s | 4.5 MiB |
| xz -1 | 1.2s | 3.8 MiB |
| xz -2 | 1.7s | 3.6 MiB |
| xz -3 | 7.6s | 3.2 MiB |
| xz -4 | 7.9s | 3.1 MiB |
| xz -5 | 8.2s | 3.1 MiB |
| xz -6 | 9.6s | 3.0 MiB |
| xz -7 | 10.2s | 3.0 MiB |
| xz -8 | 12.5s | 3.0 MiB |
| xz -9 | 10.4s | 3.0 MiB |

Of course the results cannot be generalized, but I found them consistent across different sample data sets for different databases. I decided to choose xz -2 as it gives a good trade-off between speed and compression ratio. After all, it's ~19% faster than our current bzip2 while offering a ~20% better compression ratio. The additional savings from higher levels are paid for with longer run times - and nowadays I'd like to avoid eating CPU in a virtual environment so as not to be a noisy neighbour.
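A level sweep like the one behind the table above can be reproduced with a small loop. This is a rough sketch assuming GNU coreutils (for `date +%s%N`); the sample file is synthetic, so the absolute numbers will not match the table.

```shell
# Generate a synthetic sample file (real tests should use actual WAL data).
dd if=/dev/urandom of=/tmp/sample bs=1M count=4 2>/dev/null

# Compress at several levels, reporting wall-clock time and output size.
for level in 1 2 3; do
    start=$(date +%s%N)                  # GNU date: nanoseconds since epoch
    xz -k -f -"$level" /tmp/sample
    end=$(date +%s%N)
    echo "xz -$level: $(( (end - start) / 1000000 )) ms, $(wc -c < /tmp/sample.xz) bytes"
done
```

Random data is nearly incompressible, so for meaningful size numbers the loop should be pointed at representative WAL segments.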

So, let's have a look at real-world results from our compression. Of course the exact number of archive files varies from day to day, but I think this is good example data:

general OLTP database

    20161215_0214: count=1961 size=16G
    20161216_0214: count=2022 size=18G
    20161217_0214: count=1784 size=16G
    20161218_0214: count=2085 size=16G
    20161219_0214: count=1488 size=7.8G

billing database

    20161215_0120: count=753 size=4.5G
    20161216_0120: count=608 size=3.8G
    20161217_0120: count=605 size=3.8G
    20161218_0120: count=582 size=3.2G
    20161219_0120: count=490 size=1.5G

As you can see, although we had slightly fewer archive files, there's a considerable improvement in space usage. The overall compression job also finished faster in total, so we have a win on all fronts.
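Per-day summaries in the format shown above can be produced with a short loop over the archive directories. This is a hypothetical reconstruction, not the author's actual script; the /tmp/archive layout is fabricated here just so the example runs anywhere.

```shell
# Fabricated archive layout: one daily directory with one compressed segment.
mkdir -p /tmp/archive/20161215_0214
: > /tmp/archive/20161215_0214/seg1.xz

# Emit one "name: count=N size=S" line per daily archive directory.
for d in /tmp/archive/*/; do
    name=$(basename "$d")
    count=$(find "$d" -type f | wc -l)
    size=$(du -sh "$d" | cut -f1)
    echo "$name: count=$count size=$size"
done
```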

Cheers,

Wiktor

phinjensen commented 6 years ago
original author: Greg Sabino Mullane
date: 2017-04-02T13:50:00-04:00

Wiktor, thanks for your comments. Interestingly, xz -2 is also the best tradeoff I found for a long time - before the advent of multi-threaded versions. It is still a pretty good default value!