Change default compression algorithm

dionhaefner commented 6 years ago

@j08lue found this interesting article benchmarking GDAL's compression algorithms. We are currently using LZW, which is horrible for floating-point data. Our priorities are read speed first, then compression ratio, then write speed, so ZSTD looks like a better alternative.

Documentation of GDAL's compression options can be found here.

mrpgraae commented 6 years ago

What is this none algorithm? It seems to be way better than everything else for speed, and for float compression it's still better than LZW.

Jokes aside, maybe we should add an option for turning off compression? I could see the increased disk use as an acceptable trade-off for speed (disk space is cheap).

dionhaefner commented 6 years ago

Something the article doesn't consider is large patches of nodata. So in practice, the space savings are often much bigger than the given compression ratios. So I think some light compression should still be the default, but I agree that an option to turn it off shouldn't hurt.

mrpgraae commented 6 years ago

> Something the article doesn't consider is large patches of nodata. So in practice, the space savings are often much bigger than the given compression ratios. So I think some light compression should still be the default, but I agree that an option to turn it off shouldn't hurt.

Yes, having compression on is a sane default. But being able to turn it off seems like a potentially very big performance boost, given that reading in the tiles is the current performance bottleneck, AFAIK?

dionhaefner commented 6 years ago

> given that reading in the tiles is the current performance bottleneck

That's true, but I don't know if we are actually CPU bound or IO bound in real-world applications (probably both; I would expect Lambda deployments to be IO bound, and WSGI deployments to be CPU bound). If we are IO bound, compression should actually increase performance :)

I'll introduce a flag.
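
To make the idea of a flag concrete, here is a minimal sketch of how a compression setting could be translated into GeoTIFF creation options. The function name `creation_options` and its parameters are hypothetical and not taken from the terracotta code base; `COMPRESS` and `PREDICTOR` are GDAL GTiff creation options, and the resulting dictionary could plausibly be passed on as keyword arguments when writing a raster with rasterio.

```python
def creation_options(compression="DEFLATE", floating_point=True):
    """Build GeoTIFF creation options for a hypothetical compression flag."""
    compression = compression.upper()
    if compression not in {"NONE", "DEFLATE", "LZW", "ZSTD"}:
        raise ValueError(f"unsupported compression: {compression}")

    # Turning compression off trades disk space for less CPU work on read.
    if compression == "NONE":
        return {"COMPRESS": "NONE"}

    # GDAL's predictor 3 targets floating-point data, predictor 2 integer data.
    return {"COMPRESS": compression, "PREDICTOR": 3 if floating_point else 2}


if __name__ == "__main__":
    print(creation_options("none"))
    print(creation_options("deflate"))
```

Keeping some compression as the default and treating `NONE` as an explicit opt-out matches the trade-off discussed above: nodata-heavy rasters stay small unless the user decides otherwise.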

dionhaefner commented 6 years ago

Can't use ZSTD compression yet, since the GDAL version linked via conda-forge is too old. I'd expect it to be bumped in one of the upcoming rasterio releases.

For the record, we have been using DEFLATE, not LZW.
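
Regarding the GDAL version problem, a guard along the following lines could decide whether ZSTD is worth attempting. This is only a sketch: the helper name `supports_zstd` is hypothetical, and the 2.3 threshold is based on the GDAL GTiff driver documentation, which lists ZSTD from GDAL 2.3 onwards. A new enough version is necessary but may not be sufficient, since the underlying libtiff also has to be built with ZSTD support. The version string would typically come from something like `rasterio.__gdal_version__`.

```python
def supports_zstd(gdal_version):
    """Return True if the GDAL version string is at least 2.3."""
    # Only major and minor matter here; patch levels and suffixes are ignored.
    major, minor = (int(part) for part in gdal_version.split(".")[:2])
    return (major, minor) >= (2, 3)


if __name__ == "__main__":
    for version in ("2.2.4", "2.3.0", "3.1.0"):
        print(version, supports_zstd(version))
```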

mrpgraae commented 6 years ago

> That's true, but I don't know if we are actually CPU bound or IO bound in real-world applications (probably both; I would expect Lambda deployments to be IO bound, and WSGI deployments to be CPU bound). If we are IO bound, compression should actually increase performance :)

Yes, the article's author ran the benchmark on an SSD machine, which is probably a crucial factor. I would expect non-SSD machines to be heavily I/O bound, so I would also expect compression to actually speed up reading there.

I looked at some other benchmarks of compression algorithms, but it seems to be very difficult to do a useful benchmark. Decompression performance is of course highly dependent on the data, but apparently also on CPU architecture and not just CPU speed, so it is highly situation and machine dependent. The order of magnitude appears to be 100 MB/s to 2 GB/s, whereas a modern SSD has read speeds of around 650 MB/s to 2.3 GB/s. In any case, I think it will be impossible to set a default that works equally well for all setups.
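
The I/O-bound versus CPU-bound argument can be turned into a rough back-of-the-envelope model. The sketch below assumes that disk reads and decompression overlap perfectly, so the slower stage sets the pace; the function name `effective_read_speed` and the compression ratio of 2 are assumptions for illustration, while the throughput figures are the ranges quoted above.

```python
def effective_read_speed(disk_mb_s, decompress_mb_s, ratio=1.0):
    """Uncompressed MB/s delivered when reading and decompression overlap.

    disk_mb_s:       raw read speed of the storage
    decompress_mb_s: decompressor output speed (float("inf") if uncompressed)
    ratio:           uncompressed size divided by compressed size
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    # Each compressed MB read from disk expands to `ratio` MB of data,
    # but the decompressor caps how fast that data can be produced.
    return min(disk_mb_s * ratio, decompress_mb_s)


if __name__ == "__main__":
    uncompressed = effective_read_speed(650, float("inf"))
    fast_codec = effective_read_speed(650, 2000, ratio=2.0)
    slow_codec = effective_read_speed(650, 100, ratio=2.0)
    print(uncompressed, fast_codec, slow_codec)
```

Under these assumptions, a fast decompressor beats uncompressed reads on the slower SSD, while a slow one becomes the bottleneck regardless of the disk, which is consistent with the conclusion that no single default suits every setup.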

DHI / terracotta

Change default compression algorithm #87