SilkeDH / lossy-ml

Lossy Compression Algorithm for Climate Data


Dataset

We are working with the "ERA5 hourly data on pressure levels from 1979 to present" and the "ERA5 hourly data on single levels from 1979 to present" datasets provided by the Copernicus Climate Data Store (https://cds.climate.copernicus.eu).

ERA5 data

Model

The compression algorithm is autoencoder-based. An autoencoder is a type of artificial neural network used to learn efficient data codings in an unsupervised manner. The aim of an autoencoder is to learn a representation (encoding) for a set of data. Along with this reduction side, a reconstruction side is learned, where the autoencoder tries to generate, from the reduced encoding, a representation as close as possible to its original input, hence its name.
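The encode/decode split above can be sketched in a few lines. This is a minimal, hypothetical illustration of the autoencoder idea with untrained linear layers and made-up dimensions; the actual model in this repository is built with TensorFlow and is more elaborate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: the latent code is 8x smaller than the input.
n_features, n_latent = 64, 8
W_enc = rng.normal(size=(n_latent, n_features)) * 0.1   # encoder weights
W_dec = rng.normal(size=(n_features, n_latent)) * 0.1   # decoder weights

def encode(x):
    """Map the input to its compressed latent representation."""
    return W_enc @ x

def decode(z):
    """Reconstruct an approximation of the input from the latent code."""
    return W_dec @ z

x = rng.normal(size=n_features)
z = encode(x)          # compressed encoding, shape (8,)
x_hat = decode(z)      # lossy reconstruction, shape (64,)
```

In a real autoencoder both weight matrices are trained jointly to minimize the reconstruction error between `x` and `x_hat`; only the latent code `z` needs to be stored.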

Performance

The performance of the presented compression algorithm is compared with zfp and SZ. zfp and SZ are both open-source libraries for compressed floating-point arrays that support high-throughput read and write random access.

Usage

An example of how to use the compression algorithm is given in the following Jupyter notebook.

The compression algorithm consists of two functions located in compress.py and decompress.py. To compress the data just call:

from lossycomp.compress import compress
from lossycomp.decompress import decompress

compressed_data = compress(data, abs_error, verbose=True)
decompressed_data = decompress(compressed_data, verbose=True)

The compression function also outputs information about the compressed data, such as the achieved compression factor and the space consumed by the elements of the compressed data.
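To illustrate what an absolute error bound (`abs_error`) and a compression factor mean, here is a toy stand-in: uniform quantization with step `2 * abs_error`, which guarantees the element-wise reconstruction error stays within the bound. This is not the repository's algorithm, just a self-contained sketch of the guarantee a lossy compressor with an absolute error bound provides.

```python
import numpy as np

abs_error = 0.01
rng = np.random.default_rng(1)
data = rng.normal(size=1000).astype(np.float64)

# Quantize to a grid of spacing 2*abs_error, so rounding moves each
# value by at most abs_error; store the grid indices as 16-bit ints.
step = 2 * abs_error
quantized = np.round(data / step).astype(np.int16)
reconstructed = quantized.astype(np.float64) * step

max_err = np.max(np.abs(data - reconstructed))
factor = data.nbytes / quantized.nbytes   # float64 (8 B) -> int16 (2 B)
print(max_err <= abs_error, factor)       # True 4.0
```

Any decompressed value therefore differs from the original by at most `abs_error`, and the compression factor is the ratio of original to compressed size.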

References

Datasets

Tensorflow

ZFP, FPZIP and SZ