ECMWFCode4Earth / challenges_2023

Discover the ECMWF Code for Earth 2023 challenges

Challenge 12 - Compression of Geospatial Data with Varying Information Density #3

Open EsperanzaCuartero opened 1 year ago

EsperanzaCuartero commented 1 year ago

Challenge 12 - Compression of Geospatial Data with Varying Information Density

Stream 1 - Software Developments for Earth Sciences

Goal

Development of an information-density adapting compression

Mentors and skills


Note: Only nationals or residents from the ECMWF Member States and Co-operating States are eligible to participate (see Terms and Conditions).


Challenge description

Geospatial data can vary in its information density from one part of the world to another. A dataset containing streets will be very dense in cities but contain little information in remote places like the Alps or the open ocean. The same is true for datasets about the ocean or the atmosphere. The variability of sea surface temperatures and currents is much larger in the vicinity of the Gulf Stream than in the middle of the Atlantic basin. This variability can also change in time. A hurricane, for example, has a lot of variability in winds, temperature and rain rates, and it also travels across entire ocean basins.

The challenge of this project is to improve xbitinfo so that it preserves the natural variability of these features without storing random noise where the real information density is low. In particular, this means that the number of bits that must be preserved during compression changes with location. A hurricane has a different information density than a same-sized area in the steadily blowing trade-wind regimes. The compressibility of climate data can therefore change drastically in time and space, which we want to exploit.
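To make "the number of bits changes with location" concrete, here is a minimal NumPy sketch of bit rounding with a different number of kept mantissa bits in two regions of one field. It is an illustration only, not xbitinfo's implementation: the function name `bitround`, the region split, and the keepbits values 12 and 4 are assumptions chosen for the example.

```python
import numpy as np

def bitround(values: np.ndarray, keepbits: int) -> np.ndarray:
    """Round float32 values to `keepbits` mantissa bits (nearest, ties to even)."""
    if not 0 <= keepbits <= 23:
        raise ValueError("float32 has 23 mantissa bits")
    if keepbits == 23:
        return values.astype(np.float32, copy=True)
    bits = np.ascontiguousarray(values, dtype=np.float32).view(np.uint32)
    drop = 23 - keepbits                          # trailing bits to zero out
    mask = np.uint32((0xFFFFFFFF >> drop) << drop)
    half_minus_one = np.uint32((1 << (drop - 1)) - 1)
    last_kept = (bits >> np.uint32(drop)) & np.uint32(1)
    # Adding half an ulp (minus one, plus the last kept bit) rounds ties to even.
    rounded = (bits + half_minus_one + last_kept) & mask
    return rounded.view(np.float32)

# Hypothetical field: the left half stands for an "active" region,
# the right half for a "quiet" one. The keepbits values are made up.
rng = np.random.default_rng(0)
field = rng.normal(loc=290.0, scale=5.0, size=(4, 8)).astype(np.float32)

compressed = np.empty_like(field)
compressed[:, :4] = bitround(field[:, :4], keepbits=12)  # more precision
compressed[:, 4:] = bitround(field[:, 4:], keepbits=4)   # fewer bits kept
```

The zeroed trailing bits are what a lossless compressor applied afterwards can exploit, so the quiet half should compress better than the active half.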

Currently in the bitinformation framework, preserving all real information requires applying the maximum information content calculated by xbitinfo to the entire dataset. However, bitinformation can also be calculated on subsets, so that the ‘boring’ parts can be compressed more efficiently.
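The idea of computing bitinformation on subsets can be sketched with plain NumPy. The code below estimates, for each of the 32 bit positions of float32 values, the mutual information between neighbouring values along the last axis, and then picks the number of mantissa bits needed to retain 99% of that information. This is a simplified sketch, not xbitinfo's API: the names `bitwise_information` and `keepbits_for` are hypothetical, and the fixed `noise_floor` cutoff is a crude assumption standing in for a proper statistical significance test.

```python
import numpy as np

NBITS = 32         # float32
SIGN_EXP_BITS = 9  # 1 sign bit + 8 exponent bits

def bitwise_information(chunk: np.ndarray) -> np.ndarray:
    """Mutual information (in bits) per bit position between neighbours."""
    bits = np.ascontiguousarray(chunk, dtype=np.float32).view(np.uint32)
    a = bits[..., :-1].ravel()   # each value ...
    b = bits[..., 1:].ravel()    # ... and its neighbour along the last axis
    info = np.zeros(NBITS)
    for pos in range(NBITS):     # pos 0 = sign bit, pos 31 = last mantissa bit
        shift = np.uint32(NBITS - 1 - pos)
        x = ((a >> shift) & np.uint32(1)).astype(np.intp)
        y = ((b >> shift) & np.uint32(1)).astype(np.intp)
        joint = np.bincount(2 * x + y, minlength=4).reshape(2, 2) / x.size
        px = joint.sum(axis=1, keepdims=True)
        py = joint.sum(axis=0, keepdims=True)
        nz = joint > 0
        info[pos] = np.sum(joint[nz] * np.log2(joint[nz] / (px @ py)[nz]))
    return np.maximum(info, 0.0)  # clip tiny negative rounding errors

def keepbits_for(chunk: np.ndarray, inflevel: float = 0.99,
                 noise_floor: float = 1e-3) -> int:
    """Mantissa bits needed to retain `inflevel` of the chunk's information."""
    info = bitwise_information(chunk)
    info[info < noise_floor] = 0.0  # assumption: crude noise filter
    total = info.sum()
    if total == 0.0:
        return 0
    cdf = np.cumsum(info) / total
    last_bit = int(np.argmax(cdf >= inflevel))  # first position reaching level
    return max(0, last_bit + 1 - SIGN_EXP_BITS)

# Two synthetic chunks: a smooth signal, and the same signal with added noise.
n = 20_000
t = np.linspace(0.0, 2.0 * np.pi, n)
rng = np.random.default_rng(1)
smooth = (1.5 + 0.4 * np.sin(t)).astype(np.float32)
noisy = (1.5 + 0.4 * np.sin(t) + rng.normal(0.0, 0.01, n)).astype(np.float32)

keep_smooth = keepbits_for(smooth)
keep_noisy = keepbits_for(noisy)
print(f"keepbits smooth chunk: {keep_smooth}, noisy chunk: {keep_noisy}")
```

With one keepbits value per chunk instead of a single dataset-wide maximum, the noisy chunk is rounded more aggressively while the smooth chunk keeps the precision its signal actually uses.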

Xbitinfo is an open-source Python package that enables lossy compression of geospatial data based on its information content. Embedded into the pangeo ecosystem, xbitinfo builds on top of xarray and dask and allows for fast compression and analysis of various data formats including netCDF and zarr. Xbitinfo addresses the challenge of the increasingly large, chunked datasets that growing compute power makes it possible to create. Climate simulations at sub-km resolution with petabytes of output are just one example where xbitinfo can help to keep the dataset manageable.

The successful applicant will refine the application of xbitinfo to data subsections (chunks) and improve our ability to compress spatially and temporally varying fields. Furthermore, the applicant will learn about information theory and software engineering with international mentors.

References:

edwardhartnett commented 1 year ago

Note that we recently added some compression features to netCDF, including support for lossy compression and for the faster zstandard compression library. These may be helpful to those working on this challenge. For more details see: https://www.researchgate.net/publication/365006139_NetCDF_Compression_Improvements

milankl commented 1 year ago

Amazing, thanks Ed. Great summary!

ayoubft commented 1 year ago

Hello there!

I came across this project and it immediately caught my attention. The idea seems very interesting and I would love to learn more about it. I am writing to express my keen interest in this project.

I started drafting a proposal, and during my research I found that this project is also listed as a Google Summer of Code (GSoC) project.

Please let me know if there are any updates regarding the project, considering that the GSoC deadline is April 4th.

milankl commented 1 year ago

Hi Ayoub! Thanks for your interest!! Yes, indeed, we also got this project into the Google Summer of Code, meaning that it is possible to get funding through either track. Note the different deadlines though. We therefore expect two participants (one from Code for Earth, one from Summer of Code) to work on xbitinfo simultaneously. Depending on the proposals, we will then define the individual projects in discussion with the participants so that they are somewhat independent of one another. For us mentors it makes no difference whether you are accepted through Summer of Code or Code for Earth, but the programmes are distinct and there's only funding to accept one participant from each.

So yes, please write down your ideas and interests in a proposal and apply! You can also pick up ideas from the project ideas we wrote down for GSoC. In the end, we would like to see that you have understood the challenge, have ideas for how to solve it, and are motivated to work on this during the summer.

ayoubft commented 1 year ago

Thank you so much Milan for your response and for clarifying the details about the project and the funding options available. I've already taken a look at the project idea listed for GSoC, and I'm excited to continue working on my proposal. This project is a fantastic opportunity to learn and develop new skills, and I'm eager to understand the challenge and come up with innovative ideas for how to solve it.