Challenge 22 - Discovering hidden patterns on Climate Data Store

Stream 2 - Machine Learning for Earth Science

Goal

The CDS produces a wide set of transactional records and operational logs. These contain a lot of hidden information that could give valuable insight into system patterns, user behaviours and preferences, and early warnings of problems. Being able to understand and predict these could lead to improvements in the system and a more dynamic Quality of Service (QoS) configuration.

The aim of the project is to explore what ML/AI can do to reveal this information and how it could later be applied in CADS operations.

Mentors and skills

Mentors: Angel Lopez, Gionata Biavati
Skills required:
- Python (numpy, pandas, xarray)
- ML/AI models (Python libraries)
- SQL
- Splunk (Optional)

Note: Only nationals from European Union (EU) Member States and countries associated with EU’s Space Programme (currently Iceland and Norway) are eligible to participate (see Terms and Conditions).

Challenge description

Currently, the information obtained about users is based on very generic indicators and graphs. Going deeper into the exploration of data and logs is done case by case when particular issues or requests need to be addressed.

With the current number of transactions and volume of data, these case-by-case investigations are becoming more and more complicated.

Data/System to use

Transactional information (user requests) for the Climate and Atmosphere Data Stores is stored in a Postgres DB. Operational information from the system components is recorded in different logs.

Both sources of information are indexed in Splunk in near real time. The information can be used directly in Splunk or exported for use in other environments.
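As a rough illustration of working with exported data outside Splunk, the sketch below turns a small CSV export of user requests into per-user hourly request counts, a typical first feature for later pattern analysis. The column names (user_id, dataset, submitted_at, status) and the sample rows are assumptions made for this example, not the actual CDS schema or real data.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Hypothetical export: column names and values are placeholders,
# not the real CDS request schema.
SAMPLE_EXPORT = """user_id,dataset,submitted_at,status
u1,dataset-a,2023-03-01T10:05:00,successful
u1,dataset-a,2023-03-01T10:40:00,successful
u2,dataset-b,2023-03-01T10:15:00,failed
u1,dataset-a,2023-03-01T11:02:00,successful
"""


def hourly_request_counts(csv_text):
    """Count requests per (user, hour) from an exported CSV string."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        submitted = datetime.fromisoformat(row["submitted_at"])
        # Truncate the timestamp to the start of its hour.
        hour = submitted.replace(minute=0, second=0, microsecond=0)
        counts[(row["user_id"], hour)] += 1
    return counts


counts = hourly_request_counts(SAMPLE_EXPORT)
for (user, hour), n in sorted(counts.items()):
    print(user, hour.isoformat(), n)
```

For realistic data volumes, the same aggregation would more naturally be done with pandas or directly in SQL; the standard library is used here only to keep the sketch self-contained.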

Solution

Applying ML/AI models to the data collected by the system will allow the extraction of hidden knowledge about user patterns, cause-and-effect relationships behind issues, and more.

This knowledge will allow us to better understand the system, put in place a more dynamic configuration (QoS), tune the system, implement new features, inform users, organise the catalogue structure, and so on.

Ideas for the implementation

Quality of Service (QoS) rules are a key component in protecting the system, as they manage processing requests by balancing users' requirements against the resources available at system level. Currently, QoS is managed manually, based on perceptions and on the reports consulted. The output of this project could trigger automatic updates to some QoS rules based on the discovered information.
Issues on the system usually start as a consequence of actions by abusive users or of poorly constructed requests. Sometimes these have well-known causes (e.g. users fishing for the latest available data even before it is released). The better we understand these behaviours, the better we can react and put contingency actions in place.
Do users download exactly what they need, or are they forced to download more and later extract what they are really looking for? What are they looking for that is not there, and what information can we get from this to improve system performance and/or access to data? (e.g. How do users structure their requests? Is that optimal? Are they looking for something in the wrong way, or even for something that does not exist yet?)
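As one possible starting point for the idea of spotting abusive behaviour, the sketch below flags users whose daily request count is an extreme outlier compared with the rest. It is a simple statistical baseline (the modified z-score based on the median and the median absolute deviation), not an ML model, and it could serve as a reference against which learned models are compared. The function name, the input dictionary and the sample counts are hypothetical; 3.5 is the commonly used cut-off for this score, not a value taken from the CDS.

```python
from statistics import median


def flag_heavy_users(requests_per_user, threshold=3.5):
    """Return users whose request count is an extreme high outlier.

    Uses the modified z-score: 0.6745 * (x - median) / MAD,
    where MAD is the median absolute deviation.
    """
    counts = list(requests_per_user.values())
    med = median(counts)
    mad = median(abs(c - med) for c in counts)
    if mad == 0:
        # All users behave almost identically: nothing to flag this way.
        return []
    flagged = []
    for user, count in requests_per_user.items():
        score = 0.6745 * (count - med) / mad
        if score > threshold:  # only unusually high activity is of interest
            flagged.append(user)
    return sorted(flagged)


# Hypothetical daily request counts per user (illustrative values only).
daily_requests = {"u1": 12, "u2": 9, "u3": 15, "u4": 11, "u5": 480, "u6": 10}
heavy = flag_heavy_users(daily_requests)
print(heavy)
```

Users flagged in this way could then be reviewed manually or used as input when proposing adjustments to QoS rules.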
