As part of this work we should also do projections of requirements for running bigger simulations, now and for every year going forward.
Per Tsengdar:
Per Laura:
Working on it as part of the SC24 presentation.
- Resolution: required resolution to be run.
- Model skill: physical processes to be resolved.
- Throughput: wall time for the target simulation.
- Features: required capabilities of the technology to express the science.
- Maintainability: tools to ensure enduring good science code.
- Technical debt: managing the inevitable growth in code.
- Time to solution: required wall time on a given hardware.
- Energy use: per-hardware energy use (in kW).
- Hardware optimization: per-hardware memory bandwidth usage (in % of the theoretical maximum); see the sketch after this list.
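
For the hardware optimization metric, a minimal sketch of how it could be reported, assuming we get the measured memory traffic and the node's theoretical peak bandwidth from a profiler (all numbers below are placeholders):

```python
# Minimal sketch: report achieved memory bandwidth as a fraction of the
# node's theoretical peak. Inputs (traffic, wall time, peak bandwidth) are
# placeholders that would come from profiling a real run.

def bandwidth_utilization(bytes_moved: float, wall_time_s: float,
                          peak_bw_gb_s: float) -> float:
    """Achieved bandwidth in % of the theoretical maximum."""
    achieved_gb_s = bytes_moved / 1e9 / wall_time_s
    return 100.0 * achieved_gb_s / peak_bw_gb_s

# Hypothetical run: 1.2 PB of DRAM traffic in 600 s on a node with a
# 3.2 TB/s aggregate peak.
print(f"{bandwidth_utilization(1.2e15, 600.0, 3200.0):.1f}% of peak bandwidth")
```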
Previous benchmarks have been done with the "node-to-node" metric, to answer the question "can we replace a CPU node with a GPU node?".
As we gear toward operations, this metric is no longer enough; it should be backed by more scientifically relevant metrics (gridpoints per second, SYPD, SDPD (which seems to be the GMAO's preferred metric), etc.).
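
A minimal sketch of those throughput metrics, assuming we have a measured wall time for a known simulated window (the example numbers are hypothetical):

```python
# Minimal sketch of the throughput metrics: SDPD (simulated days per day)
# and SYPD (simulated years per day) from a measured wall time.

SECONDS_PER_DAY = 86400.0

def sdpd(simulated_days: float, wall_time_s: float) -> float:
    """Simulated days per wall-clock day."""
    return simulated_days / (wall_time_s / SECONDS_PER_DAY)

def sypd(simulated_days: float, wall_time_s: float) -> float:
    """Simulated years per wall-clock day."""
    return sdpd(simulated_days, wall_time_s) / 365.0

# Hypothetical run: a 5-day simulation that takes 2 hours of wall time.
print(f"SDPD = {sdpd(5.0, 7200.0):.1f}, SYPD = {sypd(5.0, 7200.0):.3f}")
```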
We should also start measuring ourselves against the SCU17/18 Milan nodes and their 128 cores.
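
For the node-to-node angle against that Milan reference, a sketch of a comparison that normalizes gridpoint throughput by node count so the comparison stays fair; the grid size, step count, node counts and timings below are all hypothetical:

```python
# Minimal sketch of a node-normalized comparison against a Milan reference:
# gridpoint-updates per second per node, and how many reference nodes one
# candidate node replaces. All numbers are hypothetical placeholders.

def gridpoints_per_second_per_node(n_gridpoints: int, n_steps: int,
                                   wall_time_s: float, n_nodes: int) -> float:
    """Gridpoint-updates per wall-clock second, normalized per node."""
    return n_gridpoints * n_steps / wall_time_s / n_nodes

milan = gridpoints_per_second_per_node(100_000_000, 960, 5400.0, n_nodes=16)
gpu = gridpoints_per_second_per_node(100_000_000, 960, 3600.0, n_nodes=4)
print(f"One GPU node ~ {gpu / milan:.1f} Milan nodes on this configuration")
```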
Electricity consumption and price are also previous metrics we should carry forward.
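
A sketch of how energy and cost could be reported per simulated day, so they line up with SDPD; the node power draw and electricity price are placeholder values:

```python
# Minimal sketch of the energy and cost metrics, expressed per simulated day.
# Node power draw and $/kWh are placeholders, not measured values.

def kwh_per_simulated_day(node_power_kw: float, n_nodes: int,
                          wall_time_s: float, simulated_days: float) -> float:
    """Energy (kWh) spent per simulated day."""
    total_kwh = node_power_kw * n_nodes * wall_time_s / 3600.0
    return total_kwh / simulated_days

# Hypothetical run: 8 nodes drawing 0.7 kW each, 2 h wall time, 5 simulated days.
kwh = kwh_per_simulated_day(0.7, 8, 7200.0, 5.0)
print(f"{kwh:.2f} kWh per simulated day, ${kwh * 0.12:.2f} at $0.12/kWh")
```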
Another angle is the scaling and operational usefulness of each hardware platform, so that the narrative to the scientists is clear.
This process should involve the GMAO but remain led by us, so as to make sure we can deliver.
Overall, pragmatism is key: we are not here to give roofline projections and peak FLOPS; we are here to deliver day-to-day usage.