GEOS-DEV / GEOS

GEOS Simulation Framework
GNU Lesser General Public License v2.1

[EPIC] ErrorManager #2940

Open MelReyCG opened 9 months ago

MelReyCG commented 9 months ago

Main Goal

The goal of this EPIC is to add a component to GEOS that centralizes and manages errors (and exceptions), provides structured error data, produces clear and comprehensive error output suitable for everyone (users and developers), and defines a policy regarding errors and exceptions.

Issues in this EPIC


The error data structure can contain:


The goal here is to solve this classic problem: _Suppose GEOS runs on 2048 ranks and rank 407 throws an error because of a local issue. Ranks 203, 358, 1017 and 1502 then throw another error because of ghosting cells, and all the other ranks emit MPI_ABORT errors. In this situation, we can only hope that everything is written to the log in that order, but this is not guaranteed._

The solution I would like to propose is to process the error data files in one of two ways: a) if possible, when a crash occurs, rank 0 collects and factorizes the error data files from the other ranks and writes the result to stdout; b) after the complete GEOS shutdown, by launching geos or a dedicated executable / script on the folder of generated error data files. Because of HPC considerations, method a) could be opt-in, enabled by a command-line parameter.
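To make the "collect & factorize" step more concrete, here is a minimal Python sketch of what a post-shutdown script (method b) might do once the per-rank error data has been loaded. Everything in it is an assumption for illustration: the `ErrorRecord` fields, the function names, and the choice to list errors reported by the fewest ranks first are hypothetical and not part of GEOS. Reading the error data files is left out because their format is not defined here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorRecord:
    """One error reported by one rank (hypothetical fields, not the GEOS format)."""
    rank: int
    kind: str      # e.g. "error" or "mpi_abort"
    message: str


def compress_ranks(ranks):
    """Render a non-empty sorted list of ranks compactly, e.g. [0, 1, 2, 5] -> '0-2, 5'."""
    parts = []
    start = prev = ranks[0]
    for r in ranks[1:]:
        if r == prev + 1:
            prev = r
            continue
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = r
    parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ", ".join(parts)


def factorize(records):
    """Group identical errors and list the ranks that reported each one."""
    ranks_by_error = defaultdict(set)
    for rec in records:
        ranks_by_error[(rec.kind, rec.message)].add(rec.rank)
    # Assumption: errors seen on the fewest ranks are listed first, since a
    # local issue is more likely to be informative than a widespread abort.
    ordered = sorted(
        ranks_by_error.items(),
        key=lambda item: (len(item[1]), min(item[1])),
    )
    return [(kind, message, sorted(ranks)) for (kind, message), ranks in ordered]


def format_report(groups):
    """Produce one line per distinct error, with its ranks compressed into ranges."""
    return "\n".join(
        f"[{kind}] {message} (ranks: {compress_ranks(ranks)})"
        for kind, message, ranks in groups
    )


if __name__ == "__main__":
    # The 2048-rank scenario described above, with made-up messages.
    ghost = {203, 358, 1017, 1502}
    records = [ErrorRecord(407, "error", "local issue")]
    records += [ErrorRecord(r, "error", "ghosting issue") for r in sorted(ghost)]
    records += [
        ErrorRecord(r, "mpi_abort", "MPI_ABORT")
        for r in range(2048)
        if r != 407 and r not in ghost
    ]
    print(format_report(factorize(records)))
```

With this kind of grouping, the report has three entries instead of 2048 interleaved messages, and its order no longer depends on which rank happened to write to the log first. The same grouping could in principle run on rank 0 for method a), if the records can still be gathered at crash time.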




The following practices are banned:

jeannepellerin commented 8 months ago

@rrsettgast