Main Goal

The goal of this EPIC is to add a component to GEOS that centralizes and manages errors (and exceptions), provides structured error data, produces clear & comprehensive error outputs suitable for everyone (users / devs), and defines a policy regarding errors and exceptions.
Issues in this EPIC
[ ] 1. Complete the error unit tests (which must test every type of error GEOS can encounter; a test sketch follows this list):
Numeric errors, memory overflows, I/O errors,
All exception types in use in GEOS,
std errors (the user-reported "map::at" error, std::vector resizing...),
Unknown exceptions,
MPI errors,
Exceptions / errors raised while catching an exception,
Unexpected program exits (we should at least have the stack trace),
...
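As a starting point, here is a minimal sketch of what a couple of these tests could look like, assuming GTest; the `checkedDivide` helper is hypothetical and only stands in for a GEOS numeric kernel:

```cpp
#include <gtest/gtest.h>
#include <stdexcept>
#include <map>
#include <vector>
#include <string>

// Hypothetical helper standing in for a GEOS numeric kernel:
// reports a numeric error through an exception instead of returning NaN/inf.
double checkedDivide( double num, double den )
{
  if( den == 0.0 )
    throw std::domain_error( "division by zero" );
  return num / den;
}

TEST( ErrorsTest, stdErrors )
{
  std::map< std::string, int > m;
  // The user-reported "map::at" failure must surface as std::out_of_range.
  EXPECT_THROW( m.at( "missing" ), std::out_of_range );

  std::vector< int > v;
  // An impossible resize must surface as std::bad_alloc (or std::length_error).
  EXPECT_ANY_THROW( v.resize( v.max_size() ) );
}

TEST( ErrorsTest, numericErrors )
{
  EXPECT_THROW( checkedDivide( 1.0, 0.0 ), std::domain_error );
}
```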
[ ] 2. Create the ErrorManager class, which:
Provides a centralized point to throw and manage the GEOS errors / exceptions,
Is based on structured error data rather than plain text only,
Must be reliable,
Produces clear console outputs (clear rather than exhaustive, adapted to the user type),
Produces a generated error data file...
that contains all error data (JSON format? One per rank, grouped in a subfolder?),
as well as, at the end, a comprehensive history of all log messages (even those below the selected LogLevel, but excluding the most verbose logs to preserve program performance),
Has only GEOS_HOST methods, to ensure that only CPUs can throw / manage errors.
The error data structure can contain (see the sketch after this list):
Error message,
Timestamp,
Location in the code, stack trace,
Group / Wrapper that sent the message, if applicable (name + XML location / path in hierarchy),
Time-step info if applicable (count, minDt, maxDt), substeps dt list, convergence step and converged attribute,
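A minimal sketch of the kind of structured record and host-only interface this could translate to; all names and fields below are illustrative assumptions, not a final design:

```cpp
#include <string>
#include <vector>
#include <chrono>
#include <optional>

// Hypothetical structured error record; field names are illustrative only.
struct ErrorData
{
  std::string message;                            // human-readable error message
  std::chrono::system_clock::time_point timestamp;
  std::string codeLocation;                       // file:line where the error was raised
  std::vector< std::string > stackTrace;          // one frame per entry
  std::optional< std::string > senderGroup;       // Group / Wrapper name, if applicable
  std::optional< std::string > xmlPath;           // XML location / path in the hierarchy
  int mpiRank = -1;                               // rank that produced the record
};

// Only host-side (GEOS_HOST) methods, so errors can only be thrown / managed from the CPU.
class ErrorManager
{
public:
  // Records the error, renders it to the console according to the user type.
  void report( ErrorData data );

  // Flushes the structured records for this rank (e.g. one JSON file per rank
  // in a subfolder) at shutdown or on crash.
  void flushToFile( std::string const & folder );
};
```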
[ ] 3. Factorize errors that come from multiple ranks, either synchronously or by post-processing the generated error data files.
The goal here is to solve this classic problem: _Let's consider GEOS running on 2048 ranks, where rank 407 threw an error because of a local issue. Then ranks 203, 358, 1017 and 1502 threw another error because of ghost cells, and all the other ranks emitted MPI_ABORT errors. In this situation, we can only hope that everything is output in that order in the log, but it is not guaranteed._
The solution I would like to propose is to process the error data files either:
a) If possible, when a crash occurs, rank 0 collects & factorizes the error data files from the other ranks and outputs the result to stdout,
b) After the complete GEOS shutdown, by launching geos or a dedicated executable / script on the generated error data files folder.
Because of HPC considerations, method a) could be enabled by a command-line parameter.
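For option b), here is a minimal post-processing sketch, assuming one error file per rank in a folder (the `rank_XXXX` naming and one-message-per-file layout are assumptions); grouping is naively by identical file content:

```cpp
#include <filesystem>
#include <fstream>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical post-processor: read every per-rank error file and factorize
// identical messages so "2043 ranks sent MPI_ABORT" prints once, not 2043 times.
int main( int argc, char * argv[] )
{
  fs::path const folder = argc > 1 ? argv[1] : "error_data";

  // message -> ranks that reported it (rank taken from the file name, e.g. "rank_0407.json").
  std::map< std::string, std::vector< std::string > > byMessage;

  for( auto const & entry : fs::directory_iterator( folder ) )
  {
    std::ifstream file( entry.path() );
    std::stringstream content;
    content << file.rdbuf();
    byMessage[ content.str() ].push_back( entry.path().stem().string() );
  }

  for( auto const & [ message, ranks ] : byMessage )
  {
    std::cout << ranks.size() << " rank(s) (" << ranks.front()
              << ( ranks.size() > 1 ? ", ..." : "" ) << ") reported:\n"
              << message << "\n";
  }
  return 0;
}
```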
[ ] 4. Properly manage TPL errors (by: 1. adding a human explanation of what GEOS was trying to do, and 2. if possible, mentioning why the calls are failing):
GEOS_LAI_CHECK_ERROR() macro failures,
GEOS_PARMETIS_CHECK() / GEOS_SCOTCH_CHECK() macro failures,
CUDA errors (GEOS_HYPRE_CHECK_DEVICE_ERRORS(), cudaGetLastError())
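A hedged sketch of what such contextual wrapping could look like for CUDA calls; cudaGetLastError() / cudaGetErrorString() are the real CUDA runtime API, while the wrapper itself and its wording are assumptions:

```cpp
#include <cuda_runtime.h>
#include <stdexcept>
#include <string>

// Hypothetical wrapper: checks the last CUDA error and, on failure, prepends
// a human explanation of what GEOS was trying to do (point 1 above) and
// appends the driver's own reason for the failure (point 2 above).
inline void checkCudaErrors( std::string const & whatGeosWasDoing )
{
  cudaError_t const err = cudaGetLastError();
  if( err != cudaSuccess )
  {
    throw std::runtime_error( "While " + whatGeosWasDoing + ": CUDA reported '"
                              + cudaGetErrorString( err ) + "'." );
  }
}

// Usage sketch:
//   launchDeviceKernel(...);   // some device work (hypothetical)
//   checkCudaErrors( "assembling the Hypre preconditioner on device" );
```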
[ ] 5. All errors from the unit tests must be properly interfaced with Python / pygeos.
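For instance, with pybind11 (whether the GEOS Python bindings use it is an assumption here), a C++ exception type can be exposed as a catchable Python exception; geos::InputError and the module name are hypothetical:

```cpp
#include <pybind11/pybind11.h>
#include <stdexcept>

namespace py = pybind11;

// Hypothetical GEOS exception type used for the sketch.
namespace geos
{
struct InputError : std::runtime_error { using std::runtime_error::runtime_error; };
}

PYBIND11_MODULE( pygeos_errors, m )
{
  // Maps C++ geos::InputError to a Python exception pygeos_errors.InputError,
  // so Python callers can write: `except pygeos_errors.InputError: ...`
  py::register_exception< geos::InputError >( m, "InputError" );

  // A throwing function to demonstrate the translation.
  m.def( "fail", []() { throw geos::InputError( "bad XML attribute" ); } );
}
```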
[ ] 6.1. Add a section in the documentation describing "How to generate an error / an exception", and what is acceptable and what is not in the GEOS code.
The following practices are banned:
Recovering from an exception. Exceptions can only be caught by a higher function in the call stack, to add more information to them (and potentially stack exceptions).
Throwing any error / exception or writing any log from a GEOS_HOST_DEVICE context.
If any code can run on GPU, the error / warning state should be reported to the CPU. For instance, if a variable should trigger an error when negative, the good practice is to collect its minimal value with RAJA::ReduceMin and read it from the host context to write a proper contextualized message (see the sketch after this list).
Because of the memory impact, any call to CUDA printf() is banned.
... (don't hesitate to suggest more)
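To illustrate the RAJA::ReduceMin practice above, a minimal sketch; RAJA::ReduceMin with its min()/get() methods is the real RAJA API, while the policies are kept sequential so the sketch compiles without a GPU, and the final error report is a placeholder for the future centralized mechanism:

```cpp
#include <RAJA/RAJA.hpp>
#include <limits>
#include <stdexcept>
#include <string>

// Device-capable kernels must not throw or print; instead, reduce the
// offending value and let the host decide whether to raise the error.
void checkPositive( double const * values, int n )
{
  RAJA::ReduceMin< RAJA::seq_reduce, double > minVal( std::numeric_limits< double >::max() );

  // On GPU builds, seq_exec / seq_reduce would become cuda_exec / cuda_reduce
  // and the lambda a device lambda; the pattern stays the same.
  RAJA::forall< RAJA::seq_exec >( RAJA::RangeSegment( 0, n ), [=]( int i )
  {
    minVal.min( values[i] );
  } );

  // Host-side check: a single, properly contextualized error message.
  if( minVal.get() < 0.0 )
  {
    throw std::runtime_error( "Negative value detected (min = "
                              + std::to_string( minVal.get() ) + ")." );
  }
}
```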
[ ] 6.2. Ensure that the errors / exceptions practices are in place in GEOS.
Search the code for places where warnings could be used rather than plain logs,
Remove any possibility of emitting an error / a log from the GPU (too cache-heavy),