Main Goal

The goal of this EPIC is to add a component to GEOS that centralizes and manages errors (and exceptions), provides structured error data, produces clear & comprehensive error outputs suitable for everyone (users / devs), and defines a policy regarding errors and exceptions.
Issues in this EPIC
[ ] 1. Complete the error unit tests (which must test every type of error GEOS can encounter; a test sketch follows this list):
Numeric errors, memory overflows, I/O errors,
All exception types in use in GEOS,
std errors (the user-reported "map::at" error, std::vector resizing...),
Unknown exceptions,
MPI errors,
Exceptions / errors raised while catching an exception,
Unexpected program exits (we should at least have the stack trace),
...
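As a starting point, here is a minimal sketch of what a couple of these tests could look like, assuming GTest; the `checkedDivide` helper is hypothetical and only stands in for a GEOS numeric kernel:

```cpp
#include <gtest/gtest.h>
#include <stdexcept>
#include <map>
#include <vector>
#include <string>

// Hypothetical helper standing in for a GEOS numeric kernel:
// reports a numeric error through an exception instead of returning NaN/inf.
double checkedDivide( double num, double den )
{
  if( den == 0.0 )
    throw std::domain_error( "division by zero" );
  return num / den;
}

TEST( ErrorsTest, stdErrors )
{
  std::map< std::string, int > m;
  // The user-reported "map::at" failure must surface as std::out_of_range.
  EXPECT_THROW( m.at( "missing" ), std::out_of_range );

  std::vector< int > v;
  // An impossible resize must surface as std::bad_alloc (or std::length_error).
  EXPECT_ANY_THROW( v.resize( v.max_size() ) );
}

TEST( ErrorsTest, numericErrors )
{
  EXPECT_THROW( checkedDivide( 1.0, 0.0 ), std::domain_error );
}
```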
[ ] 2. Create the ErrorManager class, which:
Provides a centralized point to throw and manage the GEOS errors / exceptions,
Is based on structured error data rather than plain text only,
Must be reliable,
Produces clear console outputs (clear rather than exhaustive, adapted to the user type),
Produces a generated error data file...
that contains all error data (JSON format? One per rank, grouped in a subfolder?),
as well as, at the end, a comprehensive history of all log messages (even those below the selected LogLevel, but excluding the most verbose logs to preserve program performance),
Has only GEOS_HOST methods, to ensure that only CPUs can throw / manage errors.
The error data structure can contain (see the sketch after this list):
Error message,
Timestamp,
Location in the code, stack trace,
Group / Wrapper that sent the message, if applicable (name + XML location / path in hierarchy),
Time-step info if applicable (count, minDt, maxDt), substeps dt list, convergence step and converged attribute,
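A minimal sketch of the kind of structured record and host-only interface this could translate to; all names and fields below are illustrative assumptions, not a final design:

```cpp
#include <string>
#include <vector>
#include <chrono>
#include <optional>

// Hypothetical structured error record; field names are illustrative only.
struct ErrorData
{
  std::string message;                            // human-readable error message
  std::chrono::system_clock::time_point timestamp;
  std::string codeLocation;                       // file:line where the error was raised
  std::vector< std::string > stackTrace;          // one frame per entry
  std::optional< std::string > senderGroup;       // Group / Wrapper name, if applicable
  std::optional< std::string > xmlPath;           // XML location / path in the hierarchy
  int mpiRank = -1;                               // rank that produced the record
};

// Only host-side (GEOS_HOST) methods, so errors can only be thrown / managed from the CPU.
class ErrorManager
{
public:
  // Records the error, renders it to the console according to the user type.
  void report( ErrorData data );

  // Flushes the structured records for this rank (e.g. one JSON file per rank
  // in a subfolder) at shutdown or on crash.
  void flushToFile( std::string const & folder );
};
```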
[ ] 3. Factorize errors that come from multiple ranks, either synchronously or by post-processing the generated error data files.
The goal here is to solve this classic problem: _Let's consider GEOS running on 2048 ranks, where rank 407 threw an error because of a local issue. Then ranks 203, 358, 1017 and 1502 threw another error because of ghost cells, and all the other ranks emitted MPI_ABORT errors. In this situation, we can only hope that everything is output in that order in the log, but it is not guaranteed._
The solution I would like to propose is to process the error data files either:
a) If possible, when a crash occurs, rank 0 collects & factorizes the error data files from the other ranks and outputs the result to stdout,
b) After the complete GEOS shutdown, by launching geos or a dedicated executable / script on the generated error data files folder.
Because of HPC considerations, method a) could be enabled by a command-line parameter.
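For option b), here is a minimal post-processing sketch, assuming one error file per rank in a folder (the `rank_XXXX` naming and one-message-per-file layout are assumptions); grouping is naively by identical file content:

```cpp
#include <filesystem>
#include <fstream>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical post-processor: read every per-rank error file and factorize
// identical messages so "2043 ranks sent MPI_ABORT" prints once, not 2043 times.
int main( int argc, char * argv[] )
{
  fs::path const folder = argc > 1 ? argv[1] : "error_data";

  // message -> ranks that reported it (rank taken from the file name, e.g. "rank_0407.json").
  std::map< std::string, std::vector< std::string > > byMessage;

  for( auto const & entry : fs::directory_iterator( folder ) )
  {
    std::ifstream file( entry.path() );
    std::stringstream content;
    content << file.rdbuf();
    byMessage[ content.str() ].push_back( entry.path().stem().string() );
  }

  for( auto const & [ message, ranks ] : byMessage )
  {
    std::cout << ranks.size() << " rank(s) (" << ranks.front()
              << ( ranks.size() > 1 ? ", ..." : "" ) << ") reported:\n"
              << message << "\n";
  }
  return 0;
}
```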
[ ] 4. Properly manage TPL errors (by: 1. adding a human explanation of what GEOS was trying to do, and 2. if possible, mentioning why the calls are failing):
GEOS_LAI_CHECK_ERROR() macro failures,
GEOS_PARMETIS_CHECK() / GEOS_SCOTCH_CHECK() macro failures,
CUDA errors (GEOS_HYPRE_CHECK_DEVICE_ERRORS(), cudaGetLastError())
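A hedged sketch of what such contextual wrapping could look like for CUDA calls; cudaGetLastError() / cudaGetErrorString() are the real CUDA runtime API, while the wrapper itself and its wording are assumptions:

```cpp
#include <cuda_runtime.h>
#include <stdexcept>
#include <string>

// Hypothetical wrapper: checks the last CUDA error and, on failure, prepends
// a human explanation of what GEOS was trying to do (point 1 above) and
// appends the driver's own reason for the failure (point 2 above).
inline void checkCudaErrors( std::string const & whatGeosWasDoing )
{
  cudaError_t const err = cudaGetLastError();
  if( err != cudaSuccess )
  {
    throw std::runtime_error( "While " + whatGeosWasDoing + ": CUDA reported '"
                              + cudaGetErrorString( err ) + "'." );
  }
}

// Usage sketch:
//   launchDeviceKernel(...);   // some device work (hypothetical)
//   checkCudaErrors( "assembling the Hypre preconditioner on device" );
```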
[ ] 5. All errors from the unit tests must be properly interfaced with Python / pygeos.
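For instance, with pybind11 (whether the GEOS Python bindings use it is an assumption here), a C++ exception type can be exposed as a catchable Python exception; geos::InputError and the module name are hypothetical:

```cpp
#include <pybind11/pybind11.h>
#include <stdexcept>

namespace py = pybind11;

// Hypothetical GEOS exception type used for the sketch.
namespace geos
{
struct InputError : std::runtime_error { using std::runtime_error::runtime_error; };
}

PYBIND11_MODULE( pygeos_errors, m )
{
  // Maps C++ geos::InputError to a Python exception pygeos_errors.InputError,
  // so Python callers can write: `except pygeos_errors.InputError: ...`
  py::register_exception< geos::InputError >( m, "InputError" );

  // A throwing function to demonstrate the translation.
  m.def( "fail", []() { throw geos::InputError( "bad XML attribute" ); } );
}
```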
[ ] 6.1. Add a section in the documentation describing "How to generate an error / an exception", and what is acceptable and what is not in the GEOS code.
The following practices are banned:
Recovering from an exception. Exceptions can only be caught by a higher function in the call stack, to add more information to them (and potentially stack exceptions).
Throwing any error / exception or writing any log from a GEOS_HOST_DEVICE context.
If any code can run on GPU, the error / warning state should be reported to the CPU. For instance, if a variable should trigger an error when negative, the good practice is to collect its minimal value with RAJA::ReduceMin and read it from the host context to write a proper contextualized message (see the sketch after this list).
Because of the memory impact, any call to CUDA printf() is banned.
... (don't hesitate to suggest more)
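To illustrate the RAJA::ReduceMin practice above, a minimal sketch; RAJA::ReduceMin with its min()/get() methods is the real RAJA API, while the policies are kept sequential so the sketch compiles without a GPU, and the final error report is a placeholder for the future centralized mechanism:

```cpp
#include <RAJA/RAJA.hpp>
#include <limits>
#include <stdexcept>
#include <string>

// Device-capable kernels must not throw or print; instead, reduce the
// offending value and let the host decide whether to raise the error.
void checkPositive( double const * values, int n )
{
  RAJA::ReduceMin< RAJA::seq_reduce, double > minVal( std::numeric_limits< double >::max() );

  // On GPU builds, seq_exec / seq_reduce would become cuda_exec / cuda_reduce
  // and the lambda a device lambda; the pattern stays the same.
  RAJA::forall< RAJA::seq_exec >( RAJA::RangeSegment( 0, n ), [=]( int i )
  {
    minVal.min( values[i] );
  } );

  // Host-side check: a single, properly contextualized error message.
  if( minVal.get() < 0.0 )
  {
    throw std::runtime_error( "Negative value detected (min = "
                              + std::to_string( minVal.get() ) + ")." );
  }
}
```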
[ ] 6.2. Ensure that the errors / exceptions practices are in place in GEOS.
Search the code for places where warnings could be used rather than plain logs,
Remove any possibility of emitting an error / a log from the GPU (too cache-heavy),