Adds design document for an Omega error handler

E3SM-Project / Omega

Next generation ocean model within E3SM

https://docs.e3sm.org/Omega/omega

Other

4 stars 5 forks source link

Adds design document for an Omega error handler #145

Closed philipwjones closed 1 month ago

philipwjones commented 1 month ago

Design for an error handler that builds on the existing Log functionality. Comments and suggestions welcome.

Checklist

[x] Documentation:
- [x] Design document has been generated and added to the docs
- [x] Documentation has been built locally and changes look as expected

cbegeman commented 1 month ago

Looks reasonable to me. Thanks @philipwjones!

vanroekel commented 1 month ago

Looks great to me too, thanks @philipwjones!

xylar commented 1 month ago

Sorry for the delay. I will review this later today.

mark-petersen commented 1 month ago

@philipwjones, in MPAS we had a clunky system of passing err arguments back from subroutines that we never really used, but that we maintained at lower levels. I would like to minimize the burden on the programmers as we add more functions. I would suggest a desired requirement "minimize error handling function arguments" to the extent possible. Do you foresee that we will need to pass an error flag all the way up the call chain in order to terminate cleanly? Will we need to record the call stack manually, or can that be automated?

philipwjones commented 1 month ago

@mark-petersen The critical error prints a stack trace which reduces the need to return error codes and pass the error up the call chain. There will still be some cases where we need to use a Warn/Error and pass a code back to the calling routine. For example, if a Config entry is missing, we want to warn and return an error code so the calling routine can either set a default, use alternative choices or die with a critical error. But most of our current errors are really critical errors and will not require a return code. A "clean" termination is relative - neither passing a return code nor aborting can be completely clean in a parallel application if only a subset of tasks encounters the error (unless you add an implicit barrier to check the code across all MPI tasks before failing). The critical error here just does an MPI_Abort and writes to a task-specific error log.

xylar commented 1 month ago

@mwarusz, your suggestion above looks like an assert if I understand correctly. That seems like a good capability to include. Would you have a failure result in a critical error?

mwarusz commented 1 month ago

@xylar Yes, this is meant to be some form of assert. However, in C++ asserts are typically disabled in Release mode, which motivated my second question. I think it should result in a critical error.

philipwjones commented 1 month ago

@mwarusz Thanks for the comments. Yes, I had considered an interface in which the condition was an argument so we didn't need the explicit conditional. There are some tricky bits about expanding something with an expression and still keeping the line number correct in the message/trace, but it can be done. I can add that to the doc.

philipwjones commented 1 month ago

@mwarusz I just pushed a change to add both a REQUIRE and ASSERT interface. Feel free to suggest further changes or approve if this meets your needs.