BerkeleyLab / caffeine

A parallel runtime library for Fortran compilers
https://berkeleylab.github.io/caffeine/

Implement graceful error handling for allocation failures #88

Closed bonachea closed 2 months ago

bonachea commented 6 months ago

In the near term, Caffeine will likely treat most errors as immediately fatal (ideally with a high-quality message as part of the crash).

However, one particularly important error that IMO should not get this treatment is memory allocation failure. Unlike hardware failure, out-of-memory is a common condition when scaling problems in real production science, and it needs to be handled robustly by a production-quality runtime. It's even plausible that some applications might perform non-trivial recovery from allocation failure.

prif_allocate and prif_allocate_non_symmetric currently ignore the possibility of errors, and I suspect they crash in obscure ways upon memory exhaustion. IMO these two calls should be fixed to strictly adhere to Fortran error-handling semantics, specifically by returning meaningful stat and errmsg values (when provided) or crashing with a useful console message (when not provided). Ideally the error message in either case should include status information about the initial and current state of the shared heaps, along with recommendations to the end user about how to resolve the problem.
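To make the requested behavior concrete, here is a minimal C sketch of the stat/errmsg pattern described above. It is not Caffeine's implementation or the PRIF interface: `sketch_heap`, `sketch_allocate`, and the status codes are hypothetical names invented for illustration, `malloc` stands in for the real shared-heap allocator, and the message uses a NUL-terminated C string rather than a blank-padded Fortran character variable. The advice text in the message is a placeholder, not actual Caffeine guidance.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical status codes; not values defined by PRIF or Caffeine. */
enum { SKETCH_STAT_OK = 0, SKETCH_STAT_ALLOC_FAILED = 1 };

/* Hypothetical bookkeeping for a fixed-capacity heap. */
typedef struct {
    size_t initial_bytes; /* capacity when the heap was created */
    size_t used_bytes;    /* bytes currently handed out */
} sketch_heap;

/*
 * Try to reserve `size` bytes from `heap`.
 * On success: returns the memory, sets *stat to OK (if stat was provided),
 * and leaves errmsg untouched.
 * On failure with stat provided: sets *stat to a nonzero code, fills errmsg
 * (if provided), and returns NULL so the caller can attempt recovery.
 * On failure without stat: prints the message to stderr and terminates.
 */
void *sketch_allocate(sketch_heap *heap, size_t size,
                      int *stat, char *errmsg, size_t errmsg_len)
{
    size_t available = heap->initial_bytes - heap->used_bytes;
    void *p = NULL;

    if (size <= available) {
        p = malloc(size ? size : 1); /* stand-in for the real heap allocator */
    }
    if (p != NULL) {
        heap->used_bytes += size;
        if (stat != NULL) *stat = SKETCH_STAT_OK;
        return p;
    }

    /* Report initial and current heap state plus a placeholder suggestion. */
    char msg[256];
    snprintf(msg, sizeof msg,
             "allocation of %zu bytes failed: heap had %zu bytes initially, "
             "%zu in use, %zu available; consider enlarging the heap "
             "or spreading the problem over more images",
             size, heap->initial_bytes, heap->used_bytes, available);

    if (stat == NULL) {
        /* No stat provided: fatal, but with a useful console message. */
        fprintf(stderr, "%s\n", msg);
        exit(EXIT_FAILURE);
    }

    *stat = SKETCH_STAT_ALLOC_FAILED;
    if (errmsg != NULL && errmsg_len > 0) {
        snprintf(errmsg, errmsg_len, "%s", msg); /* truncates if too long */
    }
    return NULL;
}
```

A caller that passes a non-NULL `stat` gets control back after a failed request and can retry with a smaller size or shut down cleanly. A caller that passes NULL for `stat` gets error termination, which mirrors how Fortran treats an ALLOCATE statement without a `stat=` specifier.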

bonachea commented 2 months ago

Duplicate of #128.