esmf-org / esmf

The Earth System Modeling Framework (ESMF) is a suite of software tools for developing high-performance, multi-component Earth science modeling applications.
https://earthsystemmodeling.org/

Hanging execution with ESMF managed threading #89

Open theurich opened 1 year ago

theurich commented 1 year ago

There seems to be a subtle issue that leads to hangs when executing with ESMF-managed threading. It can be reproduced with the ESMX_AtmOcnProto from https://github.com/esmf-org/nuopc-app-prototypes/tree/reproducer/hang-threading.

The issue is not ESMX-specific or even NUOPC-related; it is just easy to build a simple reproducer this way.

The hanging occurs reliably on our Darwin test machines. However, it is not only a Darwin issue! It has been observed on Chianti, which is Linux.

The original ticket description:

The ESMX NUOPC app protos have some issues on Catania/gfortran/openmpi. The issue seemed connected to ESMF-managed threading. However, ESMF-managed threading is also tested in other protos, and they did not have problems, so this seems a bit strange.

For now the work-around was to turn off the OpenMP threading for the ESMX prototypes under the NUOPC app branch fix/catania.

It would be good to figure out the problem and not need special handling for Catania for ESMF 8.4.0.

billsacks commented 1 year ago

The same issue appears on my Mac, green, with gfortranclang. I will switch to the fix/catania branch for nuopc testing there as well.

theurich commented 1 year ago

I did some digging on this on Catania (gfortran_11.2.0_openmpi_g_develop). It turns out that this issue is not specific to ESMX. It has nothing to do with the ESMX part! I can also trigger the issue under, e.g., AtmOcnPetListProto when I change the nuopc.configure there.

The significant ingredients for this to hang under Darwin are these:

When the above conditions are met, the OCN -> ATM Connector hangs during its Run phase. Notice, however, that it gets through all of the initialization phases without hanging!

theurich commented 1 year ago

We also know of a non-Darwin case now where this is hanging: Chianti, which is Linux.