COSIMA / matm

A data-driven atmosphere model. Uses OASIS coupler to deliver CORE2 fields to other models.
1 stars 1 forks source link

How best to implement monitoring code? #9

Open russfiedler opened 6 years ago

russfiedler commented 6 years ago

After this morning's meeting I thought I'd have a look at what would be the cleanest way to hook the dodgy_monitor module in to MATM.

The at first obvious spot is in the main matm module. I think we want a timing block prior to the main loop (around init_cpl) , another within the the main loop and one at the end.

Note that the initial call must be after MPI_INIT is called and the last before MPI_FINALIZE. This represents a problem since putting the initial and final blocks in the main program means that we might miss out blocks hanging in prism_init and we can't time the cleanup at all.

The second option is to put all the stuff in the cpl_interfaces module in cpl_init, into_cpl and coupler_termination. It hides this away and theres a whole lot of other MPI stuff in that module. It just feels cleaner.

I reckon 3 independent timing parameters are required i.e. the max number of seconds allowed for a block. There also needs to be a logical as to whether an abort is called or just a warning and maybe a logical to bypass the monitoring all together.

Another thing that could be implemented is just to report timing stats so the user can make sensible choices.

Oh yeah, I wonder if it's worth ordering the mpirun command so that matm is last and the spawned monitoring process is on the same node as MATM or is something funky with hostfiles being done in payu?

Should the monitoring stuff be brought into the MATM repository proper. I reckon so. The names of the dodgy modules will have to be changed of course :(

pinging @nicjhan @aidanheerdegen @marshallward @aekiss

nichannah commented 6 years ago

Thanks for this analysis @russfiedler. I am in the process of doing a little clean up of MATM for safer IAF support and date handling. Currently writing tests. I can put this into the cleaned version. Alternatively we could put it in the current version and then move it across.

russfiedler commented 6 years ago

I think adding it to update is the way to go @nicjhan. Give me a yell when you are ready to start testing or if you think we can implement things in a different way. I think the main issue is whether we want to spawn the monitoring process from MATM like I have or whether it should be a part of the main group. I just wasn't sure if there were any issues with MPI_COMM_WORLD communications so I decideed to use an intercommunicator like this.

nichannah commented 6 years ago

Hi @russfiedler I agree with the idea of putting this within the coupler code/library. I imagined that the monitor would receive a hearbeat from, say, MATM via MPI. Given that we are using an uneven number of PEs I don't think adding another one is a problem.

A possible problem here is that AFAIK all PEs need to call oasis_init_comp(). However I'd say that the likelihood of a hang here is low... it mostly happens after the variables/grids have been setup and Oasis is doing setup.

russfiedler commented 6 years ago

@nicjhan We can spawn the extra process and set up a timing block around around oasis_init_comp() if need be. That just means setting up the timing from within prism_init() rather than init_cpl().

As an aside, are we going to move from the 'prismproto' naming convention to the more official 'oasis' style? It's much nicer but is mostly cosmetic (I think there's 1 routine that has an extra optional argument) .

nichannah commented 6 years ago

@russfiedler re the oasis naming convention: yes I've moved to the oasis_ style. When/if you have time can you please take a look at my branch: https://github.com/OceansAus/matm/tree/8-date-based-indexing

I'd be very grateful for any feedback. ping @aidanheerdegen @marshallward

P.S. the current coupling setup between atm, ice_stub and ocean_stub is intended to mirror almost exactly the structure of access-om2. If we find problems with the coupling in this test case it's possible that it exists in access-om2 as well.