COSIMA / cice5

Clone of The Los Alamos sea ice model (CICE) with ACCESS drivers. See https://github.com/CICE-Consortium/CICE-svn-trunk/tree/cice-5.1.2
4 stars 13 forks source link

libaccessom2 intercommunicator profiling support #17

Closed marshallward closed 6 years ago

marshallward commented 6 years ago

This patch modifies CICE to support changes to libaccessom2 which allow the user to disable intercommunicators, instead using MPI_COMM_WORLD to exchange messages between models. This work is primarily to support limitations in the Score-P profiler.

Specific changed detailed below.

Further changes to cpl_interface are possible, since most of the MPI information is provided by the coupler object.

marshallward commented 6 years ago

Thanks Nic Just a warning that this won't build unless the libaccessom2 are also merged.

nichannah commented 6 years ago

Yep. Did you see any problems with it? I'm getting the following:

forrtl: error (73): floating divide by zero Image PC Routine Line Source cice_auscom_1440x 0000000000924461 Unknown Unknown Unknown cice_auscom_1440x 000000000092259B Unknown Unknown Unknown cice_auscom_1440x 00000000008D03B4 Unknown Unknown Unknown cice_auscom_1440x 00000000008D01C6 Unknown Unknown Unknown cice_auscom_1440x 000000000084E7F9 Unknown Unknown Unknown cice_auscom_1440x 00000000008591A9 Unknown Unknown Unknown libpthread-2.12.s 00002B4C56A8D7E0 Unknown Unknown Unknown cice_auscom_1440x 000000000044C657 ice_atmo_mpatmo 352 ice_atmo.f90 cice_auscom_1440x 00000000005E1946 ice_step_mod_mp_s 395 ice_step_mod.f90 cice_auscom_1440x 000000000040FBE2 cice_runmod_mp_ci 329 CICE_RunMod.f90 cice_auscom_1440x 000000000040C847 MAIN 60 CICE.f90 cice_auscom_1440x 000000000040C7DE Unknown Unknown Unknown libc-2.12.so 00002B4C56EBDD1D libc_start_main Unknown Unknown cice_auscom_1440x 000000000040C6E9 Unknown Unknown Unknown

marshallward commented 6 years ago

It's all working for me, but I would not be surprised if there are problems.

You do need to update all the models here, and even the module files need to be updated, it could be using old object files or module headers.

I'll look at the trace though.

marshallward commented 6 years ago

This looks like a very similar bug that we saw ages ago, where one of these variable (I think thva?) was not being initialized. Equation below:

 349         ! compute stability & evaluate all stability functions                  
 350             hol(ij) = vonkar * gravit * zlvl(i,j) &                             
 351                    * (tstar(ij)/thva(ij) &                                      
 352                     + qstar(ij)/(c1/zvir+Qa(i,j))) &                            
 353                    / ustar(ij)**2                                               

Could this be an old problem that re-emerged?

marshallward commented 6 years ago

Going over the old Slack logs, I encountered a similar problem where thva was being set to zero because it wasnt getting data from MATM. Have you also updated the YATM code here? I had to change all of the models, including YATM and MOM, not just CICE and libaccessom2. It may be that the old YATM is not correctly updating Tair (or whatever) inside of CICE.

If you are using my modified YATM and MOM source though, then none of this is relevant.

nichannah commented 6 years ago

My hunch is that it can happen when there's something up with a coupling field or a grid mismatch.

marshallward commented 6 years ago

Which experiment are you running? I can test it out in my environment.

nichannah commented 6 years ago

I was running 025_jra55_ryf, now moving onto the 1deg.

I have a feeling we'll get rid of the intercommunicators now that we have this way of doing things. Do see any reason to hold on to the intercommunicators?

nichannah commented 6 years ago

With 1deg_jra55_ryf I get a different error:

forrtl: error (72): floating overflow Image PC Routine Line Source cice_auscom_360x3 000000000093CDB1 Unknown Unknown Unknown cice_auscom_360x3 000000000093AEEB Unknown Unknown Unknown cice_auscom_360x3 00000000008E8D04 Unknown Unknown Unknown cice_auscom_360x3 00000000008E8B16 Unknown Unknown Unknown cice_auscom_360x3 0000000000867149 Unknown Unknown Unknown cice_auscom_360x3 0000000000871ABC Unknown Unknown Unknown libpthread-2.12.s 00002B99D4A8D7E0 Unknown Unknown Unknown cice_auscom_360x3 00000000005464C4 ice_grid_mp_tlatl 1340 ice_grid.f90 cice_auscom_360x3 000000000054CAEB ice_grid_mpinit 390 ice_grid.f90 cice_auscom_360x3 000000000040EECC cice_initmod_mp_c 147 CICE_InitMod.f90 cice_auscom_360x3 000000000040C83D MAIN 54 CICE.f90 cice_auscom_360x3 000000000040C7DE Unknown Unknown Unknown libc-2.12.so 00002B99D4EBDD1D libc_start_main Unknown Unknown cice_auscom_360x3 000000000040C6E9 Unknown Unknown Unknown

marshallward commented 6 years ago

Have you tried with my changes to YATM? Or is this the unmodified version?

On Sun, 1 Jul 2018, 16:26 Nic Hannah, notifications@github.com wrote:

With 1deg_jra55_ryf I get a different error:

forrtl: error (72): floating overflow Image PC Routine Line Source cice_auscom_360x3 000000000093CDB1 Unknown Unknown Unknown cice_auscom_360x3 000000000093AEEB Unknown Unknown Unknown cice_auscom_360x3 00000000008E8D04 Unknown Unknown Unknown cice_auscom_360x3 00000000008E8B16 Unknown Unknown Unknown cice_auscom_360x3 0000000000867149 Unknown Unknown Unknown cice_auscom_360x3 0000000000871ABC Unknown Unknown Unknown libpthread-2.12.s 00002B99D4A8D7E0 Unknown Unknown Unknown cice_auscom_360x3 00000000005464C4 ice_grid_mp_tlatl 1340 ice_grid.f90 cice_auscom_360x3 000000000054CAEB ice_grid_mpinit 390 ice_grid.f90 cice_auscom_360x3 000000000040EECC cice_initmod_mp_c 147 CICE_InitMod.f90 cice_auscom_360x3 000000000040C83D MAIN 54 CICE.f90 cice_auscom_360x3 000000000040C7DE Unknown Unknown Unknown libc-2.12.so 00002B99D4EBDD1D libc_start_main Unknown Unknown cice_auscom_360x3 000000000040C6E9 Unknown Unknown Unknown

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/OceansAus/cice5/pull/17#issuecomment-401586017, or mute the thread https://github.com/notifications/unsubscribe-auth/AAcN61ufYShNxpCR4FM7lZfqQG5KwvSDks5uCGuIgaJpZM4Uqqx- .

nichannah commented 6 years ago

I've merged all pull requests.

marshallward commented 6 years ago

I probably should have said in the last post, but I changed the config_sync function's arguments. It could be that everything compiles OK bit that garbage (or zeros) are being passed out read in the function input stack.

marshallward commented 6 years ago

It looks like the MOM changes are still outstanding:

https://github.com/mom-ocean/MOM5/pull/226

Not sure if this is why it's failing in ice_grid though. (Not in a great position to check this trace at the moment.)

Apologies for causing chaos here.

nichannah commented 6 years ago

I haven't pushed the mom changes back yet but I do have them locally.

It's all fine, just keen to get your changes in.

marshallward commented 6 years ago

So I did get the 1deg_jra_ryf run to work using master CICE and libaccessom2 (my changes + yours since my PR), and my local MOM branch (which may be missing any changes by you since then).

I did notice one particular problem where I needed to include an empty &coupler_nml / group inside of accessom2.nml. I tested the flag as true and false, but did not test if the namelist was entirely absent.

Since you are not seeing this error, I wonder if we are running the same code. Maybe your MOM source is different from mine? The only other thing I can think of is that you may need to delete+rebuild your object/mod files for all 3 models.

Edit: @nicjhan mentioned to me that he did see the coupler_nml error and had fixed it himself.

marshallward commented 6 years ago

Following up here, I didn't intend for everyone to add a whole new namelist group to every single accessom2.nml; this was a major oversight on my part. I forgot that a namelist must include ~every group~ the group that it's reading (though not every variable of that group). I'd be up for changing this, maybe adding to the accessom2_nml group somehow.

Also, I haven't commented on whether we should just disable all interconnects and use MPI_COMM_WORLD, because I figure it's better to first ensure that everything is working!

russfiedler commented 6 years ago

@nicjhan Why are all those stack traces pointing back at .f90 files rather than the original .F90 source? Is there a change to how the compilation is done or has cice always had a separate preprocessing step?

Acosta-Goncalves commented 4 years ago

Hi,

I'm running into the same problem (https://github.com/COSIMA/access-om2/issues/173) Nic reported above:

forrtl: error (73): floating divide by zero
Image PC Routine Line Source
cice_auscom_1440x 000000000044C657 ice_atmo_mp_atmo_ 352 ice_atmo.f90

And, like Marshall said, it seems to be linked to thva. Did you ever figure out how to solve the issue?

Thanks!