ESCOMP / mizuRoute

Reach-based river routing model
http://escomp.github.io/mizuRoute/
GNU General Public License v3.0
39 stars 51 forks source link

Refactoring related to PIO decomposition initialization #431

Closed nmizukami closed 8 months ago

nmizukami commented 9 months ago

Move pio decomposition initialization to the model initialization (beginning of the model execution). Idea is to define decomposition data once and share pio system (which decomposition data is attached to) for all the history files and restart file, and avoid defining decomposition data every time the output files are created.

This does not change any science and should re-produce the results from the codes prior to this PR.

resolves #409 resolves #438 resolves #439

nmizukami commented 8 months ago

Hi Erik (@ekluzek), using pioSystem as a pointer did not work. It compiled with both gnu and intel, but did not run for both. No obvious messages (I think it died when it tried to create the file, which uses pioSystem). I am thinking it is the best to leave this as is (so leave pioSystem defined as pio_system type, not pointer) and figure out what to do later since this works both stand-alone and cesm coupling now. or any other idea?

ekluzek commented 8 months ago

Hi Erik (@ekluzek), using pioSystem as a pointer did not work. It compiled with both gnu and intel, but did not run for both. No obvious messages (I think it died when it tried to create the file, which uses pioSystem). I am thinking it is the best to leave this as is (so leave pioSystem defined as pio_system type, not pointer) and figure out what to do later since this works both stand-alone and cesm coupling now. or any other idea?

@nmizukami yeah if it works I guess we can leave it as is. I'm nervous about it because the CESM interface is returning a pointer and we aren't treating it that way. But, if it works, I guess it's OK. Something to check though, is to make sure it works for the compilers we want it to. And I'm also concerned that maybe the reason it works happens to be because of the internals of either PIO or the compilers we use -- so that if either changes -- it'll break. But, we can also handle that when it comes rather than now, especially since this is problematic. I'd also like to show this to Jim Edwards and see if he has thoughts on this. So we should schedule a meeting for that.

ekluzek commented 8 months ago

@nmizukami I ran the tests on cheyenne and they were all fine running as expected. The tests on izumi had some tests pass, but the NAG tests showed problems...

ERS_D_Mmpi-serial.5x5_amazon_r05.I2000Clm50SpMizGs.izumi_nag.mizuroute-default RUN SMS_D.5x5_amazon_r05.I2000Clm50SpMizGs.izumi_nag.mizuroute-default RUN SMS_D_Mmpi-serial.5x5_amazon_r05.I2000Clm50SpMizGs.izumi_nag.mizuroute-default RUN

With DEBUG mode turned on in above, you'll see things like dangling pointers caught for example in that middle one.

Since, the last one is the simplest this is what you see captured in the traceback for cesm.log:

Completion(send) value=1072724297 tag=1
Completion(send) value=977411098 tag=1
Completion(send) value=1199980800 tag=1
Completion(sRuntime Error: /fs/cgd/data0/erik/ctsm_worktree/mizuroute/components/mizuRoute/route/build/src/historyFile.f90, line 407: Reference to dangling pointer THIS%IODESCHRUFLUX
Target was RETURNed from procedure HISTORYFILE:SET_COMPDOF_RCH_HRU
Program terminated by fatal error
/fs/cgd/data0/erik/ctsm_worktree/mizuroute/components/mizuRoute/route/build/src/historyFile.f90, line 407: Error occurred in HISTORYFILE:WRITE_FLUX_HRU
/fs/cgd/data0/erik/ctsm_worktree/mizuroute/components/mizuRoute/route/build/src/historyFile.f90, line 372: Called by HISTORYFILE:WRITE_FLUX
/fs/cgd/data0/erik/ctsm_worktree/mizuroute/components/mizuRoute/route/build/src/write_simoutput_pio.f90, line 210: Called by WRITE_SIMOUTPUT_PIO:OUTPUT
/fs/cgd/data0/erik/ctsm_worktree/mizuroute/components/mizuRoute/route/build/cpl/RtmMod.F90, line 676: Called by RTMMOD:ROUTE_RUN
/fs/cgd/data0/erik/ctsm_worktree/mizuroute/components/mizuRoute/route/build/cpl/nuopc/rof_comp_nuopc.F90, line 792: Called by ROF_COMP_NUOPC:MODELADVANCE
/fs/cgd/data0/erik/ctsm_worktree/mizuroute/components/cmeps/cime_config/../cesm/driver/esmApp.F90, line 141: Called by ESMAPP
end) value=1124738608 tag=1
nmizukami commented 8 months ago

Thank you Erik.

So I believe this is nmizukami:cesm-coupling_pio-decomp branch (make sure I am looking at the same code)?

Dangling pointer means that a target is deallocated (or does not exist for some reason)? I wonder if I actually need to clean (nullify the pointer) when a history netcdf is finished and closed before new history file (= before histFile object is created) like this:

    SUBROUTINE cleanup_rch_hru(this)
      implicit none
      class(histFile), intent(inout) :: this

      nullify(this%ioDescRchFlux)
      nullify(this%ioDescHruFlux)

    END SUBROUTINE cleanup_rch_hru

Other than that, I don't see apparent error in the code.

ekluzek commented 8 months ago

Hi @nmizukami. I looked at this some more to see a little bit of what's happening. I was going to try your idea of nullifying pointers, but I realized that this isn't happening at the end -- it's fairly early on. But, it's also after other successful output happens.

Here's a code snippet where this is happening...

      call write_netcdf(this%pioFileDesc, 'time', [hVars_local%timeVar(1)], [this%iTime], [1], ierr, cmessage)
      if(ierr/=0)then; message=trim(message)//trim(cmessage); return; endif

      call write_netcdf(this%pioFileDesc, 'time_bounds', hVars_local%timeVar, [1,this%iTime], [2,1], ierr, cmessage)
      if(ierr/=0)then; message=trim(message)//trim(cmessage); return; endif

      if (.not.this%gageOutput) then
        if (meta_hflx(ixHFLX%basRunoff)%varFile) then
          call this%write_flux_hru(hVars_local, ierr, cmessage).    !!! <<<< dying in here
          if(ierr/=0)then; message=trim(message)//trim(cmessage); return; endif
        endif
      end if

      call this%write_flux_rch(hVars_local, index_write, ierr, cmessage)
      if(ierr/=0)then; message=trim(message)//trim(cmessage); return; endif

      call sync_file(this%pioFileDesc, ierr, cmessage)
      if(ierr/=0)then; message=trim(message)//trim(cmessage); return; endif

    END SUBROUTINE write_flux
ekluzek commented 8 months ago

Here's my casedir on izumi...

/scratch/cluster/erik/tests_mizu_c-cpln3_v211_ctsm51d114m/SMS_D_Mmpi-serial.5x5_amazon_r05.I2000Clm50SpMizGs.izumi_nag.mizuroute-default.GC.mizu_c-cpln3_v211_ctsm51d114m_nag

nmizukami commented 8 months ago

Ok, so nullify the pointer did not work. So need to think something else... do you think something like this is related to this error? I was googling with "fortran dangling pointer" and ran into this.

nmizukami commented 8 months ago

Hi Erik (@ekluzek), using pioSystem as a pointer did not work. It compiled with both gnu and intel, but did not run for both. No obvious messages (I think it died when it tried to create the file, which uses pioSystem). I am thinking it is the best to leave this as is (so leave pioSystem defined as pio_system type, not pointer) and figure out what to do later since this works both stand-alone and cesm coupling now. or any other idea?

@nmizukami yeah if it works I guess we can leave it as is. I'm nervous about it because the CESM interface is returning a pointer and we aren't treating it that way. But, if it works, I guess it's OK. Something to check though, is to make sure it works for the compilers we want it to. And I'm also concerned that maybe the reason it works happens to be because of the internals of either PIO or the compilers we use -- so that if either changes -- it'll break. But, we can also handle that when it comes rather than now, especially since this is problematic. I'd also like to show this to Jim Edwards and see if he has thoughts on this. So we should schedule a meeting for that.

Hi Erik, I change pioSystem to a pointer in globalData.f90. As Jim said, allocating pioSystem before calling pio_sys_init works for standalone. cesm-coupling mode also works (I just tested one case - f09_f09_rMERIT with cheyenne intel.

nmizukami commented 8 months ago

Hi Erik, I updated handling of the pio decomposition data for history file output. At least there are no pointers (except for pioSystem), so will need to test with nag compiler in izumi. At cheyenne, it works as expected.

ekluzek commented 8 months ago

@nmizukami I ran that one test on izumi, and it PASSes! So I'll rerun all the testing which hopefully will all work, and then we can merge this and make a tag!

nmizukami commented 8 months ago

@nmizukami I ran that one test on izumi, and it PASSes! So I'll rerun all the testing which hopefully will all work, and then we can merge this and make a tag!

Ah, great! thank you for testing! Yes, we can merge and tag it.

I also want to setup and run ctsm-mizuroute in izumi (something I want to talk about today).

ekluzek commented 8 months ago

Tests on izumi all PASS, the Cheyenne tests are going:

ERI_Mmpi-serial.5x5_amazon_r05.I2000Clm50SpMizGs.izumi_gnu.mizuroute-default ERS_D_Mmpi-serial.5x5_amazon_r05.I2000Clm50SpMizGs.izumi_nag.mizuroute-default PET_Mmpi-serial_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.izumi_intel.mizuroute-default SMS_D.5x5_amazon_r05.I2000Clm50SpMizGs.izumi_intel.mizuroute-default SMS_D.5x5_amazon_r05.I2000Clm50SpMizGs.izumi_nag.mizuroute-default SMS_D_Mmpi-serial.5x5_amazon_r05.I2000Clm50SpMizGs.izumi_gnu.mizuroute-default SMS_D_Mmpi-serial.5x5_amazon_r05.I2000Clm50SpMizGs.izumi_intel.mizuroute-default SMS_D_Mmpi-serial.5x5_amazon_r05.I2000Clm50SpMizGs.izumi_nag.mizuroute-default

ekluzek commented 8 months ago

OK, tests on Cheyenne are as expected. So this is ready to merge and create a tag.