GEOS-ESM / GEOSctm

Fixture for chemical transport scenarios
Apache License 2.0

Running GEOS CTM #15

Open. JulesKouatchou opened this issue 4 years ago.

JulesKouatchou commented 4 years ago

I cloned GEOS CTM and was able to compile it. The ctm_setup script did not properly create the experiment directory because it was still referring to the old configuration (Linux/ instead of install/); I fixed the ctm_setup file. The code is now crashing during the initialization steps because it cannot create the grid. It is failing on Line 9193 of MAPL_Generic.F90:

call ESMF_ConfigGetAttribute(state%cf,gridname,label=trim(comp_name)//CF_COMPONENT_SEPARATOR//'GRIDNAME:',rc=status)
VERIFY_(status)

I can quickly understand why there is a problem: the label should only be 'GRIDNAME:'.
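For context, the label in that call is built as the component name, then the separator, then 'GRIDNAME:'. A sketch of what it resolves to (assuming CF_COMPONENT_SEPARATOR is '.', which is consistent with the prefixed rc entries in the next comment):

    ! With comp_name = 'GEOSctm', the label looked up is 'GEOSctm.GRIDNAME:',
    ! so the resource file needs a prefixed entry rather than a bare 'GRIDNAME:'.
    call ESMF_ConfigGetAttribute(state%cf, gridname, label='GEOSctm.GRIDNAME:', rc=status)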

I checked a couple of CVS tags I have and could not locate any MAPL version similar to the one in the git repository. I am wondering if MAPL has to be updated before GEOS CTM can run.

kgerheiser commented 4 years ago

I have the CTM running with this version of MAPL. You just need to add something like this into GEOSCTM.rc:

GEOSctm.GRID_TYPE: Cubed-Sphere
GEOSctm.GRIDNAME: PE360x2160-CF
GEOSctm.NF: 6
GEOSctm.LM: 72
GEOSctm.IM_WORLD: 360

And for Dynamics:

  DYNAMICS.GRID_TYPE: Cubed-Sphere
  DYNAMICS.GRIDNAME: PE360x2160-CF
  DYNAMICS.NF: 6
  DYNAMICS.LM: 72
  DYNAMICS.IM_WORLD: 360

There is some duplication, though; it could probably be cleaned up and added to ctm_setup.

JulesKouatchou commented 4 years ago

Thank you. Very interesting. If it works, I will change the setup script...

JulesKouatchou commented 4 years ago

Do you have by chance an experiment directory on discover I can look at? My code is still crashing. Thanks.

kgerheiser commented 4 years ago

I do have an experiment located at: /discover/nobackup/kgerheis/experiments/ctm_test_experiment

I think there might be a few other small things you need to change to get it to run. If you run into any errors I can probably quickly diagnose them as I've gone through this process several times.

JulesKouatchou commented 4 years ago

Kyle: Thank you. I will get back to you if I need more assistance.

JulesKouatchou commented 4 years ago

Kyle:

It seems that I am still missing something as my code continues to crash. My working directory is:

/gpfsm/dnb32/jkouatch/GEOS_CTM/GitRepos/testTR

I believe that I have a rc setting issue that I cannot identify. I noticed that in my standard output file I have:

In MAPL_Shmem: NumCores per Node = 96 NumNodes in use = 1 Total PEs = 96

In MAPL_InitializeShmem (NodeRootsComm): NumNodes in use = 1

That is not correct as there should be 4 nodes in use.

kgerheiser commented 4 years ago

You need to register the grid with the GridManager before the grid can be created. Add this to AdvCore_GridComp.F90:

subroutine register_grid_and_regridders()
    use MAPL_GridManagerMod, only: grid_manager
    use CubedSphereGridFactoryMod, only: CubedSphereGridFactory
    use MAPL_RegridderManagerMod, only: regridder_manager
    use MAPL_RegridderSpecMod, only: REGRID_METHOD_BILINEAR
    use LatLonToCubeRegridderMod
    use CubeToLatLonRegridderMod
    use CubeToCubeRegridderMod

    type (CubedSphereGridFactory) :: factory

    type (CubeToLatLonRegridder) :: cube_to_latlon_prototype
    type (LatLonToCubeRegridder) :: latlon_to_cube_prototype
    type (CubeToCubeRegridder) :: cube_to_cube_prototype

    call grid_manager%add_prototype('Cubed-Sphere',factory)
    associate (method => REGRID_METHOD_BILINEAR, mgr => regridder_manager)
      call mgr%add_prototype('Cubed-Sphere', 'LatLon', method, cube_to_latlon_prototype)
      call mgr%add_prototype('LatLon', 'Cubed-Sphere', method, latlon_to_cube_prototype)
      call mgr%add_prototype('Cubed-Sphere', 'Cubed-Sphere', method, cube_to_cube_prototype)
    end associate

end subroutine register_grid_and_regridders

and call it in AdvCore's SetServices (around line 318):

if (.NOT. FV3_DynCoreIsRunning) then
   call fv_init2(FV_Atm, dt, grids_on_my_pe, p_split)
   call register_grid_and_regridders() ! add this line
end if

This register_grid_and_regridders routine duplicates code in DynCore and should probably be added to its own module that AdvCore and DynCore can call.
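For instance, a minimal sketch of such a shared module (the module name is hypothetical; the body is the same code shown above):

    module CubedSphereRegistrationMod
       implicit none
       private
       public :: register_grid_and_regridders
    contains
       subroutine register_grid_and_regridders()
          use MAPL_GridManagerMod, only: grid_manager
          use CubedSphereGridFactoryMod, only: CubedSphereGridFactory
          use MAPL_RegridderManagerMod, only: regridder_manager
          use MAPL_RegridderSpecMod, only: REGRID_METHOD_BILINEAR
          use LatLonToCubeRegridderMod
          use CubeToLatLonRegridderMod
          use CubeToCubeRegridderMod

          type (CubedSphereGridFactory) :: factory
          type (CubeToLatLonRegridder) :: cube_to_latlon_prototype
          type (LatLonToCubeRegridder) :: latlon_to_cube_prototype
          type (CubeToCubeRegridder) :: cube_to_cube_prototype

          ! Register the cubed-sphere grid factory, then the three regridders.
          call grid_manager%add_prototype('Cubed-Sphere', factory)
          associate (method => REGRID_METHOD_BILINEAR, mgr => regridder_manager)
             call mgr%add_prototype('Cubed-Sphere', 'LatLon', method, cube_to_latlon_prototype)
             call mgr%add_prototype('LatLon', 'Cubed-Sphere', method, latlon_to_cube_prototype)
             call mgr%add_prototype('Cubed-Sphere', 'Cubed-Sphere', method, cube_to_cube_prototype)
          end associate
       end subroutine register_grid_and_regridders
    end module CubedSphereRegistrationMod

AdvCore and DynCore would then both use this module instead of carrying their own copies.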

JulesKouatchou commented 4 years ago

Kyle: That helps me get further. Thanks. The code still crashes. I am now running in debug mode...

lizziel commented 4 years ago

Jules, keep reporting the problems you are running into; I had to work through these issues as well when getting GCHP to work with the new MAPL, so I may be able to help.

lizziel commented 4 years ago

One note of caution so that you do not make the same mistake I did. Regarding the earlier comment to add this to your config file:

GEOSctm.GRID_TYPE: Cubed-Sphere
GEOSctm.GRIDNAME: PE360x2160-CF
GEOSctm.NF: 6
GEOSctm.LM: 72
GEOSctm.IM_WORLD: 360

I made the mistake of adding the prefix (in my case GCHP. rather than GEOSctm.) to the existing lines rather than adding new prefixed lines. This caused a silent bug in AdvCore_GridCompMod.F90, since it expects entries for IM, JM, and LM in the file without prefixes. If those lines are not found, the calculated areas get messed up because default values are used.

My fix was to change AdvCore_GridCompMod.F90 to expect lines with the prefix rather than without.

-      call MAPL_GetResource( MAPL, FV_Atm(1)%flagstruct%npx, 'IM:', default= 32, RC=STATUS )
+      ! Customize for GCHP (ewl, 4/9/2019)
+!      call MAPL_GetResource( MAPL, FV_Atm(1)%flagstruct%npx, 'IM:', default= 32, RC=STATUS )
+      call MAPL_GetResource( MAPL, FV_Atm(1)%flagstruct%npx, 'GCHP.IM:', default= 32, RC=STATUS )
       _VERIFY(STATUS)
-      call MAPL_GetResource( MAPL, FV_Atm(1)%flagstruct%npy, 'JM:', default=192, RC=STATUS )
+!      call MAPL_GetResource( MAPL, FV_Atm(1)%flagstruct%npy, 'JM:', default=192, RC=STATUS )
+      call MAPL_GetResource( MAPL, FV_Atm(1)%flagstruct%npy, 'GCHP.JM:', default=192, RC=STATUS )
       _VERIFY(STATUS)
-      call MAPL_GetResource( MAPL, FV_Atm(1)%flagstruct%npz, 'LM:', default= 72, RC=STATUS )
+!      call MAPL_GetResource( MAPL, FV_Atm(1)%flagstruct%npz, 'LM:', default= 72, RC=STATUS )
+      call MAPL_GetResource( MAPL, FV_Atm(1)%flagstruct%npz, 'GCHP.LM:', default= 72, RC=STATUS )
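A variant that would make such a missing entry fail loudly instead of silently falling back to the default (a sketch; it assumes MAPL_GetResource returns a bad status when the label is absent and no default is supplied):

    ! Hypothetical fail-loud form: with no default supplied, an absent
    ! 'GCHP.IM:' entry should surface as a bad STATUS instead of a silent 32.
    call MAPL_GetResource( MAPL, FV_Atm(1)%flagstruct%npx, 'GCHP.IM:', RC=STATUS )
    _VERIFY(STATUS)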

JulesKouatchou commented 4 years ago

Lizzie,

Thank you for your comments. I am wondering whether it was necessary to make any change to my GEOSCTM.rc file, as the AdvCore source code has:

     call MAPL_GetResource( MAPL, FV_Atm(1)%flagstruct%npx, 'IM:', default= 32, RC=STATUS )
     call MAPL_GetResource( MAPL, FV_Atm(1)%flagstruct%npy, 'JM:', default=192, RC=STATUS )
     call MAPL_GetResource( MAPL, FV_Atm(1)%flagstruct%npz, 'LM:', default= 72, RC=STATUS )

In any case, my code is still crashing and I am still trying to figure out why.

JulesKouatchou commented 4 years ago

Something unusual that I mentioned before is the following printout at the beginning of the run:

In MAPL_Shmem: NumCores per Node = 84 NumNodes in use = 1 Total PEs = 84

In MAPL_InitializeShmem (NodeRootsComm): NumNodes in use = 1

I went ahead in MAPL_ShmemMod.F90 and printed the names of all the processors (nodes) after the call:

    call MPI_AllGather(name,  MPI_MAX_PROCESSOR_NAME, MPI_CHARACTER, &
                       names, MPI_MAX_PROCESSOR_NAME, MPI_CHARACTER, Comm, status)

All the entries of the variable "names" only had the name of the head node. Something is wrong, but I do not know what. I am sure that there is a new rc setting I need to add.
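For reference, the kind of diagnostic described above looks something like this (a sketch; the local names are hypothetical):

    ! Gather every rank's processor name and print them from the root,
    ! to check whether all entries collapse to the head node.
    character(len=MPI_MAX_PROCESSOR_NAME) :: name
    character(len=MPI_MAX_PROCESSOR_NAME), allocatable :: names(:)
    integer :: namelen, rank, npes, i, status

    call MPI_Comm_size(Comm, npes, status)
    call MPI_Comm_rank(Comm, rank, status)
    call MPI_Get_processor_name(name, namelen, status)
    allocate(names(npes))
    call MPI_AllGather(name,  MPI_MAX_PROCESSOR_NAME, MPI_CHARACTER, &
                       names, MPI_MAX_PROCESSOR_NAME, MPI_CHARACTER, Comm, status)
    if (rank == 0) then
       do i = 1, npes
          print *, 'rank ', i-1, ' is on node ', trim(names(i))
       end do
    end if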

mathomp4 commented 4 years ago

@JulesKouatchou This sounds suspiciously like calling MPT with mpirun. How are you running your executable? If you're using MPT (which you probably are), you need to use either mpiexec_mpt or use esma_mpirun (which autodetects MPI stack and will run mpiexec_mpt)

ETA: For everyone, MPT does have an mpirun command but it is weird. It can do some interesting things, but it doesn't perform like you'd expect without a lot of extra work. mpiexec_mpt understands SLURM (ish) and does what is expected.

mathomp4 commented 4 years ago

@JulesKouatchou Actually, I might need to work with you on the ctm_setup script. There are some changes we had to make to gcm_setup for the move to CMake that aren't reflected in that script as I see it on this repo. It's possible it's getting confused and picking the wrong MPI stack or the like.

kgerheiser commented 4 years ago

@mathomp4 That is something I had to change in ctm_run.j, and why I asked you about MPT's mpirun. I changed RUN_CMD to mpiexec.

JulesKouatchou commented 4 years ago

@mathomp4
I made changes in my ctm_setup in order to have a complete experiment directory. The script was still referring, for instance, to the Linux/ directory. In my ctm_run.j file, RUN_CMD is set to "mpirun", not "mpiexec". I will make the change and see what happens.

JulesKouatchou commented 4 years ago

Things are moving in the right direction. I now have, as expected:

In MAPL_Shmem: NumCores per Node = 28 NumNodes in use = 3 Total PEs = 84

In MAPL_InitializeShmem (NodeRootsComm): NumNodes in use = 3

The code is still crashing but this time in:

GEOSctm.x          00000000048A9E47  fv_statemod_mp_co  3160  FV_StateMod.F90
GEOSctm.x          0000000004868BAF  fv_statemod_mp_fv  2902  FV_StateMod.F90
GEOSctm.x          00000000004B2B6B  geos_ctmenvgridco   926  GEOS_ctmEnvGridComp.F90

The crash appears in the manipulations of U & V. My guess is that U & V are not properly read in, or I am missing an rc setting somewhere.

tclune commented 4 years ago

@JulesKouatchou Can you tell me which branch of GEOSgcm_GridComp is being used? Lines 3160 and 2902 don't seem right for this type of error. (I wanted to check on an uninitialized pointer that crops up in FV_StateMod from time to time.)

tclune commented 4 years ago

@JulesKouatchou Also - worth learning how to include references to code snippets in tickets. It is easier to show you when you are here, but you can also find it by googling. As an example, I'll show the sections around the line numbers you mentioned above:

https://github.com/GEOS-ESM/GEOSgcm_GridComp/blob/b4446a6504226c3933fd34f4a842a44ba8b7333b/GEOSagcm_GridComp/GEOSsuperdyn_GridComp/FVdycoreCubed_GridComp/FV_StateMod.F90#L3154-L3168

JulesKouatchou commented 4 years ago

@tclune I do not know how to include references to code-snippets of external components/modules of the CTM. I will try to figure it out.

lizziel commented 4 years ago

@JulesKouatchou Check this out: https://help.github.com/en/articles/creating-a-permanent-link-to-a-code-snippet. Thanks @tclune (I did not know about this).

tclune commented 4 years ago

Note - this used to work better. Rather than a link, you would actually see the lines of code in the ticket (and the email). I see that someone has raised the issue with GitHub. It started working wrongly (for some users) 2-3 weeks ago.

tclune commented 4 years ago

Oh - it's because we are linking to text from a different repo. For the same repo it works really nicely. E.g.

https://github.com/GEOS-ESM/GEOSctm/blob/4b2ac7de41c4a34b0a04a36b9f5239a02a7b77db/src/Components/GEOSctm_GridComp/CTMdiffusion_GridComp/GmiDiffusionMethod_mod.F90#L305-L322

lizziel commented 4 years ago

Very nice. Not sure if you guys use Slack, but it shows up nicely in Slack chats as well.

kgerheiser commented 4 years ago

@JulesKouatchou

In fv_computeMassFluxes_r8 you need to initialize uc and vc to 0.0 at the beginning of the subroutine before they are assigned in lines 2916-2917 in FV_StateMod.F90.

  !add these two lines to initialize to 0
  uc = 0d0 
  vc = 0d0 

  uc(is:ie,js:je,:) = ucI
  vc(is:ie,js:je,:) = vcI

JulesKouatchou commented 4 years ago

Kyle,

Thank you for your inputs. I was able to run the CTM code for a day. I now need to run the code longer under various configurations.

tclune commented 4 years ago

@kgerheiser I'm wary that these are not the correct points to do initializations. Was this solution found by trial-and-error, or pulled from some other version of the code?

kgerheiser commented 4 years ago

I found this bug several weeks ago and tracked it down to uninitialized halo values.

The line uc(is:ie,js:je,:) = ucI only initializes the interior of the array, leaving the halos uninitialized. Then, later in the code compute_utvt uses those halo values while they are uninitialized and causes the crash.
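Schematically (kind and halo bounds hypothetical, following the declarations in FV_StateMod):

    ! uc spans the halo region, but only the interior is assigned, so the
    ! halo cells keep whatever garbage was in memory until the explicit
    ! uc = 0d0 / vc = 0d0 initialization is added.
    real(8) :: uc(is-ng:ie+ng, js-ng:je+ng, npz)

    uc(is:ie, js:je, :) = ucI   ! interior only; halos stay uninitialized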

kgerheiser commented 4 years ago

I looked at the r4 version of the subroutine, and it also assigns the variables to 0 at the same place.

lizziel commented 4 years ago

It looks like someone at Harvard added this fix to the GCHP version of FV3 (r8) several years ago. Apologies if it never made it up the chain. I missed it as well when upgrading FV3 recently, so I will need to add it back in, although it has not caused a crash so far.

JulesKouatchou commented 4 years ago

I want to provide an update. The code does not run when using regular compilation options:

Image              PC                Routine            Line     Source
GEOSctm.x          000000000261DDAE  Unknown            Unknown  Unknown
libpthread-2.11.3  00002AAAAEAAF850  Unknown            Unknown  Unknown
GEOSctm.x          00000000015CE155  tp_core_mod_mp_pe  1053     tp_core.F90
GEOSctm.x          00000000015D932C  tp_core_mod_mp_yp  915      tp_core.F90
GEOSctm.x          00000000015C2D27  tp_core_mod_mp_fv  165      tp_core.F90
GEOSctm.x          00000000012CB321  fv_statemod_mp_fv  2979     FV_StateMod.F90

When I compile with debugging options, the code runs for 8 days and crashes:

AGCM Date: 2010/02/08  Time: 16:00:00  Throughput(days/day)[Avg Tot Run]: 328.0 310.2 340.4  TimeRemaining(Est) 001:46:49  91.0% Memory Committed
Insufficient memory to allocate Fortran RTL message buffer, message #41 = hex 00000029.
Insufficient memory to allocate Fortran RTL message buffer, message #41 = hex 00000029.

There seems to be a memory issue. Is there any setting I need to have?

My code is at:

/discover/nobackup/jkouatch/GEOS_CTM/GitRepos/GEOSctm

and my experiment directory at:

/discover/nobackup/jkouatch/GEOS_CTM/GitRepos/testTR

tclune commented 4 years ago

@lizziel has also reported some memory leak issues. @bena-nasa has tried to replicate with a more synthetic use case, but was unsuccessful.

I know that @wmputman has been using the advec core in his latest GCM development, but AFAIK he is using a slightly different version of FV than made it into Git. I.e., the git version of FV was really only vetted with the dycore and we are probably missing various minor fixes in the advec core that were only in CTM tags under CVS.

Someone knowledgeable needs to take a hard look at the diffs between the current FV and the working versions in CTMs under CVS.

kgerheiser commented 4 years ago

I can reproduce the crash in tp_core with debugging turned off.

kgerheiser commented 4 years ago

I'm not sure what to make of this:

I added a print statement to tp_core where the crash is occurring (divide by zero) to check if the divisor was really small, and adding the print statement made the crash go away.

And it turns out that a4 in this line

    fmin = a0(i) + 0.25/a4*da1**2 + a4*r12

is on the order of 10^-16.

https://github.com/GEOS-ESM/GEOSgcm_GridComp/blob/9ab6f4e18386d0e7802711f9dd70849080b61f78/GEOSagcm_GridComp/GEOSsuperdyn_GridComp/FVdycoreCubed_GridComp/fvdycore/model/tp_core.F90#L1054
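For what it's worth, a purely diagnostic sketch of the kind of check/print described above (the threshold and message are hypothetical; a0, a4, da1, and r12 as in tp_core):

    ! Flag a suspiciously small a4 before the division so the failure
    ! is visible even when adding a print perturbs the optimizer.
    if (abs(a4) < 1.0e-12) then
       print *, 'tp_core: near-zero a4 = ', a4
    end if
    fmin = a0(i) + 0.25/a4*da1**2 + a4*r12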

tclune commented 4 years ago

We seem to be accumulating a sizable number of mysteries in the FV layer recently.

Just yesterday Rahul made a change to the model entirely outside of FV, but it produced a runtime error in a write statement.

I certainly encourage you to use aggressive debugging options under both gfortran and Intel in the hopes that it exposes something. But if you've already done that ... Valgrind?

wmputman commented 4 years ago

There is the outstanding issue that the cat fvcore_layout.rc >> input.nml step in the run script fails on occasion. This can produce inconsistent and undesirable effects in FV3.


mathomp4 commented 4 years ago

There is the outstanding issue that the cat fvcore_layout.rc >> input.nml in the run fails on occasion…. This can produce inconsistent and undesirable effects in FV3.

@wmputman,

Did you ever try using Rusty's INTERNAL_FILE_NML as seen in GEOS-ESM/GEOSgcm#34? Maybe it's just thousands of cores trying to open the same file that could cause issues?

wmputman commented 4 years ago

The trouble is that sometimes the input.nml file is completely empty, before the executable even begins.


mathomp4 commented 4 years ago

The trouble is that sometimes the input.nml file is completely empty, before the executable even begins.

That sounds like an issue in the scripting then. No matter what, we can never not do an append because the coupled model (@yvikhlya) actually has an input.nml file that controls all of MOM. Thus, we need to append the rc file to it.

I could belt-and-suspenders it with scripting. We could put in detection before GEOSxxx.x runs: if input.nml is empty, die, or try the append again? Hmm.
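If the check were pushed into the executable itself rather than the script, a hedged sketch of the detection might be (the file name is from this thread; everything else is illustrative):

    ! Hypothetical early abort if input.nml is missing or empty,
    ! run before FV3 ever tries to read its namelists.
    logical :: nml_exists
    integer :: nml_size

    inquire(file='input.nml', exist=nml_exists, size=nml_size)
    if (.not. nml_exists .or. nml_size <= 0) then
       write(*,*) 'input.nml is missing or empty; aborting'
       error stop 1
    end if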

bena-nasa commented 4 years ago

I was able to run the CTM version on GitHub that Jules pointed me to, with a few modifications outlined in the comments here. This appears to be based on Jason-3_0. I too was only able to run the debug build, but I can confirm that there does appear to be a memory leak. After 4 days at c90 with the tracer case (running from 21z on the 1st to 0z on the 5th of the month), here is the memory use on the root node of my compute session:

Jason-3_0 based GEOSctm: 18.4% to 56.4%
MAPL on the develop branch (and MAPL-2.0 branch on other repos): 12.9% to 14.4%
Version of MAPL off develop that no longer uses CFIO: 11.2% to 13.4%

So apparently not using the little CFIO does not make a difference? It is also very hard to tell how correlated this is to ExtData.

tclune commented 4 years ago

That's very encouraging on the memory front.

Might need to up the priority for investigating the no-debug failure now.

JulesKouatchou commented 4 years ago

When I compile in a no-debug mode, the code crashes on Line 999 of:

       FVdycoreCubed_GridComp/fvdycore/model/fv_tracer2d.F90 

It is:

    if (flagstruct%fill) call fillz(ie-is+1, npz, 1, q1(:,j,:), dp2(:,j,:))

If I comment out the line, the code runs. I have not yet figured out why the code crashes in the subroutine "fillz", which is inside:

      FVdycoreCubed_GridComp/fvdycore/model/fv_fill.F90

It seems that something is happening in this code segment (Lines 87-92):

  do i=1,im
     if( q(i,1,ic) < 0. ) then
        q(i,2,ic) = q(i,2,ic) + q(i,1,ic)*dp(i,1)/dp(i,2)
        q(i,1,ic) = 0.
     endif
  enddo
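One illustrative check that could be dropped in just above that division (purely hypothetical; the arrays are as declared in fillz):

    ! Trap a zero or NaN pressure thickness before it poisons the division.
    do i = 1, im
       if (dp(i,2) == 0. .or. dp(i,2) /= dp(i,2)) then
          print *, 'fillz: bad dp(i,2) at i = ', i, ' dp = ', dp(i,2)
       endif
    enddo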

kgerheiser commented 4 years ago

It crashes in fv_tracer2d? Based on your previous post, and from my testing, it crashes in tp_core with a floating divide-by-zero. Also, what does it crash with: segfault, divide by zero, etc.?

JulesKouatchou commented 4 years ago

Here is what I am getting:

MPT: #6 0x000000000173a21e in fv_fill_mod::fillz (
MPT:     im=<error reading variable: Cannot access memory at address 0x14>,
MPT:     km=<error reading variable: Cannot access memory at address 0x4>,
MPT:     nq=<error reading variable: Cannot access memory at address 0x0>,
MPT:     q=<error reading variable: Cannot access memory at address 0x7fffffff3590>,
MPT:     dp=...)
MPT:     at /gpfsm/dnb32/jkouatch/GEOS_CTM/GitRepos/GEOSctm/src/Components/GEOSctm_GridComp/@GEOSgcm_GridComp/GEOSagcm_GridComp/GEOSsuperdyn_GridComp/FVdycoreCubed_GridComp/fvdycore/model/fv_fill.F90:89

I read somewhere that one way to remove the "error reading variable: Cannot access memory" error is to change the compilation options. Perhaps that is why the code can run with debugging options turned on.

kgerheiser commented 4 years ago

Looks like it might be MPT related. I was building with Ifort and Intel MPI. Maybe that's why I don't get that bug.

Right now I'm working on building with gfortran and OpenMPI in hopes that will expose some bugs.

mathomp4 commented 4 years ago

@JulesKouatchou Do you know what MPT environment variables are set with the CTM? It's possible one of the many we set for the GCM is needed for the CTM?

mathomp4 commented 4 years ago

Oh wow. I probably need the expertise of @tclune for my question with this code. So we have:

if (flagstruct%fill) call fillz(ie-is+1, npz, 1, q1(:,j,:), dp2(:,j,:))

and now the fillz routine:

 subroutine fillz(im, km, nq, q, dp)
   integer,  intent(in):: im                !< No. of longitudes
   integer,  intent(in):: km                !< No. of levels
   integer,  intent(in):: nq                !< Total number of tracers
   real , intent(in)::  dp(im,km)           !< pressure thickness
   real , intent(inout) :: q(im,km,nq)      !< tracer mixing ratio

Now q1 and dp2 are:

real ::   q1(bd%is :bd%ie ,bd%js :bd%je , npz   )! 3D Tracers
      real  dp2(bd%is:bd%ie,bd%js:bd%je,npz)

which means we are sending a weird 2-D slice, q1(:,j,:), to a 3-D q in fillz?

So does Fortran guarantee that q1(is:ie,j,npz) maps equally to q(ie-is+1,npz,1) when passed through the interface of a subroutine? They are the same size in the end ((ie-is+1)*npz), but ouch. I guess I'd be dumb and fill a temporary (is:ie,npz,1) array with q and pass that in, as sketched below.
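That workaround would look something like the following (local names are hypothetical; Jules reports trying essentially this in the next comment):

    ! Hypothetical contiguous temporaries so fillz receives arrays whose
    ! declared shapes match its dummy arguments exactly.
    real :: loc_q1(ie-is+1, npz, 1)
    real :: loc_dp2(ie-is+1, npz)

    loc_q1(:,:,1) = q1(is:ie, j, :)
    loc_dp2       = dp2(is:ie, j, :)
    if (flagstruct%fill) call fillz(ie-is+1, npz, 1, loc_q1, loc_dp2)
    q1(is:ie, j, :) = loc_q1(:,:,1)   ! copy the filled values back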

JulesKouatchou commented 4 years ago

Matt: I had the same concern. I created two temporary variables loc_q1(ie-is+1,npz,1) and loc_dp2(ie-is+1,npz). The code still crashed at the same location.

kgerheiser commented 4 years ago

My explanation was sort of off before.

They are the same size, which is the important thing. A temporary, automatic array of the given shape is created, the data is copied into it from your source array (which has the same number of elements), and it is copied back when the subroutine returns. You're basically just interpreting the bounds differently.

Though if they don't match, the code will still happily compile and run.

tclune commented 4 years ago

@mathomp4 Unfortunately, this style is acceptable according to the standard. (But I'd much prefer not to mix F90 and F77 array styles.)

As @kgerheiser explains, the compiler is forced to make copies due to the explicit shape of the dummy arguments. Without aggressive debugging flags, the sizes don't even have to agree.
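For the record, a self-contained sketch of the sequence association being discussed (all names hypothetical): a rank-2 section is passed to a rank-3 explicit-shape dummy with the same element count, which the standard permits and the compiler implements with copy-in/copy-out.

    program seq_assoc_demo
       implicit none
       real :: q1(4, 3, 5)   ! (i, j, k), like q1(is:ie, js:je, npz)
       q1 = 1.0
       ! Rank-2 section, 4*5 = 20 elements, sequence-associated with a
       ! rank-3 explicit-shape dummy of 4*5*1 = 20 elements.
       call fillz_like(4, 5, 1, q1(:, 2, :))
       print *, q1(1, 2, 1)   ! 0.0: the copy-out wrote the change back
    contains
       subroutine fillz_like(im, km, nq, q)
          integer, intent(in)    :: im, km, nq
          real,    intent(inout) :: q(im, km, nq)   ! explicit shape
          q(1, 1, 1) = 0.0
       end subroutine fillz_like
    end program seq_assoc_demo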