E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM
Other
350 stars 360 forks source link

COSP failing on Titan #836

Closed worleyph closed 8 years ago

worleyph commented 8 years ago

As reported by @wlin7 in issue #800:

"The ne30L72 test with this bugfix branch runs ok with pgi_acc, but not w/o issues. Default 1280x1 PE-layout is used. The planned AMIP run is intended as a baseline, so I would like to enable COSP. But it failed after 4 steps (as in the issue at #479). It ran ok if cosp was turned off. "

I reproduced (segmentation fault) and tracked it down to

 ./components/cam/src/physics/cosp/cosp_simulator.F90
 ./components/cam/src/physics/cosp/cosp_constants.F90

In cosp_constants.F90 are the statements:

     integer,dimension(N_SIMULATORS) :: tsim
     data tsim/N_SIMULATORS*0.0/

so tsim is being initialized with real values even though it is declared as integer.

Then in cosp_simulator.F90 are numerous examples of statements of the form:

     call system_clock(t0)
     call cosp_radar(gbx,sgx,sghydro,sgradar)
     call system_clock(t1)
     tsim(isim) = tsim(isim) + (t1 -t0)

I just commented all timing-related logic:

    !pw  integer,dimension(N_SIMULATORS) :: tsim
    !pw data tsim/N_SIMULATORS*0.0/

and

     !pw call system_clock(t0)
     call cosp_radar(gbx,sgx,sghydro,sgradar)
     !pw call system_clock(t1)
     !pw tsim(isim) = tsim(isim) + (t1 -t0)

and it then worked. I did not try just fixing the tsim initialization, since the tsim timings are not used anywhere - the data is only collected, so is irrelevant for ACME (and just extra overhead).

I'll let the atmosphere group decided what to do about this, but there is clearly one issue that is not Titan-specific, and the above changes do work on Titan in my reproducer. (I am rebuilding without all of my debug writes at the moment; will verify this with a clean run as soon as the job makes it through the Titan queue.)

worleyph commented 8 years ago

Update: My confirmation job worked fine (as far as I can tell). Integrated for 5 days and timers indicated that COSP was called. This was with pgi_acc, though I assume that pgi will behave the same. The compset and resolution were:

 -compset FC5AV1C-L -res ne30_oEC_cosp

with -cosp added to env_build.xml .

singhbalwinder commented 8 years ago

Thanks @worleyph for your detective work! I think this needs to be fixed. @wlin7 : Would you please assign it to somebody more familiar with COSP? It is probably a minor fix but I am not familiar enough with the COSP codes to make that decision.

wlin7 commented 8 years ago

Thanks @worleyph from me, too. I will take care of the update, with some discussion among cosp task team anyway. For the test, I just confirmed it works with pgi, with threading too. pgi w/o threading work even before this fix.

Now a little off the subject, @singhbalwinder , can you please take a look at the issue #833?

singhbalwinder commented 8 years ago

Hi @wlin7 : Yes, I am working on it. So far, i tried setting up simulations on two machines but unfortunately, queues on both of these seems backed up. I will try at some other machine to see if I can reproduce the problem.

wlin7 commented 8 years ago

Thanks @singhbalwinder . For this cosp timer, it has been decided to remove it altogether. Fixing the initialization alone does allow cosp to work with pgi and threading, but not with pgi_acc. It was only used for tracking the time use during development. I pushed the branch wlin/atm/cosp_notimer to the github. Please integrate it.

worleyph commented 8 years ago

This has been fixed in the latest version of master. @wlin7 or @singhbalwinder , could you please add a comment indicating which PR solved this problem? Thanks.

wlin7 commented 8 years ago

This was fixed with the commit of https://github.com/ACME-Climate/ACME/commit/fc85a8a087f95ada85092683b331b1687e4fee2d for the branch wlin/atm/cosp_notimer.