Closed worleyph closed 8 years ago
Update: My confirmation job worked fine (as far as I can tell). Integrated for 5 days and timers indicated that COSP was called. This was with pgi_acc, though I assume that pgi will behave the same. The compset and resolution were:
-compset FC5AV1C-L -res ne30_oEC_cosp
with -cosp added to env_build.xml .
Thanks @worleyph for your detective work! I think this needs to be fixed. @wlin7 : Would you please assign it to somebody more familiar with COSP? It is probably a minor fix but I am not familiar enough with the COSP codes to make that decision.
Thanks @worleyph from me, too. I will take care of the update, with some discussion among cosp task team anyway. For the test, I just confirmed it works with pgi, with threading too. pgi w/o threading work even before this fix.
Now a little off the subject, @singhbalwinder , can you please take a look at the issue #833?
Hi @wlin7 : Yes, I am working on it. So far, i tried setting up simulations on two machines but unfortunately, queues on both of these seems backed up. I will try at some other machine to see if I can reproduce the problem.
Thanks @singhbalwinder . For this cosp timer, it has been decided to remove it altogether. Fixing the initialization alone does allow cosp to work with pgi and threading, but not with pgi_acc. It was only used for tracking the time use during development. I pushed the branch wlin/atm/cosp_notimer to the github. Please integrate it.
This has been fixed in the latest version of master. @wlin7 or @singhbalwinder , could you please add a comment indicating which PR solved this problem? Thanks.
This was fixed with the commit of https://github.com/ACME-Climate/ACME/commit/fc85a8a087f95ada85092683b331b1687e4fee2d for the branch wlin/atm/cosp_notimer.
As reported by @wlin7 in issue #800:
"The ne30L72 test with this bugfix branch runs ok with pgi_acc, but not w/o issues. Default 1280x1 PE-layout is used. The planned AMIP run is intended as a baseline, so I would like to enable COSP. But it failed after 4 steps (as in the issue at #479). It ran ok if cosp was turned off. "
I reproduced (segmentation fault) and tracked it down to
In cosp_constants.F90 are the statements:
so tsim is being initialized with real values even though it is declared as integer.
Then in cosp_simulator.F90 are numerous examples of statements of the form:
I just commented all timing-related logic:
and
and it then worked. I did not try just fixing the tsim initialization, since the tsim timings are not used anywhere - the data is only collected, so is irrelevant for ACME (and just extra overhead).
I'll let the atmosphere group decided what to do about this, but there is clearly one issue that is not Titan-specific, and the above changes do work on Titan in my reproducer. (I am rebuilding without all of my debug writes at the moment; will verify this with a clean run as soon as the job makes it through the Titan queue.)