Closed. matthewhoffman closed this pull request 7 years ago.
@amametjanov or @worleyph, this PR is ready to merge. Please test on mira to verify that this solves #1133, and causes no other issues. Output is no longer written from all ranks. Please also test on any other systems or compilers where text output is a concern. Thanks to @matthewhoffman for a dedicated effort leading this.
FYI, all text messages (log, warning, error) are written with the mpas_log_write subroutine. If someone mistakenly adds a write(*,*) or print * from N processors, it will write N times to the acme.log.* file, at least when I tested on wolf with gnu.
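The masterOnly behavior being described can be sketched in a few lines of Python (a toy model only; the rank argument stands in for the MPI rank, and this is not the actual MPAS implementation):

```python
import io

def log_write(message, rank, master_only=True, log=None):
    # With master_only, only rank 0 writes to the shared log; a bare
    # write(*,*) / print * from every rank would emit N copies instead.
    if master_only and rank != 0:
        return
    if log is not None:
        log.write(message + "\n")

# Simulate 4 ranks sharing one log stream.
shared = io.StringIO()
for rank in range(4):
    log_write("hello", rank, master_only=True, log=shared)
print(shared.getvalue().count("hello"))  # 1 copy, not 4
```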
@matthewhoffman , trying this on Titan
-compset A_WCYCL2000 -res ne30_oEC -mach titan -compiler intel
Build is failing with
mpas_log.f90(560): error #6468: Invalid OpenMP* CRITICAL directive section name. [MPAS_LOG_WRITE]
mpas_log.f90(588): error #6413: This global name is invalid in this context. [MPAS_LOG_WRITE]
mpas_log.f90(1035): error #7002: Error in opening the compiled module file. Check INCLUDE paths. [MPAS_LOG]
mpas_log.f90(1071): error #6404: This name does not have a type, and must have an explicit type. [MPAS_LOG_ESCAPE_DOLLARS]
mpas_log.f90(1071): error #6632: Keyword arguments are invalid without an explicit interface. [MESSAGETYPE]
mpas_log.f90(1035): error #6580: Name in only-list does not exist. [MPAS_LOG_WRITE]
mpas_log.f90(1035): error #6580: Name in only-list does not exist. [MPAS_LOG_ESCAPE_DOLLARS]
mpas_log.f90(560): error #6468: Invalid OpenMP* CRITICAL directive section name. [MPAS_LOG_WRITE]
mpas_log.f90(588): error #6413: This global name is invalid in this context. [MPAS_LOG_WRITE]
mpas_log.f90(1035): error #7002: Error in opening the compiled module file. Check INCLUDE paths. [MPAS_LOG]
mpas_log.f90(1071): error #6404: This name does not have a type, and must have an explicit type. [MPAS_LOG_ESCAPE_DOLLARS]
mpas_log.f90(1071): error #6632: Keyword arguments are invalid without an explicit interface. [MESSAGETYPE]
mpas_log.f90(1035): error #6580: Name in only-list does not exist. [MPAS_LOG_WRITE]
mpas_log.f90(1035): error #6580: Name in only-list does not exist. [MPAS_LOG_ESCAPE_DOLLARS]
I'll also try PGI.
The above are the error messages that came to the screen. Looking at the build logs, they include identical error messages from the mpas-cice and mpas-o builds.
@worleyph , thanks for trying this. I know this was tested in standalone MPAS using Intel, so it's disappointing to see these errors. We'll start looking into it. If PGI reveals any other issues, let me know.
Titan is still using intel/15. This should probably be updated some day, but I have also had trouble with intel/16 and intel/17 (on Titan). Someone else would need to take on this task - perhaps @minxu74 .
PGI build was successful. I'll submit a job to see if it behaves correctly. I'll have to wait until I can "observe" it, to see what happens during the run.
Ok, sounds good. We removed the lines that were opening scratch files, so if nothing else, you should not be seeing those - but I think it is a good idea to manually watch for the behavior you expect to see.
@worleyph , one of the other MPAS developers reports the following:
I've not seen this in my testing. I just tried compiling the 'develop' branch with the Intel 15.0.1 compilers, and I don't get any errors.
It's purely a guess, but perhaps the compiler is unhappy because the critical section has the same name as a subroutine? What if we change this line
!$OMP CRITICAL (mpas_log_write)
to something like this
!$OMP CRITICAL (mpas_log_write_section)
Do you want to give that a try on Titan? I can make a branch with that change for you to try if it's not obvious to you how to do that or you don't want to deal with implementing it.
I recompiled the head of the ocean/develop ocean core with Intel 17 on grizzly and got no errors, just like before. @worleyph , could you post the mpas_log.f90 file that has the lines corresponding to your first Titan error messages above? I compiled with GEN_F90=true, but my line numbers in the generated mpas_log.f90 don't match your error line numbers, so it is confusing.
My latest build worked after making the suggested changes to mpas_log.F in mpas-o and mpas-cice:
line 558
!$OMP CRITICAL (mpas_log_write_section)
line 586
!$OMP END CRITICAL (mpas_log_write_section)
(I guess that the same change would be needed in mpasli as well?)
Version of the Intel compiler is 15.0.2.164 .
PGI job ran, and died during ice initialization:
ice.log contained:
=== Beginning ice_init_mct: rank= 0
----------------------------------------------------------------------
Beginning MPAS-cice Output Log File for task 0 of 512
Opened at 2017/04/26 20:04:10
----------------------------------------------------------------------
tail of acme.log was
...
_pmiu_daemon(SIGCHLD): [NID 03042] [c6-0c1s1n0] [Wed Apr 26 20:04:16 2017] PE RANK 16 exit signal Segmentation fault
_pmiu_daemon(SIGCHLD): [NID 01429] [c2-2c2s5n1] [Wed Apr 26 20:04:17 2017] PE RANK 151 exit signal Segmentation fault
[NID 01429] 2017-04-26 20:04:13 Apid 14287716: initiated application termination
_pmiu_daemon(SIGCHLD): [NID 01357] [c2-2c2s6n3] [Wed Apr 26 20:04:14 2017] PE RANK 210 exit signal Segmentation fault
[NID 03042] 2017-04-26 20:04:16 Apid 14287716: Error detected during page fault processing. Process terminated via bus error.
...
I'll try again. If it repeats I'll post the case particulars. This is the PGI build and without the critical section name modification.
The Intel build with the critical section renamed got beyond the above error. It then hung with the usual MPI_RSEND issue during the ocean history rewrite. Rebuilt with PGI, also with the critical section rename, and resubmitted (now with config_write_output_on_startup = .false.).
PGI run died in the same way as before.
@mark-petersen , just curious - did you specify to build with threading during your build?
@worleyph , Ok, well that's good that we got past the Intel issue quickly. I'll wait to update the PR until we figure out the PGI issue. By chance, is there a file created something like log.cice.0000.err? If an MPAS component actually encounters an error, it will create such a file. It doesn't look like it, though, because the error message would also be included in the standard ice output file.
I didn't see any error files created.
Ok. Presumably it is dying somewhere in here:
https://github.com/ACME-Climate/ACME/blob/f2056d02d57518bf1dd28bc6bd9cf0fef4874c2b/components/mpas-cice/driver/ice_comp_mct.F#L318-L409
318 is the line that opens the log manager and writes the opening lines you pasted above.
409 is the first place where a message is written.
So it is either dying somewhere in that range or, perhaps more likely, when the code attempts to write a message (409). I'm not sure what to suggest besides adding some lines like write(iceLogUnit,*) "hello" to isolate where it is dying.
How do I interpret the following call in ice_comp_mct.F?
if (trim(runtype) /= 'initial') dayOfNextShortwaveCalculation = -1
call mpas_log_write('dayOfNextShortwaveCalculation = $r', realArgs=(/dayOfNextShortwaveCalculation/))
The mpas_log_write declaration is
subroutine mpas_log_write(message, messageType, masterOnly, flushNow, &
intArgs, realArgs, logicArgs, err)
and I am not sure what $r does anyway. (Have to look at my Fortran documentation, I guess.)
Sorry - been looking at scripts (sh, perl, python) too much - $r is just part of the string? Is it more than this?
Note that the failure is at these lines - all ICE component processes make it to just before this, and none of them make it through to the other side (to the extent that I can determine this; I have MPI_Barriers both before and after, so I am pretty sure of this diagnosis).
@worleyph , the $r is part of a string expansion 'mini-language' we implemented for the log module, to allow developers to insert variables into log messages without having to declare a local variable and use a write statement into that variable before sending the message to mpas_log_write. It's documented here: https://docs.google.com/document/d/1Tc1MdnclUVaxi5NQgJ-8FLwmxLhSC3x51BOAgjcMl4k/edit but I don't know that you need to read through it to get the idea.
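For anyone who doesn't want to open the design document, the expansion idea can be illustrated with a toy Python re-implementation (illustration only; the real Fortran code in mpas_log.F handles more codes and formatting details):

```python
def expand_message(message, int_args=(), real_args=()):
    # Toy version of the $i/$r expansion: each $i is replaced by the
    # next integer argument, each $r by the next real argument.
    ints, reals = iter(int_args), iter(real_args)
    out = []
    i = 0
    while i < len(message):
        if message[i] == "$" and i + 1 < len(message):
            code = message[i + 1]
            if code == "i":
                out.append(str(next(ints)))
                i += 2
                continue
            if code == "r":
                out.append(f"{next(reals):.15g}")
                i += 2
                continue
        out.append(message[i])
        i += 1
    return "".join(out)

print(expand_message("dayOfNextShortwaveCalculation = $r", real_args=(1.0,)))
```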
I will take a closer look at those lines. I am wondering if dayOfNextShortwaveCalculation has an incorrect data type or something. When I modified this line, I remember thinking it odd that it was a real (and a pointer) and not an integer.
@worleyph , I have an idea.
It seems odd that dayOfNextShortwaveCalculation is referenced (504-510) before the core is initialized - it could be that the pointer to this variable is still unassociated at this point (I don't recall offhand when the data types get initialized).
We could try moving 500-513 after 515-519 and see if that gets past that error. If so, we would need to make sure that dayOfNextShortwaveCalculation is not needed in domain % core % core_init(domain, timeStamp). (cc'ing @akturner in case he knows offhand)
If that doesn't work, then we might need to debug into mpas_log_write to find where the seg fault occurs.
@matthewhoffman , trying your suggestion now.
@matthewhoffman , after moving the code, the run still failed, but now apparently in
ierr = domain % core % core_init(domain, timeStamp)
if ( ierr /= 0 ) then
call mpas_log_write('Core init failed for core ' // trim(domain % core % coreName), MPAS_LOG_CRIT)
end if
Error messages were the same though - seg. faults and bus errors. Perhaps dayOfNextShortwaveCalculation is used in core_init?
My turnaround for these experiments will slow down for the rest of today. I'll keep poking, but would appreciate any more suggestions.
@worleyph , thanks for trying that.
So if it is still dying on call mpas_log_write, that suggests a bug in that routine - even if dayOfNextShortwaveCalculation was still problematic, that variable is not used in this call to mpas_log_write, so mpas_log_write should have been able to run successfully. (That interpretation is also supported by the fact that the current ordering of the code existed prior to the log module introduction and presumably was not causing these seg. faults before.)
The only other thing I can think of right now is to pepper mpas_log_write with lines like write(mpas_log_info % outputLog % unitNum, *) "hello" in order to isolate where specifically it is dying.
@matthewhoffman ,
Regarding "So if it is still dying on call mpas_log_write" - that is not clear. It could be dying in the core_init call. (In fact, that was my assumption.) I'll make another run to verify.
@worleyph I can now complete our MPAS-O nightly regression suite and a 4/8 threaded bfb comparison with both intel and gnu, both with DEBUG=true OPENMP=true, with the following changes. In ACME you would need to make the changes in all three MPAS cores. We are working on getting these into the MPAS repo now to update this ACME PR.
gr-fe3.lanl.gov> git diff HEAD~2
diff --git a/src/framework/mpas_dmpar.F b/src/framework/mpas_dmpar.F
index bede4fd3a..78ae42dd7 100644
--- a/src/framework/mpas_dmpar.F
+++ b/src/framework/mpas_dmpar.F
@@ -244,8 +244,8 @@ include 'mpif.h'
desiredThreadLevel = MPI_THREAD_MULTIPLE
call MPI_Init_thread(desiredThreadLevel, threadLevel, mpi_ierr)
if ( threadLevel /= desiredThreadLevel ) then
- call mpas_log_write('MPI implementation gave thread level of $i when $i was requested.', &
- MPAS_LOG_WARN, intArgs=(/threadLevel, desiredThreadLevel/) )
+ write(0, *) 'WARNING: In mpas_dmpar_init, MPI implementation gave thread level of ', &
+ threadLevel, ' when ', desiredThreadLevel, ' was requested.'
end if
#else
call MPI_Init(mpi_ierr)
diff --git a/src/framework/mpas_log.F b/src/framework/mpas_log.F
index 423a50afb..06bb342f0 100644
--- a/src/framework/mpas_log.F
+++ b/src/framework/mpas_log.F
@@ -555,7 +555,7 @@ module mpas_log
! --- Actually write the message here! ---
! We have to treat MPAS_LOG_OUT messages separately from others because that type
! does not have a prefix and we want to add a space after the trimmed prefix for other types.
-!$OMP CRITICAL (mpas_log_write)
+!$OMP CRITICAL (mpas_log_write_section)
write(mpas_log_info % outputLog % unitNum, '(a,a,a)') trim(messageExtendedPrefix), ' ', trim(messageExpanded)
! Optionally flush the buffer (only attempt if the file is active)
@@ -583,7 +583,7 @@ module mpas_log
case default
! TODO handle this error? (would have encountered it above...)
end select
-!$OMP END CRITICAL (mpas_log_write)
+!$OMP END CRITICAL (mpas_log_write_section)
endif
endif
@mark-petersen , the above includes more than just the critical section renaming. Does this have anything to do with the PGI bug that I am chasing? I assume not, since I am debugging without threading at the moment? I'll drop it in anyway. Thanks.
@worleyph , Mark discovered that other required change when he did some additional testing this afternoon with threading turned on in standalone MPAS. (It fixes a situation where a call to the log manager occurs before the log manager is initialized - but this situation would only be triggered if threadLevel /= desiredThreadLevel, which is why we didn't expose it previously in our testing.)
We weren't sure if you had threading enabled in the PGI-related problem you referenced above, so we thought there might be a chance this was the fix for that. I guess not!
@matthewhoffman , code dies in core_init if this is called before lines 500-519. Could still be related to dayOfNextShortwaveCalculation?
@matthewhoffman , just saw your new comment. I started out with a mix of threading (none in ocean and 2 threads in sea ice), but moved to no threading in both after I started chasing the PGI issue.
@worleyph , thanks for the update. If it is now dying in core_init or potentially something related to dayOfNextShortwaveCalculation, I'm going to ask @akturner to take a look at this as well - if this may no longer be due to the log manager itself, he is likely to be more familiar with these parts of the code.
By the way, thinking about it again, I'm almost positive that all the MPAS data types are allocated long before core_init, so there should not be a need to move that bit related to dayOfNextShortwaveCalculation after the core_init part.
@worleyph , do you see any potential problem here:
https://github.com/ACME-Climate/ACME/blob/f2056d02d57518bf1dd28bc6bd9cf0fef4874c2b/components/mpas-cice/driver/ice_comp_mct.F#L506
where the variable passed in (dayOfNextShortwaveCalculation) has type real(kind=RKIND), pointer but the interface is for real(SHR_KIND_R8), optional, intent(OUT) :: nextsw_cday ?
@matthewhoffman , latest failure is in
call mpas_pool_print_summary(domain % configs, MPAS_POOL_CONFIG)
in core_init.
@worleyph , in the ice.log file do you actually see any config options printed? It would basically be a list of namelist options and their values that follow a header line that says ----- namelist options used -----. If so, can you paste them here or on pastebin or something?
@akturner , can you look at this? Do you do anything unusual with configs, like manually add items to the configs pool or modify them at a low level in any way?
@worleyph , you might also try running the MPASLI_ONLY compset to see if this issue is specific to the sea ice core, or if it is just showing up here because it runs first. I know that MPASLI calls this same routine.
@matthewhoffman ,
a) OCN init occurs before ICE, so this appears to be particular to ICE.
b) The only thing in ice.log is
=== Beginning ice_init_mct: rank= 0
----------------------------------------------------------------------
Beginning MPAS-cice Output Log File for task 0 of 512
Opened at 2017/04/28 00:52:48
----------------------------------------------------------------------
c) Note that I am running with the modified version of the code, with core_init moved a little earlier. There are lots of mpas_log_write calls in mpas_pool_print_summary. My next run is checking where it dies in there. (in the queue).
@worleyph , ok thanks for the info that OCN runs first - that's good to know.
Note that this line:
https://github.com/ACME-Climate/MPAS/blob/cice/develop/src/core_cice/model_forward/mpas_cice_core.F#L75
should appear in the log file before the call to mpas_pool_print_summary occurs. However, it is possible that the buffer is not flushed before the model dies, which would confuse the interpretation. If you want, you can force a flush by adding a flushNow=.true. argument to the mpas_log_write call. Actually, you might want to add that to these lines as well:
https://github.com/ACME-Climate/MPAS/blob/cice/develop/src/framework/mpas_pool_routines.F#L5646-L5659
Current failure is at line
call mpas_log_write(' ' // trim(poolItr % memberName) // ' = $r', realArgs=(/tempReal/))
so another example of $r and realArgs. Perhaps this has nothing to do with mpas_log_write proper, since other calls to it have worked. Any ideas on how to debug the above line?
Sorry - I misunderstood what $r was - I thought the code was expanded before compilation. I looked at the design document, and the code, and see that this is part of the mpas_log code. I'll poke at this.
@worleyph , is it possible that PGI doesn't support the G format here:
https://github.com/ACME-Climate/MPAS/blob/cice/develop/src/framework/mpas_log.F#L968
I thought it was Fortran standard, but maybe there is something nonstandard about it. I'm wondering if the code is dying on the first usage of $r, wherever that happens to occur.
Got me. My current experiment will determine whether the failure is happening inside of log_expand_string. If so, I'll look at this more closely.
Appears to be dying when trying to execute
write(varPart, trim(realFormat)) realArgs(iReal)
(after
write(realFormat, '(a, i2.2, a)') '(G0.', realPrecision, ')'
has executed). I'm writing out realFormat, realArgs(iReal), and varPart next.
So, what does it mean to write to a null string logical unit?
write(varPart,*) ''
...
write(varPart, trim(realFormat)) realArgs(iReal)
Printed out the variables, and see
realFormat: (G0.15) trim(realFormat): (G0.15) realArgs(iReal): 1.000000000000000 varPart:
And what does it mean to have a '0' width for a G format? Looks like this would not be meaningful?
Sorry - I was looking at the F77 documentation for the G format.
@worleyph , yes, we ended up using more modern options of the format specification. This was the discussion that inspired that format we went with: https://software.intel.com/en-us/forums/intel-visual-fortran-compiler-for-windows/topic/290952 I am wondering if PGI gets confused by it for some reason.
You could (1) try changing it to a simpler f format and see if that helps, or (2) try a plain old manual write statement with the G format. I do not yet see any obvious problem.
From http://fortranwiki.org/fortran/show/Fortran+2008+status: (last updated in April 2017)
Input/Output Absoft Cray GNU HP IBM Intel NAG Oracle PathScale PGI
g0 edit descriptor N Y Y N N Y Y N N N
so no IBM support either (if correct).
The page indicates that PGI support was still absent as late as 16.4. We are still using 15.7 on Titan.
@worleyph , wow, ok - that's disappointing. Guess we're stuck with the lowest common denominator. I tried looking for something like that table but could not find it - good find.
We can switch it to 'd' or something like that I suppose.
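For reference, the behavior G0.15 was chosen for (a general format with no fixed field width and up to 15 significant digits) is easy to see in another language; Python's :.15g conversion is a rough analogue (an illustration of the intended output only, not a proposed fix):

```python
# Rough analogue of Fortran's G0.15: shortest general representation,
# no leading padding, at most 15 significant digits.
for x in (1.0, 0.000123456789012345, 6.02214076e23):
    print(f"{x:.15g}")
```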
I may not be around for testing the next iteration of this PR, unless you want to provide me with the fix that I can drop into this branch within the next few days. I leave next Thursday for extended travel (and will be distracted a few days before). @minxu74 will need to take over for testing Titan and both Intel and PGI.
@worleyph , ok, thanks. I'll try to get a proposal together quickly, but I want to get feedback from the other MPAS developers before finalizing a choice for this format, so we'll see how long that takes. I appreciate your effort in tracking this down.
This merge introduces the new MPAS log system into ACME. With this change, MPAS no longer hijacks stdout and stderr but instead opens its own log file with a unit number provided by ACME. This merge updates the three MPAS submodules to use the new system in each component and also updates each driver to set up and use it.
With this change, MPAS components no longer use scratch files to dump unwanted output.
Fixes #1133.
[BFB]