Open apcraig opened 4 years ago
@apcraig have you tried running the model under Valgrind? Maybe that could help. On our Cray XC system we have a Cray-installed valgrind4hpc module that may be of help (from a quick Google search, it seems to be a standard install on most Cray systems).
Also, the Intel compilers can be installed on personal Linux computers for open-source contributors, so that might be a way to test the Intel 2020 compiler if it's hard to find another machine with it...
Testing was carried out on other machines and the workaround in #460 was removed in #462. We believe the problem arises only on izumi due to some system issues on that particular machine. I have renamed the issue to reflect that.
The Intel problem was reproduced on Orion with the same test problem. This truly does seem to be a compiler bug. We should consider implementing the workaround.
I thought initially that orion did not have the issue with Intel20? Do we know what changed? I see that cheyenne has not moved to intel 20 yet, so perhaps there will be a compiler fix in the next update to intel 20?
Just reading through this - did you try adjusting the OMP_STACKSIZE variable?
I just tried OMP_STACKSIZE of 256M and 1024M. Neither did anything for this.
We just talked about this at the CSEG meeting. There was a lot of pushback claiming it is the CICE code and not intel20. There are three tests that fail for CESM2 on izumi with intel 20. Two of these are failing at the same place in ice_transport_remap.F90 and this is the CICE5 code base. One of the tests is actually failing in POP. Jim Edwards has offered to help us debug this, but we need to come up with a reproducible case for this. In terms of izumi, our lab director has said that izumi is still an important tool for CGD and support will continue on this. They currently have someone from another lab helping with this until a replacement hire is made for Mark.
Thanks for the info @dabail10. It certainly could be an issue with the CICE implementation. It's just odd that this code has existed for 10 years or more and has been run on probably hundreds of different compilers and compiler versions over that time, and none has had a problem until this version of intel20. I have spent some time debugging and have a workaround/fix in my back pocket that migrates the subroutine's static memory allocation to dynamic allocation. This seems to get rid of the error, although since I don't understand the underlying problem, I don't know whether the change actually addresses it (if a real problem exists). I'd be happy if someone else wants to take a look! @jedwards4b, would a simple standalone CICE6 case that fails be adequate? I'm also happy to show you my workaround and talk about my debugging efforts.
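For readers unfamiliar with the kind of change described above, here is a minimal sketch of moving a subroutine-local work array from a declaration sized by the dummy arguments to an explicit allocatable. It assumes that "static" in the comment refers to fixed-shape local arrays of this kind. The module, subroutine, and variable names are hypothetical and are not taken from ice_transport_remap.F90, and this is not the actual workaround.

```fortran
! Hypothetical sketch only: names and shapes are illustrative, not from CICE.
module remap_sketch
   implicit none
contains

   ! Before: the local work array is sized by the dummy arguments
   ! (an automatic array, which compilers commonly place on the stack).
   subroutine work_automatic(nx, ny, total)
      integer, intent(in) :: nx, ny
      real, intent(out) :: total
      real :: work(nx, ny)

      work = 1.0
      total = sum(work)
   end subroutine work_automatic

   ! After: the same array is declared allocatable and allocated explicitly,
   ! so its storage is requested dynamically at run time.
   subroutine work_allocatable(nx, ny, total)
      integer, intent(in) :: nx, ny
      real, intent(out) :: total
      real, allocatable :: work(:,:)
      integer :: ierr

      allocate(work(nx, ny), stat=ierr)
      if (ierr /= 0) error stop 'work_allocatable: allocation failed'

      work = 1.0
      total = sum(work)

      deallocate(work)
   end subroutine work_allocatable

end module remap_sketch

program demo
   use remap_sketch
   implicit none
   real :: t_auto, t_alloc

   ! Both variants perform the same computation on a small 4x3 array.
   call work_automatic(4, 3, t_auto)
   call work_allocatable(4, 3, t_alloc)
   print *, t_auto, t_alloc
end program demo
```

Both subroutines compute the same result; only where the work array's storage comes from differs. That is why a change like this can make a stack-related or compiler-related failure disappear without showing what the root cause is.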
#460 includes what we believe is a compiler bug workaround in ice_transport_remap, but more analysis needs to be done. As highlighted in #460, we need to test on other machines with the intel20 compiler, understand the problem better (whether it is a coding issue or simply a coding vulnerability), and try to figure out a more robust solution.