CICE-Consortium / CICE

Development repository for the CICE sea-ice model

Intel20 compiler issue in ice_transport_remap #461

Open apcraig opened 4 years ago

apcraig commented 4 years ago

#460 includes what we believe is a compiler bug workaround in ice_transport_remap, but more analysis needs to be done. As highlighted in #460:

ice_transport_remap seems to have persistent seg fault issues, but they appear in different places; there's a comment about one of the omp directives seg faulting, and in the past, I've had to unroll a loop in the transport (I no longer remember which one) in order for optimization to not create a seg fault. Is there a particular (set of) variable(s) that need to be allocated, or is it really all of them? Is there a reason to not allocate here all the time? Would it help to move to a vector version of the transport (e.g. the new unstructured-grid code in MPAS)?

We need to test on other machines with the intel20 compiler, understand the problem better (whether a coding issue or simply a coding vulnerability), and try to figure out a more robust solution.

phil-blain commented 4 years ago

@apcraig have you tried running the model under Valgrind? Maybe that could help. On our Cray XC system we have a Cray-installed valgrind4hpc module that may be of help (from a quick Google search, it seems to be a standard install on most Cray systems).

phil-blain commented 4 years ago

Also, the Intel compilers can be installed on personal Linux computers for open-source contributors, so that might be a way to test the Intel 2020 compiler if it's hard to find another machine with it...

apcraig commented 4 years ago

Testing was carried out on other machines and the workaround in #460 was removed in #462. We believe the problem arises only on izumi due to some system issues on that particular machine. I have renamed the issue to reflect that.

apcraig commented 4 years ago

The Intel problem was reproduced on Orion using the same problem. This truly does seem to be a compiler bug. We should consider implementing the workaround.

dabail10 commented 4 years ago

I thought initially that orion did not have the issue with Intel20? Do we know what changed? I see that cheyenne has not moved to intel 20 yet, so perhaps there will be a compiler fix in the next update to intel 20?

jedwards4b commented 4 years ago

Just reading through this - did you try adjusting the OMP_STACKSIZE variable?

dabail10 commented 4 years ago

I just tried OMP_STACKSIZE of 256M and 1024M. Neither did anything for this.

dabail10 commented 4 years ago

We just talked about this at the CSEG meeting. There was a lot of pushback claiming it is the CICE code and not intel20. There are three tests that fail for CESM2 on izumi with intel 20. Two of these are failing at the same place in ice_transport_remap.F90 and this is the CICE5 code base. One of the tests is actually failing in POP. Jim Edwards has offered to help us debug this, but we need to come up with a reproducible case for this. In terms of izumi, our lab director has said that izumi is still an important tool for CGD and support will continue on this. They currently have someone from another lab helping with this until a replacement hire is made for Mark.

apcraig commented 4 years ago

Thanks for the info @dabail10. It certainly could be an issue with the CICE implementation. It's just odd that this code has existed for 10 years or more and has been run on probably hundreds of different compilers and compiler versions over that time, and none has had a problem until this version of intel20. I have spent some time debugging and have a workaround/fix in my back pocket that migrates the subroutine's static memory allocation to dynamic. This seems to get rid of the error, although, since I don't understand the underlying problem, I don't know whether it addresses it (if it exists). I'd be happy if someone else wants to take a look! @jedwards4b, would a simple standalone CICE6 case that fails be adequate? Also happy to show you my workaround and talk about my debugging efforts.