E3SM-Project / ACME-ECP

E3SM MMF for DoE ECP project

Segmentation fault occurred with ne30pg2_ne30pg2 -compset FC5AV1C-L #117

Open guangxinglin opened 4 years ago

guangxinglin commented 4 years ago

The model crashed with a segmentation fault (see the error message below) after I updated my local master to the current ECP master today.

I ran it on the Cori machine with ne30pg2_ne30pg2 -compset FC5AV1C-L. CLM_CONFIG_OPTS is set to "-phys clm4_5 -phys clm4_5 -cppdefs -DMODAL_AER", and CAM_CONFIG_OPTS is set to "-phys cam5 -use_SPCAM -crm_adv MPDATA -nlev 72 -microphys mg2 -crm_nz 58 -rad rrtmg -chem none -crm_nx 64 -crm_ny 1 -crm_dx 1000 -crm_dt 5 -crm_nx_rad 64 -crm_ny_rad 1 -SPCAM_microp_scheme sam1mom -cppdefs '-DSP_DIR_NS -DSP_MCICA_RAD' -pcols 256".

Has anyone run into the same problem? It looks like @whannah1 added some debug print statements for drydep_list there. Thanks.

2179: whannah - drydep_list(ispec) : H2O2
1160: whannah - drydep_list(ispec) : H2O2
2092: whannah - drydep_list(ispec) : H2O2
5139: whannah - drydep_list(ispec) : H2O2
2847: forrtl: severe (174): SIGSEGV, segmentation fault occurred
2847: Image              PC                Routine            Line     Source
2847: e3sm.exe           00000000041D690D  Unknown            Unknown  Unknown
2847: e3sm.exe           0000000003A46300  Unknown            Unknown  Unknown
2847: e3sm.exe           0000000001FEA632  drydepvelocity_mp  583      DryDepVelocity.F90
2847: e3sm.exe           000000000192B38C  clm_driver_mp_clm  1070     clm_driver.F90
2847: e3sm.exe           000000000191A10D  lnd_comp_mct_mp_l  509      lnd_comp_mct.F90
2847: e3sm.exe           00000000004243EF  component_modmp    737      component_mod.F90
2847: e3sm.exe           0000000000403E84  cime_comp_modmp    2602     cime_comp_mod.F90
2847: e3sm.exe           000000000042406D  MAIN__             133      cime_driver.F90
2847: e3sm.exe           0000000000401ADE  Unknown            Unknown  Unknown
2847: e3sm.exe           00000000042EC5CF  Unknown            Unknown  Unknown
1920: whannah - drydep_list(ispec) : H2O2

whannah1 commented 4 years ago

We should only set pcols for GPU runs on Summit, and recently Matt informed us that 256 is actually too high for our usual use case there. So definitely take that out and retry just in case that's an issue (I kinda doubt it though).

I don't remember why I put those print statements in there... we should probably take them out.

whannah1 commented 4 years ago

Another note about the physgrid. Until the default is changed (E3SM PR submitted), we need to add this to the namelist for any case running the physgrid:

se_fv_phys_remap_alg = 1

This selects the high-order mapping algorithm developed by Andrew Bradley. The default low-order mapping that I added shouldn't be used with the MMF because it results in a lot of noise.

whannah1 commented 4 years ago

(summarizing some offline conversations here for the record) Jungmin and I were looking into this problem yesterday and determined that it appears to be caused by a change to the "use case" file for the FC5AV1C-L compset, in which someone on the E3SM side added a value for "drydep_list". This doesn't affect the SP compsets or the high res E3SM compset.

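I see 2 temporary workarounds for this:

  • Use the SP compset when running cases with the CRM, and use the "append" feature to modify the CAM_CONFIG_OPTS
  • When running an SP case that starts from "FC5AV1C-L" and modifies CAM_CONFIG_OPTS, override the drydep_list setting, which I think we can do by adding an empty value for drydep_list in the user_nl_atm file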

Not sure if there is a long-term solution to this.

guangxinglin commented 4 years ago

Thanks for debugging this and summarizing it. I have tested the 2 temporary fixes you mentioned. Both of them work well.

For those who don't know "drydep_list": it specifies which chemical species undergo dry deposition in the model. In the FC5AV1C-L compset, drydep_list is set to "'H2O2', 'H2SO4', 'SO2'" by default. This is needed because FC5AV1C-L uses the "-chem linoz_mam4_resus_mom_soag" configuration, which requires the H2O2, H2SO4 and SO2 species. But for our SP1 run we set "-chem none", which does no calculations for these 3 species, so "drydep_list = 'H2O2', 'H2SO4', 'SO2'" caused a problem for our SP1 run. It would not affect the SP2 run, because SP2 uses "-chem linoz_mam4_resus_mom_soag", the same as the FC5AV1C-L compset.
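To make the mismatch concrete, here is a minimal Python sketch of a pre-submit sanity check. It is not part of E3SM or CIME: the function names `chem_package` and `drydep_conflict` are hypothetical, and the check only inspects the option string and a species list that you pass in by hand. It also assumes that if "-chem" appears more than once, the last occurrence is the one that matters.

```python
import shlex
from typing import List, Optional


def chem_package(cam_config_opts: str) -> Optional[str]:
    """Return the value that follows the last '-chem' flag, or None if absent.

    Assumption: when '-chem' is given several times, the last one wins.
    """
    tokens = shlex.split(cam_config_opts)  # keeps quoted -cppdefs values together
    value = None
    for i, tok in enumerate(tokens[:-1]):
        if tok == "-chem":
            value = tokens[i + 1]
    return value


def drydep_conflict(cam_config_opts: str, drydep_list: List[str]) -> bool:
    """True when dry-deposition species are requested but chemistry is 'none'."""
    return chem_package(cam_config_opts) == "none" and len(drydep_list) > 0


if __name__ == "__main__":
    # Abbreviated version of the SP1 options from this issue.
    sp1_opts = "-phys cam5 -use_SPCAM -chem none -cppdefs '-DSP_DIR_NS -DSP_MCICA_RAD'"
    fc5_default = ["H2O2", "H2SO4", "SO2"]  # drydep_list default in FC5AV1C-L
    if drydep_conflict(sp1_opts, fc5_default):
        print("drydep_list is non-empty but -chem is none; override drydep_list")
```

With the SP1 options and the FC5AV1C-L default list the check reports a conflict. With an empty list, or with "-chem linoz_mam4_resus_mom_soag" as in SP2, it does not.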

I prefer using the SP1 compset to avoid the problem, in case FC5AV1C-L changes further in the future and brings other conflicts with our SP1 setups.

On Fri, Nov 1, 2019 at 7:26 AM Walter Hannah notifications@github.com wrote:

(summarizing some offline conversations here for the record) Jungmin and I were looking into this problem yesterday and determined that it appears to be caused by a change to the "use case" file for the FC5AV1C-L compset, in which someone on the E3SM side added a value for "drydep_list". This doesn't affect the SP compsets or the high res E3SM compset.

I see 2 temporary workarounds for this:

  • Use the SP compset when running cases with the CRM, and use the "append" feature to modify the CAM_CONFIG_OPTS
  • When running an SP case that starts from "FC5AV1C-L" and modifies CAM_CONFIG_OPTS, override the drydep_list setting, which I think we can do by adding an empty value for drydep_list in the user_nl_atm file

Not sure if there is a long-term solution to this.
