E3SM-Project / v3atm

Fork of E3SM for testing v3 atm changes

P3 threading fixes from p3_PR in E3SM repo for NGD_v3atm #37

Closed crterai closed 1 year ago

crterai commented 1 year ago

@yunpengshan2014 identified a fix for the threading issue that was seen in the P3 PR. This PR also pads the dimension variable names, which is required on certain machines.

[BFB] for non-threading tests.

crterai commented 1 year ago

Ran the e3sm_v3atm_integration_f20tr_chemuci_linozv3 test on chrysalis and it came back with the following:

20230118_130500_8ommpm: 6 tests
  ERP_Ld3.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm (Overall: FAIL) details:
    PASS ERP_Ld3.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm CREATE_NEWCASE
    PASS ERP_Ld3.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm XML
    PASS ERP_Ld3.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm SETUP
    PASS ERP_Ld3.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm SHAREDLIB_BUILD time=959
    PASS ERP_Ld3.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm MODEL_BUILD time=5565
    PASS ERP_Ld3.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm SUBMIT
    PASS ERP_Ld3.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm RUN time=697
    FAIL ERP_Ld3.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm COMPARE_base_rest
    FAIL ERP_Ld3.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm MEMLEAK memleak detected, memory went from 3706.140000 to 4469.500000 in 1 days
    PASS ERP_Ld3.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm SHORT_TERM_ARCHIVER
  ERS_Ld3.ne4pg2_oQU480.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm (Overall: FAIL) details:
    PASS ERS_Ld3.ne4pg2_oQU480.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm CREATE_NEWCASE
    PASS ERS_Ld3.ne4pg2_oQU480.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm XML
    PASS ERS_Ld3.ne4pg2_oQU480.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm SETUP
    PASS ERS_Ld3.ne4pg2_oQU480.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm SHAREDLIB_BUILD time=355
    PASS ERS_Ld3.ne4pg2_oQU480.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm MODEL_BUILD time=2917
    PASS ERS_Ld3.ne4pg2_oQU480.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm SUBMIT
    PASS ERS_Ld3.ne4pg2_oQU480.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm RUN time=148
    FAIL ERS_Ld3.ne4pg2_oQU480.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm COMPARE_base_rest
    PASS ERS_Ld3.ne4pg2_oQU480.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm MEMLEAK insuffiencient data for memleak test
    PASS ERS_Ld3.ne4pg2_oQU480.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm SHORT_TERM_ARCHIVER
  ERS_Ln11.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm_rtmoff (Overall: FAIL) details:
    PASS ERS_Ln11.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm_rtmoff CREATE_NEWCASE
    PASS ERS_Ln11.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm_rtmoff XML
    PASS ERS_Ln11.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm_rtmoff SETUP
    PASS ERS_Ln11.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm_rtmoff SHAREDLIB_BUILD time=363
    PASS ERS_Ln11.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm_rtmoff MODEL_BUILD time=2893
    PASS ERS_Ln11.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm_rtmoff SUBMIT
    PASS ERS_Ln11.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm_rtmoff RUN time=213
    FAIL ERS_Ln11.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm_rtmoff COMPARE_base_rest
    PASS ERS_Ln11.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm_rtmoff MEMLEAK insuffiencient data for memleak test
    PASS ERS_Ln11.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm_rtmoff SHORT_TERM_ARCHIVER
  PEM_Ln9.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm (Overall: FAIL) details:
    PASS PEM_Ln9.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm CREATE_NEWCASE
    PASS PEM_Ln9.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm XML
    PASS PEM_Ln9.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm SETUP
    PASS PEM_Ln9.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm SHAREDLIB_BUILD time=964
    PASS PEM_Ln9.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm MODEL_BUILD time=5561
    PASS PEM_Ln9.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm SUBMIT
    PASS PEM_Ln9.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm RUN time=221
    FAIL PEM_Ln9.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm COMPARE_base_modpes
    PASS PEM_Ln9.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm MEMLEAK insuffiencient data for memleak test
    PASS PEM_Ln9.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm SHORT_TERM_ARCHIVER
  PET_Ln5.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm (Overall: FAIL) details:
    PASS PET_Ln5.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm CREATE_NEWCASE
    PASS PET_Ln5.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm XML
    PASS PET_Ln5.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm SETUP
    PASS PET_Ln5.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm SHAREDLIB_BUILD time=395
    PASS PET_Ln5.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm MODEL_BUILD time=2952
    PASS PET_Ln5.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm SUBMIT
    PASS PET_Ln5.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm RUN time=103
    FAIL PET_Ln5.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm COMPARE_base_single_thread
    PASS PET_Ln5.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm MEMLEAK insuffiencient data for memleak test
    PASS PET_Ln5.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm SHORT_TERM_ARCHIVER
  PET_Ln5_P256x2.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm (Overall: FAIL) details:
    PASS PET_Ln5_P256x2.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm CREATE_NEWCASE
    PASS PET_Ln5_P256x2.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm XML
    PASS PET_Ln5_P256x2.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm SETUP
    PASS PET_Ln5_P256x2.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm SHAREDLIB_BUILD time=392
    PASS PET_Ln5_P256x2.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm MODEL_BUILD time=2956
    PASS PET_Ln5_P256x2.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm SUBMIT
    PASS PET_Ln5_P256x2.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm RUN time=107
    FAIL PET_Ln5_P256x2.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm COMPARE_base_single_thread
    PASS PET_Ln5_P256x2.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm MEMLEAK insuffiencient data for memleak test
    PASS PET_Ln5_P256x2.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm SHORT_TERM_ARCHIVER

Unfortunately, this didn't seem to fix the ERP test.
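
For reading a status listing like the one above, a minimal Python sketch can pull out only the failing phases per test. It assumes each phase line has the form `<PASS|FAIL> <test name> <phase> [details]`, as in the pasted output; the function name `failing_phases` is hypothetical and not part of CIME.

```python
from collections import defaultdict


def failing_phases(status_text):
    """Map each test name to the list of phases reported as FAIL."""
    failures = defaultdict(list)
    for line in status_text.splitlines():
        parts = line.split()
        # Phase lines start with PASS or FAIL; the per-test header lines
        # start with the test name, so they are skipped by this check.
        if len(parts) >= 3 and parts[0] == "FAIL":
            failures[parts[1]].append(parts[2])
    return dict(failures)


# A few lines copied from the listing above.
sample = """\
  ERP_Ld3.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm (Overall: FAIL) details:
    PASS ERP_Ld3.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm RUN time=697
    FAIL ERP_Ld3.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm COMPARE_base_rest
    FAIL ERP_Ld3.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm MEMLEAK memleak detected, memory went from 3706.140000 to 4469.500000 in 1 days
    PASS PET_Ln5.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm RUN time=103
    FAIL PET_Ln5.ne30pg2_EC30to60E2r2.F20TR_chemUCI-Linozv3.chrysalis_intel.eam-20tr_v3atm COMPARE_base_single_thread
"""

for test, phases in failing_phases(sample).items():
    print(test.split(".")[0], phases)
```

Applied to the full listing, this shows that every test built and ran, and that the failures are all in the COMPARE phases (plus MEMLEAK for the ERP test).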

wlin7 commented 1 year ago

The second ERP run also halves both the MPI tasks and the threads, so it is not a surprise that it failed when the PET tests are failing.
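
The halving described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the test framework's code: the function name `halved_layout` is hypothetical, and the 256x2 layout is borrowed from the `P256x2` test name, assuming the usual tasks-by-threads reading.

```python
def halved_layout(ntasks, nthreads):
    """Return a (tasks, threads) pair with both counts halved, never below 1."""
    return max(1, ntasks // 2), max(1, nthreads // 2)


# Example layout (assumption): 256 MPI tasks with 2 threads each.
base = (256, 2)
print(halved_layout(*base))  # (128, 1)
```

With 2 threads in the first run, the second run ends up single-threaded, so its comparison also exercises the threaded-versus-unthreaded difference that the PET tests check.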

singhbalwinder commented 1 year ago

I think @yunpengshan2014 ran the ERP test and got a PASS after this fix. @yunpengshan2014: Do you think some code may have been commented out when you tested with ERP?

yunpengshan2014 commented 1 year ago

Hi Balwinder,

The changes in this link are all for the threading issue fix: https://github.com/E3SM-Project/v3atm/pull/37/files. I only changed the variable declarations and initialization, and then it passed the ERP_D_P2x48.ne4_oQU240.F2010-P3 and PET_D_Ln1_P2x48.ne4_oQU240.F2010-P3 tests.

I did not comment out any code.

Regards, Yunpeng

singhbalwinder commented 1 year ago

The only difference in the ERP tests is the _D flag. Maybe the test is failing with compiler optimizations turned on.

wlin7 commented 1 year ago

The NBFB restart issue (ERS test failures) introduced by #34 may be one of the reasons for the ERP failure.

singhbalwinder commented 1 year ago

Thanks, @wlin7! Yes, that may be it.

wlin7 commented 1 year ago

Merged to NGD_v3atm.