MPAS-Dev / compass

Configuration Of MPAS Setups

EC(wISC)30to60 performance tests are failing on Perlmutter and Chicoma #497

Closed xylar closed 1 year ago

xylar commented 1 year ago

After the recent module changes on Perlmutter and Chicoma, I'm seeing PIO errors but only for the EC performance tests:

ERROR: MPAS IO Error: Bad return value from PIO
CRITICAL ERROR: Core init failed for core ocean

This is on all cores except 0000.

See:

/pscratch/sd/x/xylar/compass_1.2/test_20230111/ocean_pr/ocean/global_ocean/EC30to60/PHC/performance_test/forward

I tried changing the PIO layout but that didn't make a difference. More debugging is needed.

mark-petersen commented 1 year ago

Note: On Perlmutter, use the head of compass. On Chicoma, use the xylar/add_chicoma-cpu branch.

xylar commented 1 year ago

Let's see if https://github.com/E3SM-Project/mache/issues/100 happens to fix this as a first change. We should be able to test this by just adding:

export FI_CXI_RX_MATCH_MODE=software
export MPICH_COLL_SYNC=MPI_Bcast

manually to the load script.
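
A minimal sketch of that manual test follows. The load script name is hypothetical (substitute whichever script compass generated for your machine and compiler); only the two `export` lines come from this thread.

```shell
# Hypothetical load script name (an assumption, not from the thread);
# override LOAD_SCRIPT to point at the script generated for your machine.
LOAD_SCRIPT="${LOAD_SCRIPT:-./load_compass_env.sh}"

# Source the load script if it exists so modules and paths are set first.
if [ -f "$LOAD_SCRIPT" ]; then
    . "$LOAD_SCRIPT"
fi

# The two settings proposed above, applied after the load script has run.
export FI_CXI_RX_MATCH_MODE=software
export MPICH_COLL_SYNC=MPI_Bcast

# Confirm both variables are present in the environment the job will inherit.
env | grep -E '^(FI_CXI_RX_MATCH_MODE|MPICH_COLL_SYNC)='
```

Appending the two `export` lines to the end of the load script itself should have the same effect and avoids retyping them in each new shell.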

xylar commented 1 year ago

At this point, I'm not seeing the PIO error but the EC test is just hanging on Chicoma.

xylar commented 1 year ago

@mark-petersen, as I test #555, this and the probably related issue #500 are really giving me trouble. I could use some help debugging them.

In every case where I'm seeing these issues, it's with the GNU compilers (not sure if that's a coincidence or not). It shows up as a PIO error in some cases and just as a hang in others.

xylar commented 1 year ago

This issue makes the pr test suite not useful on Perlmutter and Chicoma at all, and limits its usefulness on other machines where GNU isn't our primary compiler but where we do want to fully support it.

xylar commented 1 year ago

The latest example of this on Perlmutter can be found at:

/pscratch/sd/x/xylar/compass_1.2/test_20230310/ocean_pr2/ocean/global_ocean/EC30to60/PHC/performance_test/forward
/pscratch/sd/x/xylar/compass_1.2/test_20230310/ocean_pr2/ocean/global_ocean/ECwISC30to60/PHC/performance_test/forward

mark-petersen commented 1 year ago

I think this issue is the same as https://github.com/MPAS-Dev/compass/issues/500. I just fixed the hang with https://github.com/E3SM-Project/E3SM/pull/5575. We can retest the pr suite with that to see if there remains a PIO issue.

xylar commented 1 year ago

As I commented here https://github.com/E3SM-Project/E3SM/pull/5575#issuecomment-1505092914, unfortunately, I don't think that branch has fixed this problem, although it does seem to have fixed #500.

xylar commented 1 year ago

The pr suite runs on Perlmutter with the fix in https://github.com/E3SM-Project/E3SM/pull/5610. I believe we can close this as soon as that gets merged and I update the `E3SM-Project` submodule.