Closed xylar closed 1 year ago
Note: On Perlmutter, use the head of compass. On Chicoma, use the xylar/add_chicoma-cpu branch.
Let's see if https://github.com/E3SM-Project/mache/issues/100 happens to fix this as a first change. We should be able to test this by just adding:
export FI_CXI_RX_MATCH_MODE=software
export MPICH_COLL_SYNC=MPI_Bcast
manually to the load script.
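For anyone reproducing this, the manual edit above could be scripted roughly as follows. This is only a sketch: `LOAD_SCRIPT` is a hypothetical variable that should point at whatever load script you normally source, and it defaults to a throwaway temp file here so the snippet can be run safely as-is.

```shell
# LOAD_SCRIPT is a hypothetical placeholder; set it to the load script you
# actually source. It defaults to a temp file so nothing real is modified.
LOAD_SCRIPT="${LOAD_SCRIPT:-$(mktemp)}"

# Append each setting only if that exact line is not already in the script.
for setting in "FI_CXI_RX_MATCH_MODE=software" "MPICH_COLL_SYNC=MPI_Bcast"; do
    grep -qxF "export $setting" "$LOAD_SCRIPT" || echo "export $setting" >> "$LOAD_SCRIPT"
done

# Source the script and confirm that both variables are set in this shell.
. "$LOAD_SCRIPT"
echo "FI_CXI_RX_MATCH_MODE=$FI_CXI_RX_MATCH_MODE"
echo "MPICH_COLL_SYNC=$MPICH_COLL_SYNC"
```

Because of the `grep` check, running it twice does not duplicate the lines, so it is safe to rerun after regenerating the load script.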
At this point, I'm not seeing the PIO error, but the EC test is just hanging on Chicoma.
@mark-petersen, as I test #555, this and the probably related issue #500 are really giving me trouble. I could use some help debugging them.
In every case where I'm seeing these issues, it's with Gnu compilers (not sure whether that's a coincidence). It shows up as a PIO error in some cases and just as a hang in others.
This issue makes the pr test suite not useful on Perlmutter and Chicoma at all, and limits its usefulness on other machines where Gnu isn't our primary compiler but where we do want to fully support it.
The latest example of this on Perlmutter can be found at:
/pscratch/sd/x/xylar/compass_1.2/test_20230310/ocean_pr2/ocean/global_ocean/EC30to60/PHC/performance_test/forward
/pscratch/sd/x/xylar/compass_1.2/test_20230310/ocean_pr2/ocean/global_ocean/ECwISC30to60/PHC/performance_test/forward
I think this issue is the same as https://github.com/MPAS-Dev/compass/issues/500. I just fixed the hang with https://github.com/E3SM-Project/E3SM/pull/5575. We can retest the pr suite with that to see if there remains a PIO issue.
As I commented at https://github.com/E3SM-Project/E3SM/pull/5575#issuecomment-1505092914, unfortunately I don't think that branch has fixed this problem, although it does seem to have fixed #500.
The pr suite runs on Perlmutter with the fix in https://github.com/E3SM-Project/E3SM/pull/5610. I believe we can close this as soon as that gets merged and I update the `E3SM-Project` submodule.
After the recent module changes on Perlmutter and Chicoma, I'm seeing PIO errors, but only for the EC performance tests:
This is on all cores except 0000.
See:
I tried changing the PIO layout, but that didn't make a difference. More debugging is needed.