ESCOMP / CTSM

Community Terrestrial Systems Model (includes the Community Land Model of CESM)
http://www.cesm.ucar.edu/models/cesm2.0/land/

Problems with early versions of PIO2 in CLM #124

Closed ekluzek closed 4 years ago

ekluzek commented 6 years ago

Erik Kluzek < erik > - 2015-12-10 12:07:05 -0700 Bugzilla Id: 2256 Bugzilla Depends: 1730, Bugzilla CC: andre, jedwards, mvertens, sacks,

Most CLM tests work fine when CIME is updated to a version that uses PIO2, but several have problems. One problem is a hang when creating files.

Here is a list of tests that fail with PIO2 in clm4_5_6_r159

ERP_D_P4x30_Ld5.ne30_g16.ICN.yellowstone_intel.clm-40default
ERP_D_Ld5.f19_g16.ICRUCLM50BGC.yellowstone_intel.clm-fire_emis
ERP_D_Ld5.hcru_hcru.ICRUCN.yellowstone_pgi.clm-40default
SMS_D_Ld5_Mmpi-serial.5x5_amazon.ICLM45ED.yellowstone_pgi.clm-edTest
ERS_P192x1_Ld211.f19_g16.ICNDVCROP.yellowstone_intel.clm-crop
ERI_Ld9.ne30_g16.I4804.yellowstone_pgi.clm-40default

This is with cime4.3.1 which uses PIO2.0.27.

ekluzek commented 6 years ago

Erik Kluzek < erik > - 2015-12-10 12:09:49 -0700

Removing the workaround (avoid_pnetcdf) in bug 1730 removes some issues, but not all.

ekluzek commented 6 years ago

Erik Kluzek < erik > - 2015-12-10 12:11:27 -0700

Another workaround is to use PIO1 as follows...

cd cime/externals
mv pio pio.2
git clone https://github.com/NCAR/ParallelIO.git pio
cd pio
git checkout pio1_9_23

The ERP_D_Ld5.f19_g16.ICRUCLM50BGC.yellowstone_intel.clm-fire_emis test was shown to work with pio1.9.23.

ekluzek commented 6 years ago

Bill Sacks < sacks > - 2015-12-10 12:17:13 -0700

(In reply to Erik Kluzek from comment #1)

Removing the workaround (avoid_pnetcdf) in bug 1730 removes some issues, but not all.

Does this point to a broader problem in PIO2? i.e., why does PIO2 not like it when you use netcdf for some files? Is this a problem with the netcdf interface in general, or just when you have some files that use pnetcdf and some that use netcdf? e.g., if you set the pio type to netcdf for everything, would things work fine in these cases?

ekluzek commented 6 years ago

Jim Edwards < jedwards > - 2015-12-10 12:44:02 -0700

The problem is that in pio2 we have two rearranger methods instead of just one. The default rearranger is subset (the new one), which improves pnetcdf performance but hurts serial netcdf performance, so if you want to use netcdf you should use the box rearranger. My sandbox now appears to work without being forced to use serial netcdf for the CLM history file - this will be in cime4.3.2.
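Jim's explanation suggests picking the rearranger explicitly when writing through serial netcdf. In CIME-based cases this is normally done from the case directory with `xmlchange`; a hedged sketch, assuming the `PIO_REARRANGER` XML variable and numeric codes used by CIME of this era (1 = box, 2 = subset):

```shell
# In the case directory: force the box rearranger, which is the
# recommendation above when output goes through serial netcdf
# (subset is the pnetcdf-oriented default).
./xmlchange PIO_REARRANGER=1
```
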

ekluzek commented 6 years ago

Erik Kluzek < erik > - 2015-12-10 14:48:59 -0700

A bunch of tests fail on hobart as well, and it looks like it's this problem (a timeout that happens after the simulation is finished, while it's writing a bunch of output).

RUN ERI_D_Ld9_P24x1.f10_f10.ICRUCLM50BGC.hobart_nag.clm-reduceOutput.C.151208-160543
RUN ERP_Ld5_P24x1.f10_f10.ICRUCLM50BGC.hobart_nag.clm-reduceOutput.C.151208-160543
RUN ERP_Ld5_P24x1.f10_f10.ICRUCLM50BGC.hobart_nag.clm-noFUN_flexCN.C.151208-160543
RUN SMS.f10_f10.IRCP45CN.hobart_pgi.clm-reduceOutput.C.151208-160543
RUN SMS_Ld5_D_P24x1.f10_f10.IRCP45CLM45BGC.hobart_nag.clm-decStart.C.151208-160543
RUN ERP_D_Ld5_P24x1.f10_f10.ICLM45BGC.hobart_nag.clm-reduceOutput.C.151208-160543
RUN ERP_D_Ld5_P24x1.f10_f10.I1850CLM45BGC.hobart_nag.clm-ciso.C.151208-160543
RUN ERP_Ld5_P24x1.f10_f10.I1850CLM45BGC.hobart_nag.clm-default.C.151208-160543
RUN ERP_D_Ld5.f10_f10.ICN.hobart_pgi.clm-reduceOutput.C.151208-160543
RUN ERP_Ld5_P24x1.f10_f10.ICRUCLM50BGC.hobart_nag.clm-luna.C.151208-160543
RUN ERP_Ld5.f10_f10.I1850CN.hobart_pgi.clm-reduceOutput.C.151208-160543
RUN ERP_Ld5_P24x1.f10_f10.ICLM45BGC.hobart_nag.clm-reduceOutput.C.151208-160543
RUN ERI_D_Ld9_P24x1.f10_f10.ICLM45.hobart_nag.clm-SNICARFRC.C.151208-160543
RUN ERP_Ld5_P24x1.f10_f10.I1850CLM45BGC.hobart_nag.clm-ciso.C.151208-160543
RUN ERI_D_Ld9_P24x1.f10_f10.ICLM45BGC.hobart_nag.clm-reduceOutput.C.151208-160543
RUN SMS_Ld5_D_P24x1.f10_f10.IHISTCLM45BGC.hobart_nag.clm-clm50BGCmonthly.C.151208-160543
RUN ERP_D_P24x1.f10_f10.IHISTCLM45BGC.hobart_nag.clm-decStart.C.151208-160543
RUN ERS_D.f19_g16_gl5.IGRCP26CN.hobart_pgi.clm-reduceOutput.C.151208-160543
RUN ERP_Ld5_P24x1.f10_f10.ICRUCLM50BGC.hobart_nag.clm-flexCN_FUN.C.151208-160543
RUN ERP_D_Ld5_P24x1.f10_f10.I1850CLM45.hobart_nag.clm-o3.C.151208-160543
RUN SMS_D_Ld5_Mmpi-serial.5x5_amazon.ICLM45ED.hobart_nag.clm-edTest.C.151208-160543
RUN ERI_D_Ld9_P24x1.T31_g37.I1850CLM45.hobart_nag.clm-reduceOutput.C.151208-160543

I verified the timeout in SMS.f10_f10.IRCP45CN.hobart_pgi.clm-reduceOutput.C.151208-160543

I haven't looked at the others. mpi-serial tests with hobart_nag were successful. And it looks like bug 2213 was fixed on hobart_nag in r158.

ekluzek commented 6 years ago

Erik Kluzek < erik > - 2016-01-05 13:59:19 -0700

Looks like these issues get cleared up with cime4.3.9 (at least on yellowstone).

ekluzek commented 6 years ago

Erik Kluzek < erik > - 2016-01-07 17:27:35 -0700

OK, on hobart with clm4_5_7_r164 with cime4.3.9 I still have a list of failures due to the run taking too long (over 2 hours). All of these should finish in a much shorter time than that as they are short simulations. Other cases run to completion in a much shorter time.

ERI_D_Ld9_P24x1.f10_f10.ICRUCLM50BGC.hobart_nag.clm-reduceOutput.GC.160106-153439 =>> PBS: job killed: walltime 7215 exceeded limit 7200
ERP_Ld5_P24x1.f10_f10.ICRUCLM50BGC.hobart_nag.clm-reduceOutput.GC.160106-153439 =>> PBS: job killed: walltime 7207 exceeded limit 7200
ERP_Ld5_P24x1.f10_f10.ICRUCLM50BGC.hobart_nag.clm-noFUN_flexCN.GC.160106-153439 =>> PBS: job killed: walltime 7233 exceeded limit 7200
SMS_Ld5_D_P24x1.f10_f10.IRCP45CLM45BGC.hobart_nag.clm-decStart.GC.160106-153439 =>> PBS: job killed: walltime 7233 exceeded limit 7200
ERP_D_Ld5_P24x1.f10_f10.ICLM45BGC.hobart_nag.clm-reduceOutput.GC.160106-153439 =>> PBS: job killed: walltime 7207 exceeded limit 7200
ERP_D_Ld5_P24x1.f10_f10.I1850CLM45BGC.hobart_nag.clm-ciso.GC.160106-153439 =>> PBS: job killed: walltime 7232 exceeded limit 7200
ERP_Ld5_P24x1.f10_f10.I1850CLM45BGC.hobart_nag.clm-default.GC.160106-153439 =>> PBS: job killed: walltime 7233 exceeded limit 7200
ERP_Ld5_P24x1.f10_f10.ICRUCLM50BGC.hobart_nag.clm-luna.GC.160106-153439 =>> PBS: job killed: walltime 7241 exceeded limit 7200
ERP_Ld5_P24x1.f10_f10.ICLM45BGC.hobart_nag.clm-reduceOutput.GC.160106-153439 =>> PBS: job killed: walltime 7233 exceeded limit 7200
ERI_D_Ld9_P24x1.f10_f10.ICLM45.hobart_nag.clm-SNICARFRC.GC.160106-153439 =>> PBS: job killed: walltime 7204 exceeded limit 7200
ERP_Ld5_P24x1.f10_f10.I1850CLM45BGC.hobart_nag.clm-ciso.GC.160106-153439 =>> PBS: job killed: walltime 7232 exceeded limit 7200
ERI_D_Ld9_P24x1.f10_f10.ICLM45BGC.hobart_nag.clm-reduceOutput.GC.160106-153439 =>> PBS: job killed: walltime 7234 exceeded limit 7200
SMS_Ld5_D_P24x1.f10_f10.IHISTCLM45BGC.hobart_nag.clm-clm50BGCmonthly.GC.160106-153439 =>> PBS: job killed: walltime 7233 exceeded limit 7200
ERP_D_P24x1.f10_f10.IHISTCLM45BGC.hobart_nag.clm-decStart.GC.160106-153439 =>> PBS: job killed: walltime 7237 exceeded limit 7200
ERP_Ld5_P24x1.f10_f10.ICRUCLM50BGC.hobart_nag.clm-flexCN_FUN.GC.160106-153439 =>> PBS: job killed: walltime 7209 exceeded limit 7200
ERP_D_Ld5_P24x1.f10_f10.I1850CLM45.hobart_nag.clm-o3.GC.160106-153439 =>> PBS: job killed: walltime 7204 exceeded limit 7200
ERI_D_Ld9_P24x1.T31_g37.I1850CLM45.hobart_nag.clm-reduceOutput.GC.160106-153439 =>> PBS: job killed: walltime 7204 exceeded limit 7200

ekluzek commented 6 years ago

Jim Edwards < jedwards > - 2016-01-07 20:16:43 -0700

It appears that someone (Bill Sacks according to svn blame) commented out the initdecomp at line 2394 of ncdio_pio.F90.in and replaced it with the older PIO_REARR_BOX version. The variable LEVGRND_CLASS is causing initdecomp to hang when using PIO_REARR_BOX - this is a bug, but the immediate workaround is to replace this call with the PIO_REARR_SUBSET version.

ekluzek commented 6 years ago

Jim Edwards < jedwards > - 2016-01-07 20:29:49 -0700

(In reply to Jim Edwards from comment #8)

It appears that someone (Bill Sacks according to svn blame) commented out the initdecomp at line 2394 of ncdio_pio.F90.in and replaced it with the older PIO_REARR_BOX version. The variable LEVGRND_CLASS is causing initdecomp to hang when using PIO_REARR_BOX - this is a bug, but the immediate workaround is to replace this call with the PIO_REARR_SUBSET version.

I wonder if this also explains the degraded performance that Bill reported recently?

ekluzek commented 6 years ago

Erik Kluzek < erik > - 2016-01-07 22:54:09 -0700

(In reply to Jim Edwards from comment #8)

It appears that someone (Bill Sacks according to svn blame) commented out the initdecomp at line 2394 of ncdio_pio.F90.in and replaced it with the older PIO_REARR_BOX version. The variable LEVGRND_CLASS is causing initdecomp to hang when using PIO_REARR_BOX - this is a bug, but the immediate workaround is to replace this call with the PIO_REARR_SUBSET version.

Jim this was the commit of the pio2 branch that Bill brought to the CLM trunk in December of 2014.


r65959 | sacks | 2014-12-03 06:24:30 -0700 (Wed, 03 Dec 2014) | 1 line

merge changes from pio2_dev2 branch: update pio calls to pio2 API

I'll try a few with the PIO_REARR_SUBSET option and see if that goes.

ekluzek commented 6 years ago

Erik Kluzek < erik > - 2016-01-08 00:08:24 -0700

OK, giving more time to the job does NOT work, but using the SUBSET rearranger does! I'll see if the other cases that failed now work.

ekluzek commented 6 years ago

Bill Sacks < sacks > - 2016-01-08 05:52:02 -0700

Yes, the initdecomp change was actually Jim's change. I just brought it to the trunk for him. Jim made this change in revision 64202.

ekluzek commented 6 years ago

Jim Edwards < jedwards > - 2016-01-08 07:46:48 -0700

I extracted the decomp for variable LEVGRND_CLASS and ran it in the PIO standalone test suite. Not only does it work fine for both PIO_REARR_BOX and PIO_REARR_SUBSET, but there is also no notable difference in performance. I will continue to investigate.

ekluzek commented 6 years ago

Erik Kluzek < erik > - 2016-01-08 10:05:07 -0700

OK, changing the PIO rearranger to SUBSET allows all the previously failing hobart cases to run successfully.

ekluzek commented 6 years ago

Erik Kluzek < erik > - 2016-01-08 14:41:33 -0700

OK, I redid several tests on both hobart and yellowstone, and all but one ran OK. However, performance with SUBSET looks abysmal, so I'm not sure we want to use it for that reason alone. It looks to me like PIO2 performance for CLM is poor compared to PIO1, and subset is even worse.

But, the following KitchenSink test fails on yellowstone...

SMS_Lm1.f09_g16_gl5.IG1850CRUCLM50BGC.yellowstone_intel.clm-clm50KitchenSink

with the following error...

601:Open file /glade/p/cesm/lmwg/atm_forcing.datm7.cruncep_qianFill.0.5d.V4.c130305/TPHWL6Hrly/clmforc.cruncep.V4.c2011.0.5d.TPQWL.1901-01.nc 0
1:Abort(1) on node 1 (rank 1 in comm 1140850688): Fatal error in MPI_Recv: Message truncated, error stack:
1:MPIDI_Buffer_copy(73): Message truncated; 24 bytes received but buffer size is 12
1:INFO: 0031-306 pm_atexit: pm_exit_value is 1.
INFO: 0031-251 task 1 exited: rc=1
ERROR: 0031-300 Forcing all remote tasks to exit due to exit code 1 in task 1

ekluzek commented 6 years ago

Erik Kluzek < erik > - 2016-01-12 12:06:57 -0700

OK, clm4_5_7_r164 updates to cime4.3.9 and also uses the LND_PIO_REARRANGER setting rather than hardcoding the rearranger in the CLM source (SUBSET for clm40 and BOX for clm45/clm50). The default for LND_PIO_REARRANGER is the same as before. Our hobart testing sets SUBSET for the clm45/clm50 tests that previously failed, so testing works.

ekluzek commented 6 years ago

Bill Sacks < sacks > - 2016-01-28 11:17:44 -0700

In my branch off of r164, these tests take > 10 hours to complete. I am putting them in the xfail list since I don't typically allow that much time for tests in the test suite:

ERP_D_P4x30_Ld5.ne30_g16.ICN.yellowstone_intel.clm-40default
ERS_P192x1_Ld211.f19_g16.ICNDVCROP.yellowstone_intel.clm-crop

ekluzek commented 6 years ago

Bill Sacks < sacks > - 2016-03-17 14:07:28 -0600

For the workaround in comment 2 (using pio1) to work for me (on hobart-nag), I needed to set PIO_REARRANGER to 1; it didn't work to set LND_PIO_REARRANGER to 1.
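A sketch of the distinction Bill describes, assuming CIME's `xmlchange` interface; the variable names come from the comment above, and the numeric code 1 is assumed to mean the box rearranger:

```shell
# In the case directory: what worked here was setting the global value...
./xmlchange PIO_REARRANGER=1
# ...whereas setting only the land component's override did not take effect:
# ./xmlchange LND_PIO_REARRANGER=1
```
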

ekluzek commented 6 years ago

Bill Sacks < sacks > - 2016-03-17 14:11:33 -0600

This test now fails consistently with the pio2 version in CLM, in my branch slated to become r173:

ERP_Ly5.1x1_numaIA.ICRUCLM50BGCCROP.hobart_nag.clm-monthly

It looks like it's dying when writing the .rh1 file.

It passes with pio1, using the workarounds documented in comment 2 and comment 18. A debug version of that test passes with pio2, and both production and debug versions pass with pio2 with all 3 yellowstone compilers. I'm not sure why this started failing all of a sudden.

In addition, this test fails about half the time now; again, I can't tell why the changes on my branch would trigger these sporadic failures:

ERP_Ld5_P24x1.f10_f10.I1850CLM45BGC.hobart_nag.clm-default

When it fails, it seems to be in writing the .h1 file. Oddly, one traceback pointed to a death in the pnetcdf library, despite the fact that there was a message from CLM saying that it was using the workaround for bug 1730: using netcdf rather than pnetcdf.

ekluzek commented 6 years ago

Erik Kluzek < erik > - 2016-06-17 16:52:26 -0600

In clm4_5_8_r181 you can now choose to use PIO1 or PIO2 and PIO1 is the default.
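Assuming the choice is exposed through CIME's `PIO_VERSION` XML variable (as in CIME releases of this era), switching a case away from the PIO1 default would look something like:

```shell
# In the case directory: select the PIO major version (1 is the default noted above).
./xmlchange PIO_VERSION=2
```
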

ekluzek commented 5 years ago

I think this is likely not a problem anymore, as both CLM and PIO2 have progressed. We should run the latest test list with PIO2 on hobart and cheyenne, though, just to verify that we don't have problems. CESM does want to move to PIO2.

billsacks commented 5 years ago

Let's wait to test this until we have the go-ahead from Jim with a suggestion that pio2 should work well now for all CESM use cases and that we want to move to it.

billsacks commented 5 years ago

A fix for PIO2 in DEBUG mode was brought in in ctsm1.0.dev070 (see #810) - though note that I haven't actually run tests with PIO2. I'm not sure whether there are other outstanding problems that will still need to be resolved.

billsacks commented 4 years ago

See #1029 and #1030 for some recent issues with pio2. These arose from testing mpi-serial cases with pio2; that run of the test suite used pio1 for non-mpi-serial cases, so I'm not sure if we would have found other problems in other tests.

billsacks commented 4 years ago

These issues seem to be resolved now (see https://github.com/ESCOMP/CTSM/pull/1095)