ESCOMP / CTSM

Community Terrestrial Systems Model (includes the Community Land Model of CESM)
http://www.cesm.ucar.edu/models/cesm2.0/land/

Single-point mode incompatible with parallel I/O NetCDF? Suggestion to update documentation. #614

Open · thiagoveloso opened this issue 5 years ago

thiagoveloso commented 5 years ago

I recently ported CLM5 to my department's Linux cluster and to my Mac Pro, intending to run a single-point case for the US-UMB Fluxnet site as described in section 1.7.2.1 of the CLM5 User's Guide (https://escomp.github.io/ctsm-docs/doc/build/html/users_guide/running-PTCLM/introduction-to-ptclm.html#details-of-ptclmmkdata).

On both machines the case compiled successfully but would not run, producing only vague log messages (further detailed in https://github.com/ESMCI/cime/issues/2929#issuecomment-452160826).

It turned out the crash was caused by the HDF5 and NetCDF libraries having been compiled with parallel I/O support. Once I rebuilt them with no parallel I/O support whatsoever, the case ran just fine.
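In case it helps anyone else, this is roughly what the serial rebuild looks like. It is only a sketch: the version numbers, install prefix, and compilers below are placeholders for whatever your system actually uses.

# Sketch of a serial (no parallel I/O) stack; versions, paths, and compilers are placeholders.
# HDF5: simply omit --enable-parallel and configure with a plain (non-MPI) compiler.
cd hdf5-1.10.x
CC=gcc FC=gfortran ./configure --prefix=$HOME/local/serial --enable-fortran
make && make install

# netcdf-c: linked against the serial HDF5 above, so it ends up without parallel I/O.
cd ../netcdf-c-4.x.y
CC=gcc CPPFLAGS=-I$HOME/local/serial/include LDFLAGS=-L$HOME/local/serial/lib \
    ./configure --prefix=$HOME/local/serial
make && make install

# netcdf-fortran: same prefix, so the model build sees one coherent serial stack.
cd ../netcdf-fortran-4.x.y
CC=gcc FC=gfortran CPPFLAGS=-I$HOME/local/serial/include LDFLAGS=-L$HOME/local/serial/lib \
    ./configure --prefix=$HOME/local/serial
make && make install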

Since I suspect many users will try to use the same libraries for single-point cases that they use for regional/global runs (as I did), I think it would be useful to add a note or warning about this to the User's Guide, perhaps on this page: https://escomp.github.io/ctsm-docs/doc/build/html/users_guide/running-single-points/running-pts_mode-configurations.html#running-in-a-single-processor.

serbinsh commented 5 years ago

Is the idea that having a build of the mpi-serial library on the machine, together with the ability to switch libraries in the xmlchange step, is what allows compiling CLM with parallel libraries while still running single-site cases? (A sketch of the switch I mean is below.) I ask because 1) I am struggling to get mpi-serial to compile due to a Makefile issue, and 2) I am trying to build a container that can run via ./case.submit as well as run the executable directly (to allow submitting the job from within a Docker container to the host queue manager).
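For concreteness, this is the standard CIME way to point an existing case at the serial MPI stub; it assumes mpi-serial is listed as a supported MPILIB in your config_machines.xml entry.

# Switch an existing case to the mpi-serial stub that ships with CIME.
./xmlchange MPILIB=mpi-serial
./case.setup --reset
./case.build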

Or perhaps I instead need a setup, as described above, that builds HDF5 and NetCDF in serial rather than parallel?

serbinsh commented 5 years ago

@thiagoveloso would you be willing to share your machine files for your non-parallel builds? I am trying to mirror your setup to test submitting ensembles of runs within Docker containers.

thiagoveloso commented 5 years ago

@serbinsh which files would you need specifically? The CLM code or the libraries I built in serial mode?

serbinsh commented 5 years ago

@thiagoveloso for now, just your versions of config_machines.xml and config_compilers.xml, plus perhaps a summary of how you built the libraries. Finally, how are you launching the model for your serial runs: running cesm.exe directly, or still using mpirun, just in serial?

Do those questions make sense?

We could take this specific discussion offline: sserbin@bnl.gov

thiagoveloso commented 5 years ago

@serbinsh Yeah, the questions make sense. Let me revisit my scripts and I'll get back to you right away. I am fairly sure the single-point simulations are run using mpirun with only one processor, though.

Again, I will double-check and write you an e-mail with the configuration files.

EDIT @serbinsh My bad, it runs cesm.exe directly. See the output of ./preview_run below:

thiagods@somewhere:~/cesm-cases/experiments/PTCLM5BGC$ ./preview_run
CASE INFO:
  nodes: 1
  total tasks: 1
  tasks per node: 1
  thread count: 1

BATCH INFO:
  FOR JOB: case.run
    ENV:
      Setting Environment OMP_NUM_THREADS=1
    SUBMIT CMD:
      None

MPIRUN:
    /no_backup/GroupData/CLM/scratch/PTCLM5BGC/bld/cesm.exe  >> cesm.log.$LID 2>&1
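For what it's worth, with no batch system and a single task the launcher collapses to a direct invocation, so reproducing the run by hand is essentially the following. The run-directory path is my assumption from the standard CIME layout (run/ alongside bld/), and $LID is the timestamped log ID that the case scripts normally set.

# Run cesm.exe directly from the (assumed) run directory; set LID by hand
# if you are outside the case scripts, e.g. LID=$(date +%y%m%d-%H%M%S).
cd /no_backup/GroupData/CLM/scratch/PTCLM5BGC/run
../bld/cesm.exe >> cesm.log.$LID 2>&1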
glemieux commented 5 years ago

Just wanted to note that I ran into this problem on the FATES development workstation, Lobata, as well. Lobata has two Xeon processors, so I assumed I would need to set up the HDF5 and NetCDF builds with parallel support (including PnetCDF). Running tests and cases with OpenMPI worked just fine, but trying to run serially produced the same error reported in the original thread.

~~Weird thought from all this: I'm wondering if this could have been avoided if I had installed Slurm as a single-node server install?~~ UPDATE: in talking with @ekluzek, he noted that environment modules help handle the loading of the specific libraries (as found in the env_mach_specific.xml file).
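To illustrate what that per-case module handling amounts to, here is a hedged sketch; the module names are placeholders for whatever a given site actually provides.

# Per-case environment setup, normally driven by env_mach_specific.xml.
# Module names below are hypothetical; substitute your site's modules.
module purge
module load gcc/9.3.0              # hypothetical compiler module
module load netcdf/4.7.4-serial    # hypothetical serial NetCDF/HDF5 stack
# A parallel case would instead load an MPI module and a parallel NetCDF build.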