LLNL / scr

SCR caches checkpoint data in storage on the compute nodes of a Linux cluster to provide a fast, scalable checkpoint / restart capability for MPI codes.
http://computing.llnl.gov/projects/scalable-checkpoint-restart-for-mpi

SCR Fortran program fails to link with IBM xlf #287

Open adammoody opened 3 years ago

adammoody commented 3 years ago

Trying a build with the IBM XL compilers, I get an error when the SCR build links test_ckpt.F:

export CC="xlc"

export CFLAGS="-g -O0"
export depsinstalldir=${SCR_PKG}/install
cmake \
  -DCMAKE_INSTALL_PREFIX=${SCR_INSTALL} \
  -DCMAKE_BUILD_TYPE=Debug \
  -DCMAKE_VERBOSE_MAKEFILE=true \
  -DSCR_RESOURCE_MANAGER=LSF \
  -DWITH_DTCMP_PREFIX=$depsinstalldir \
  -DWITH_SPATH_PREFIX=$depsinstalldir \
  -DWITH_KVTREE_PREFIX=$depsinstalldir \
  -DWITH_AXL_PREFIX=$depsinstalldir \
  -DWITH_RANKSTR_PREFIX=$depsinstalldir \
  -DWITH_REDSET_PREFIX=$depsinstalldir \
  -DWITH_SHUFFILE_PREFIX=$depsinstalldir \
  -DWITH_ER_PREFIX=$depsinstalldir \
  ${SCR_PKG}
make

[100%] Linking Fortran executable test_ckpt_F
xlf  -Wl,-export-dynamic -qthreaded -qhalt=e -g CMakeFiles/test_ckpt_F.dir/test_ckpt.F.o  -o test_ckpt_F  <snip> ../src/libscrf.so 

CMakeFiles/test_ckpt_F.dir/test_ckpt.F.o: In function `test_ckpt_f':
scr/examples/test_ckpt.F:36: undefined reference to `scr_init'
scr/examples/test_ckpt.F:43: undefined reference to `scr_start_restart'
scr/examples/test_ckpt.F:49: undefined reference to `scr_route_file'
scr/examples/test_ckpt.F:58: undefined reference to `scr_complete_restart'
scr/examples/test_ckpt.F:76: undefined reference to `scr_start_checkpoint'

I think this is showing up because our src/scrf.c file that defines our Fortran interface includes a trailing underscore on the function names, but XLF does not add that trailing underscore by default.

bash-4.2$ nm src/libscrf.so | grep scr_init
000000000000de00 T scr_init_
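
The same check can be scripted so that every Fortran entry point is inspected at once. The following is a minimal sketch, not part of SCR: the helper names `defined_symbols` and `mangling_of` are hypothetical, and it assumes the default three-column `nm` output shown above (address, type, name).

```python
import re

def defined_symbols(nm_output):
    """Return the names that nm lists as defined text symbols (type T or t)."""
    symbols = []
    for line in nm_output.splitlines():
        fields = line.split()
        # Defined symbols have three fields: address, type, name.
        # Undefined symbols ("U name") have only two and are skipped.
        if len(fields) == 3 and fields[1] in ("T", "t"):
            symbols.append(fields[2])
    return symbols

def mangling_of(symbol, base):
    """Describe how `symbol` decorates routine `base`: (case, trailing underscores).

    Returns None when `symbol` is not a decorated form of `base`.
    """
    match = re.fullmatch(re.escape(base) + r"(_*)", symbol, flags=re.IGNORECASE)
    if match is None:
        return None
    stem = symbol[:len(base)]
    if stem.islower():
        case = "lower"
    elif stem.isupper():
        case = "upper"
    else:
        case = "mixed"
    return case, len(match.group(1))

# Sample line copied from the nm output in this comment.
sample = "000000000000de00 T scr_init_"
for name in defined_symbols(sample):
    print(name, mangling_of(name, "scr_init"))  # scr_init_ ('lower', 1)
```

A result of `('lower', 1)` matches the diagnosis here: the library exports lowercase names with one trailing underscore, while the link errors show xlf asking for the undecorated `scr_init`.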

As a workaround, adding -qextname seems to help:

export FFLAGS="-qextname"

robertkb commented 3 years ago

Here is my Lassen build procedure for cmake builds. The _r compiler variants are optional. I'm not sure they matter at all here, but they may be needed for some application codes.

$ export CC=xlc_r
$ export CXX=xlc++_r
$ export F90=xlf90_r
$ export FFLAGS=-qextname
$ ${SRC_DIR}/bootstrap.sh
$ cd build
$ cmake -DCMAKE_INSTALL_PREFIX=${WORK_DIR}/install -DCMAKE_PREFIX_PATH=${WORK_DIR}/install -DSCR_RESOURCE_MANAGER=LSF ${SRC_DIR}
$ make
$ make install

Here is my spack build, including the examples. It requires some editing of the spack compiler find output to get the Fortran compiler selection to work. For CI/CD purposes, I put that into a Python script. To allow the same packages.yaml to be used on Lassen as well as with the other compilers, I have to specifically request spectrum-mpi as the compiler; otherwise it tries to use gcc. The gcc compilers would presumably work, but I assume applications on an IBM machine might prefer the IBM compiler.

$ git clone --depth 1 https://github.com/spack/spack.git
$ cd spack
$ . share/spack/setup-env.sh
$ spack compiler find --scope site
$ mv ../../fixcompiler.py .
$ cat fixcompiler.py
$ EDITOR="python fixcompiler.py" spack config --scope site edit compilers
$ spack install --keep-stage scr@develop%xl_r resource_manager=LSF fflags="-qextname" ^spectrum-mpi
$ spack cd -i scr
$ cd share/scr/examples
$ export MPIF90=`which mpifort`
$ export F90FLAGS="-qzerosize -qextname"
$ make

Here is fixcompiler.py

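import os, sys, yaml

# Locate the IBM Fortran drivers. Note: these paths are not used below;
# the bare names 'xlf' and 'xlf_r' are written to the config instead.
xlf = os.popen('which xlf').read().strip()
xlfr = os.popen('which xlf_r').read().strip()

# The editor command receives the config file path as its first argument.
cfile = sys.argv[1]

with open(cfile, 'r') as f:
    compilers = yaml.safe_load(f)

# Set the Fortran compiler entry (fc) for each IBM XL compiler spec.
for elem in compilers['compilers']:
    for k, v in elem.items():
        if 'xl@' in v['spec']:
            v['paths']['fc'] = 'xlf'
        elif 'xl_r@' in v['spec']:
            v['paths']['fc'] = 'xlf_r'

with open(cfile, 'w') as f:
    f.write(yaml.dump(compilers))

robertkb commented 3 years ago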

I don't remember all the details now, but I think the point of fixcompiler.py is that spack compiler find picks the wrong IBM Fortran compiler version, so it gets F77 rather than F90 features (or vice versa, I forget). The IBM compiler is particular about mixing F77 and F90 code, so the spack build fails. The example program has a related issue: it contains a mix of F77 and F90 features that the xlf* compilers do not like, but selecting the right xlf version with some flags makes it work.

adammoody commented 3 years ago

Thanks for finding and sharing all of those workarounds, @robertkb. That's a bunch of work to have figured all of that out!

For the v3.0 release, we should definitely try to clean up and fix things. A couple of specific work items I see so far:

1) We should find a better solution than requiring users to set -qextname. This adds trailing underscores so that they can link with libscrf, since we have trailing underscores. However, that's annoying, because I think it would require that people use trailing underscores on their whole application build. Maybe we need to create a second libscrf.so which does not have trailing underscores, or we could provide a new configure option in SCR that disables them, or we could pull the scrf.c out of the SCR build and let the user build that themselves.

2) We should check spack compiler find again. If there is still a problem, we can work with the spack team to figure out what's broken. If it's a matter of scr examples using mixed F77 and F90 in one file, we can pick one.

Don't feel you have to do all of this. I'm just trying to list items in one place. We can spread the work among the team.

adammoody commented 3 years ago

Useful references:

We can also check the source of the MPI libraries and the Fortran interfaces defined in the latest MPI standard for ideas. In MVAPICH, some example code that implements Fortran wrappers can be seen in src/binding/fortran.

adammoody commented 3 years ago

My first thought was that we could just define a CMake option to let one choose how many underscores to append to the symbol names defined in scrf.c. However, one potential problem is that if one disables underscores, as xlf needs, there may be symbol conflicts. There are some internal functions in libscr.so that use all-lowercase names, like scr_route_file and scr_start_output. We'd need to make sure the Fortran program doesn't try to link with those instead.

Another option would be to also change the Fortran interface to use a leading SCRF instead of SCR, so they would be name-spaced differently.
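
The conflict described above can be checked mechanically. The following is a minimal sketch, not SCR code: `mangle` and `conflicts` are hypothetical helpers, the Fortran routine names are taken from the link errors earlier in this thread, and the `scrf_` prefix is only one possible spelling of the "leading SCRF" idea.

```python
def mangle(name, underscores=1, upper=False):
    """Model a Fortran external name: fold the case, then append underscores."""
    folded = name.upper() if upper else name.lower()
    return folded + "_" * underscores

def conflicts(fortran_names, c_symbols, underscores=1, upper=False):
    """Return the mangled Fortran names that collide with existing C symbols."""
    mangled = {mangle(n, underscores, upper) for n in fortran_names}
    return sorted(mangled & set(c_symbols))

# Fortran-callable routines, as named in the link errors above.
fortran_api = ["scr_init", "scr_route_file", "scr_start_checkpoint"]
# Internal lowercase C functions mentioned in this comment.
internal_c = ["scr_route_file", "scr_start_output"]

# Current scheme: one trailing underscore, no clash.
print(conflicts(fortran_api, internal_c, underscores=1))  # []
# Underscores disabled (what xlf expects by default): clash.
print(conflicts(fortran_api, internal_c, underscores=0))  # ['scr_route_file']
# Hypothetical rename to an scrf_ prefix: no clash even without underscores.
renamed = [n.replace("scr_", "scrf_", 1) for n in fortran_api]
print(conflicts(renamed, internal_c, underscores=0))      # []
```

In a real check, `internal_c` would come from the symbols that libscr.so actually exports rather than a hand-written list.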

adammoody commented 3 years ago

@rhaas80, here is the open issue we have on the Fortran symbol names. I wanted to at least bring you into this thread since you have experience here.

Based on what you said, it sounds like we could stick with the existing Fortran names. For that, I think we'd need to:

1) Create a new cmake build option so the user can pick whether they want trailing underscores or not, and perhaps how many, e.g., one vs. two.

2) Change our internal scr functions, including scr_route_file and scr_start_output, so as not to conflict with potential lowercase Fortran names.

Does it make sense to attempt a single libscrf.so that provides all common variants for Fortran symbols, or is that crazy talk?

scr_init
scr_init_
scr_init__
SCR_INIT
SCR_INIT_
SCR_INIT__

With that, maybe we could skip the cmake option.
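
To get a feel for how many symbols the "all variants" approach would add, the list above can be generated for any routine name. This is a minimal sketch with a hypothetical helper, `symbol_variants`. It only enumerates names and assumes the six case and underscore combinations listed above are the ones of interest.

```python
from itertools import product

def symbol_variants(name, max_underscores=2):
    """List lowercase and uppercase forms of `name` with 0..max_underscores trailing underscores."""
    variants = []
    for fold, count in product((str.lower, str.upper), range(max_underscores + 1)):
        variants.append(fold(name) + "_" * count)
    return variants

print(symbol_variants("scr_init"))
# ['scr_init', 'scr_init_', 'scr_init__', 'SCR_INIT', 'SCR_INIT_', 'SCR_INIT__']
```

Each Fortran routine would then contribute six exported names. The undecorated lowercase variant is the one that can collide with internal functions such as scr_route_file, so that concern remains even without a cmake option.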

rhaas80 commented 3 years ago

https://github.com/LLNL/scr/pull/327 has a draft solution, mostly to see if that is OK or if we want/need something more autoconf-like. Right now it just looks at the compiler ID and will not, e.g., consider compiler options for IBM XL.