NOAA-EMC / hpc-stack

Create a software stack for HPC's
GNU Lesser General Public License v2.1

netcdf and hdf5 installation configuration changes causing problems #119

Closed edwardhartnett closed 3 years ago

edwardhartnett commented 3 years ago

@WenMeng-NOAA reports:

These netcdf and hdf5 installation configuration changes really frustrate me. I was advised to add the linking option "-L$(NETCDF)/lib -lnetcdff -lnetcdf -L$(HDF5_LDFLAGS) $(Z_LIB)" to the makefile for the WCOSS netcdf and hdf5 installed at the operational library site. To test with the hpc-stack test version, the makefile was changed to use the option "-L$(NETCDF)/lib -lnetcdff -lnetcdf". Then new errors came up when testing the official version of hpc-stack on hera. I spent a lot of time tweaking the makefile back and forth. I wish to get a stable and consistent version of the hpc-stack libraries. I think it would also benefit other downstream applications with GNC build capacity.

GeorgeVandenberghe-NOAA commented 3 years ago

This is entirely on us in HPC-STACK. Make decisions on how we will build the various dependencies and what we will set in their modules, stick to them and document and publish them. I've mentioned this before in many other discussion threads.

kgerheiser commented 3 years ago

Not sure where HDF5_LDFLAGS came from.

The test version, it appears, was using shared libraries. On NOAA HPC systems we build static libraries so just linking to NetCDF isn't sufficient.

I would do -L$HDF5_ROOT/lib -lhdf5_hl -lhdf5 and that will work

WenMeng-NOAA commented 3 years ago

HDF5_LDFLAGS is an environment variable set for the hdf5 installed at the WCOSS operational library site. It is EIB's decision whether or not to set this variable. It is important to our downstream users that hpc-stack has a stable and consistent installation configuration, so that we can avoid a lot of modifications with each hpc-stack upgrade.

kgerheiser commented 3 years ago

-L$HDF5_ROOT/lib -lhdf5_hl -lhdf5 will work now and forever.

It's unfortunate that the test version was using shared libs and it broke your build because we specifically use static libraries on NOAA systems.

edwardhartnett commented 3 years ago

@WenMeng-NOAA is the problem you are having with the legacy build system for UPP? In other words, not the CMake build?

WenMeng-NOAA commented 3 years ago

Yes, that's the problem for legacy build system. I may test with cmake build later.

edwardhartnett commented 3 years ago

My understanding is that the legacy build system will be retired. When will that occur?

WenMeng-NOAA commented 3 years ago

@kgerheiser With the option "-L$(NETCDF)/lib -lnetcdff -lnetcdf -L${HDF5_ROOT}/lib -lhdf5_hl -lhdf5", I got errors as:

ld: /scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/intel-18.0.5.274/impi-2018.0.4/hdf5/1.10.6/lib/libhdf5.a(H5Zdeflate.o): undefined reference to symbol 'compress2'
//usr/lib64/libz.so.1: error adding symbols: DSO missing from command line
make: *** [ncep_post] Error 1

Please advise the fix. Thanks!

WenMeng-NOAA commented 3 years ago

@edwardhartnett Retiring GNC build capacity is on my to-do list after we complete switching the UPP dependency libraries to the hpc-stack. A lot of UPP developers rely on GNC build capacity right now. I would like to give them a smooth transition.

kgerheiser commented 3 years ago

@WenMeng-NOAA

You need to also link to zlib. -L$ZLIB_ROOT -lz after HDF5

WenMeng-NOAA commented 3 years ago

I set option as: -L$(NETCDF)/lib -lnetcdff -lnetcdf -L${HDF5_ROOT}/lib -lhdf5_hl -lhdf5 -L${ZLIB_ROOT} -lz

Now the executable was successfully built.
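For reference, the ordering in the working option follows the usual rule for static archives: the linker resolves symbols left to right, so each library has to appear before the libraries it depends on. That is why libhdf5.a's reference to compress2 was only resolved once -lz came after -lhdf5. A minimal shell sketch of assembling the same flags is below. The function name link_libs and the example prefixes are hypothetical, and the real values would come from the loaded modules.

```shell
#!/bin/sh
# Hypothetical helper: print the static link flags in dependency order.
# Arguments are the install prefixes for netcdf, hdf5 and zlib.
link_libs() {
    netcdf_root=$1
    hdf5_root=$2
    zlib_root=$3
    # netcdff depends on netcdf, netcdf on hdf5_hl/hdf5, and hdf5 on z,
    # so dependents come first and zlib comes last.
    printf '%s\n' "-L${netcdf_root}/lib -lnetcdff -lnetcdf -L${hdf5_root}/lib -lhdf5_hl -lhdf5 -L${zlib_root} -lz"
}

# Example call with made-up prefixes.
link_libs /opt/netcdf /opt/hdf5 /opt/zlib
```

In a makefile the same string would be built from $(NETCDF), ${HDF5_ROOT} and ${ZLIB_ROOT}, as in the option above.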

WenMeng-NOAA commented 3 years ago

Another issue comes out. The environment variable CRTM_FIX in crtm/2.3.0 module is required for runtime. @Hang-Lei-NOAA Can you add it?

kgerheiser commented 3 years ago

What does CRTM_FIX point to?

WenMeng-NOAA commented 3 years ago

CRTM_FIX should point to the fix files directory of crtm. CRTM_SRC, which points to the source code directory, is also needed. These two environment variables are important for debugging UPP issues with simulated satellite radiance. If you look at the crtm module installed at the WCOSS operational site or the non-hpc-stack libraries on hera, these two variables are set.

arunchawla-NOAA commented 3 years ago

@WenMeng-NOAA are the issues you are having with using hpc-stack resolved? I am assuming these are problems only when trying to build UPP as a stand alone code and the in line post library itself is building ok since it uses cmake?

Hang-Lei-NOAA commented 3 years ago

@WenMeng-NOAA The crtm_fix is not included in the hpc-stack installation. It is not a standard hpc-stack solution. Therefore, I cannot set up the variable. You can include the crtm_fix files as a part of your code, as a solution.

WenMeng-NOAA commented 3 years ago

@arunchawla-NOAA Yes, the issues I reported are from the UPP standalone tests. I would assume the in-line post is fine.

WenMeng-NOAA commented 3 years ago

@Hang-Lei-NOAA A crtm library installed under the hpc-stack without the fix files and source code path doesn't make sense to me. As more NCEP applications adopt the hpc-stack, they will send the same requests as the UPP. @arunchawla-NOAA In the future, will the hpc-stack be installed at the WCOSS NCO operational library site?

aerorahul commented 3 years ago

@WenMeng-NOAA The fix files (crtm coefficients) are also used in the GSI. NCEPlibs does not install these, because they are not part of the emc_crtm repository. In order to add that to the module file, the path which is currently machine specific, needs to be made generic.

Why does the UPP need crtm source code path? Is it required for compiling? Then why not use the compiled crtm library?

WenMeng-NOAA commented 3 years ago

@aerorahul UPP compiling and runtime don't need the crtm source code. The crtm source path would be helpful for debugging issues in the UPP process that generates simulated satellite radiance. We have had several cases that required tracing back into the crtm code.

GeorgeVandenberghe-NOAA commented 3 years ago

CRTM has been an issue for years. The problem is that the binary files are large (4 GB) and source repositories cannot handle them. The github limit is 2 GB and it also gave vlab indigestion. Because of CRTM I never turned my old tarball NCEPLIBS into a repository object. I kept a bunch of CRTM versions and the total distribution added up to 20+ GB. Instead the tarball is on HPSS. But I do have source and fix. They are in $PKG/src and $PKG/fix where $PKG is $NCEPLIBS/crtm/crtm$VERSION. For example:

/gpfs/dell2/emc/modeling/noscrub/cases/l0701/lc/lib/crtm/v2.3.0/fix
/gpfs/dell2/emc/modeling/noscrub/cases/l0701/lc/lib/crtm/v2.3.0/src

Due to NCO conventions in place when I first snagged crtm and tried to make it portable, the library is /gpfs/dell2/emc/modeling/noscrub/cases/l0701/lc/lib/crtm/v2.3.0/intel/libcrtm_v2.3.0.a and includes are /gpfs/dell2/emc/modeling/noscrub/cases/l0701/lc/lib/crtm/v2.3.0/intel/include/crtm_v2.3.0

That's for a tarball snapshot I built for luna/surge in July prior to HPC-STACK.

Basically we need a way to maintain the huge database of crtm binary files since they won't fit in github and a crtm installation is indeed not complete without them. My way of keeping them in a tarball is not really satisfactory either

GeorgeVandenberghe-NOAA commented 3 years ago

For source, we need to preserve the source directory in $PKG/build or wherever cmake puts it, after build. This is generally a good idea anyway if you ever expect to need to follow a stack trace back to source. Why do we need to explicitly specify a source directory when running a CRTM code if we are debugging through stack traces?

Or does CRTM have its own diagnostics that detect runtime issues and point to source lines?

GeorgeVandenberghe-NOAA commented 3 years ago

The notation two comments up of crtm$VERSION is incorrect. It's just $VERSION without a crtm prefix

Hang-Lei-NOAA commented 3 years ago

Technically, it is not a problem, since we can use ftp to store large files. What matters is deciding whether hpc-stack will handle it, or whether the fix files will be collected and distributed by the models.


GeorgeVandenberghe-NOAA commented 3 years ago

See github issue for comments https://github.com/NOAA-EMC/hpc-stack/issues/119#


aerorahul commented 3 years ago

@aerorahul The UPP compiling and runtime doesn't need crtm source code. The crtm source path would helpful for debugging the issues of the UPP generating simulated satellite radiance process. We get several cases for tracking back in the crtm code.

@WenMeng-NOAA If you are using the source code for stack trace and debugging, why can you not use this reference: https://github.com/NOAA-EMC/EMC_crtm/tree/v2.3.0 From what I understand, you are simply trying to identify which line had an error (if debugging a failure). This is the CRTM source code that is being built and installed. No source code is preserved, unless they are being used to compile. It is not a standard practice.

GeorgeVandenberghe-NOAA commented 3 years ago

I would argue that since crtm is a centralized stack installation used by multiple modeling systems, it should be a part of the stack and the installation should be complete, otherwise each modeling system will have to maintain the binary data and supply environment pointers to wherever it is leading to both duplication and confusion. We should use ftp or some other large file API to make it so.

aerorahul commented 3 years ago

@GeorgeVandenberghe-NOAA We are manually making snapshots and maintaining https://github.com/NOAA-EMC/EMC_crtm We should instead use https://github.com/NOAA-EMC/crtm. This is the authoritative CRTM repository with a script to get the binary files from a UCAR hosted FTP site. We should work with the CRTM developers

GeorgeVandenberghe-NOAA commented 3 years ago

The original source code in the repository is no good for debugging if the compilation process, in particular cpp, changes it before the compiler sees it. Line numbers will not be the same in the stack trace and the original source code. For a valid stack trace back to the real failing line of code, we need the final files that the compiler actually sees. For an F90 repository file, this means we need to preserve the .f90 made by cpp prior to compilation.

GeorgeVandenberghe-NOAA commented 3 years ago

I agree we need to work with the authoritative crtm repository which now exists. It didn't in 2016 when I tried to make CRTM generally available on NOAA platforms and there was no authoritative site I could get to. At that time I worked from what NCO had already installed and reverse engineered it to work elsewhere. The idea that authoritative sites are inconsistent with what NCO wants, or connectivity to them is broken, comes from previous bitter experience with NOAA systems and yeah, we are a lot better now than we were in 2016

For hpc-stack, part of the crtm install should be getting the binary files from that site and working out any administrative barriers to doing so. There is also a reorganization of the directory structure, from how the crtm repository expects things to how NCO wants them, which was confusing on hera and orion when NCEPLIBS first took this on in 2018.

aerorahul commented 3 years ago

@GeorgeVandenberghe-NOAA Good point. But there is only 1 file that cpp acts on, and that is CRTM_Module.F90, where it enables a version number in the compiled library. So the argument that we need to save the build directory for crtm for stack tracing is debunked. There are also ESMF debuggers in our group. What do they do? Perhaps the better solution would be to build CRTM with debug flags if they want to debug CRTM, and place it alongside libcrtm.a, e.g. libcrtm.debug.a or something like that.

GeorgeVandenberghe-NOAA commented 3 years ago

One file is true for crtm. It is not true for the general issue of installing a general software package. ESMF leaves its source code lying around post build so stack traces can point to it, and I know UFS developers are exploiting that because they keep asking us for debug versions of the beta release of the week.

A case can be made though, our dependency libraries should be reliable enough we don't need to follow stack traces back into the source code in the first place. My gripes about using beta dependency libraries are off topic.. they don't belong in hpc-stack in the first place!

aerorahul commented 3 years ago

@GeorgeVandenberghe-NOAA Build tree and source code do not belong in the central stack install location. Developers who want that level of access/scrutiny should build their own version of the software and link against it. But that is my opinion, FWIW.

WenMeng-NOAA commented 3 years ago

If the hpc-stack developers decide to remove library source code from hpc-stack, then from the downstream user perspective we would like guidance on accessing the source code for troubleshooting real-time issues.

kgerheiser commented 3 years ago

We (Hang and I) maintain the source code (and logs) for each installation we do. There's no environment variable, but it's there if someone wants to look.

aerorahul commented 3 years ago

@WenMeng-NOAA Here are your instructions. The compiler module depends on the machine.

git clone https://github.com/noaa-emc/emc_crtm -b v2.3.0
mkdir build
cd build
module load compiler
cmake -DCMAKE_INSTALL_PREFIX=../install ../emc_crtm
make -j 6
make install

arunchawla-NOAA commented 3 years ago

This is an interesting discussion. I want to highlight that we want generalized modules for libraries with relative paths. This is so that we can do lift and replace without breaking anything. I have a question as to why we are using emc_crtm as opposed to crtm if that is the authoritative repository. Is it an access issue?

GeorgeVandenberghe-NOAA commented 3 years ago

"Lift and replace without breaking anything"

That ship has sailed. Modern software packages require absolute hard paths for configuration when they are being used and so must be reconfigured and reinstalled if moved. This is invisible to users but critical for package maintainers.

It has already bitten us hard on WCOSS2 where two filesystem moves required rebuilding of a large chunk of our stacks.

arunchawla-NOAA commented 3 years ago

We are moving into discussions that are going a little off topic. I want to get back to the discussion at hand. The paths that are in the modules for the hpc-stack are relative for easy installation, so there are no paths for things that are not part of the installation. That is why CRTM_FIX and CRTM_SRC are not part of these module files.

@WenMeng-NOAA can you define these in your script level?

A little off topic but I would like to know if we should move to the crtm authoritative repository
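A minimal sketch of what defining CRTM_FIX at the script level could look like is below, assuming the user already knows where the fix files live. The function name set_crtm_fix and the example path are hypothetical; nothing in hpc-stack provides them.

```shell
#!/bin/sh
# Hypothetical helper: export CRTM_FIX only if the directory really exists,
# so a wrong path fails early instead of at UPP runtime.
set_crtm_fix() {
    fix_dir=$1
    if [ -d "$fix_dir" ]; then
        CRTM_FIX=$fix_dir
        export CRTM_FIX
    else
        echo "CRTM fix directory not found: $fix_dir" >&2
        return 1
    fi
}

# Example call with a made-up location; it reports an error if the path is absent.
set_crtm_fix /path/to/crtm/2.3.0/fix || echo "CRTM_FIX not set" >&2
```

The same pattern would apply to CRTM_SRC if a source path is wanted for debugging.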

kgerheiser commented 3 years ago

CRTM_FIX could be installed and defined as part of hpc-stack. There's a script in the authoritative repository that does just that, but our fork doesn't include it.

GeorgeVandenberghe-NOAA commented 3 years ago

In order for Wen to define CRTM_FIX she needs to know where it is.

Faced with this problem I would log in and start hunting for where NCO put it which defeats the purpose of our own stack.

Can we just go after the definitive crtm repository rather than our own when building CRTM for hpc-stack and include the fix files?

edwardhartnett commented 3 years ago

I will add an issue for going to definitive CRTM.

WenMeng-NOAA commented 3 years ago

@kgerheiser Setting CRTM_FIX in crtm module would be helpful. If CRTM_SRC is not set, I would like to know the path of crtm source for trouble-shooting runtime issues. I usually use "module show" to find out library information.

GeorgeVandenberghe-NOAA commented 3 years ago

The crtm repository script to get the fix files simply hangs on hera.

This kind of stuff is why we need to deal with this at the stack build level rather than having users deal with it and it's also why I maintain a stable tarball rather than trusting NOAA to support access to a repository in a stable reliable way. We have intermittent issues accessing hdf5 this way too.

The repository script to get the fix files works on Jet. At least this week :-( !!!!!

GeorgeVandenberghe-NOAA commented 3 years ago

But assuming access is available, we should run this after compiling the library:

filename="fix_REL-2.4.0.tgz" #rel 2.4.0 files

if test -f "$filename"; then
    if [ -d "fix/" ]; then #fix directory exists
        echo "fix/ already exists, doing nothing."
    else
        #untar the file and move directory to fix
        tar -zxvf $filename
        mv fix_crtm-internal_develop fix
        echo "fix/ directory created from existing $filename file."
    fi
else
    #download, untar, move
    echo "downloading $filename, please wait about 5 minutes (3.2 GB tar file)"
    wget -q ftp://ftp.ucar.edu/pub/cpaess/bjohns/$filename #jedi set of CRTM binary files
    tar -zxvf $filename
    mv fix_crtm-internal_develop fix
    echo "fix/ directory created from downloaded $filename."
fi
echo "Completed."

And modify the last few lines to put "fix" where we want it.

Also 2.4.0 is NOT our current level so do this for 2.3.1 which is

Point is we should do it once and set the environment variable pointing to it in our module.
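As a rough sketch of that last step, the extracted directory could be moved under the crtm install prefix once, with the resulting path being what the module exports. The function name install_fix and the prefix layout are assumptions for illustration, not how hpc-stack actually does it.

```shell
#!/bin/sh
# Hypothetical helper: move an already extracted fix directory under an
# install prefix and print the export line a module or wrapper could use.
install_fix() {
    extracted_dir=$1   # e.g. fix_crtm-internal_develop after untarring
    prefix=$2          # e.g. the crtm/<version> install directory
    mkdir -p "$prefix" || return 1
    if [ -d "$prefix/fix" ]; then
        # Mirror the upstream script: do nothing if fix/ is already there.
        echo "$prefix/fix already exists, doing nothing." >&2
    else
        mv "$extracted_dir" "$prefix/fix" || return 1
    fi
    echo "export CRTM_FIX=$prefix/fix"
}
```

Calling install_fix fix_crtm-internal_develop "$PREFIX/crtm/2.3.0" after the tar step would replace the plain "mv fix_crtm-internal_develop fix" line, with $PREFIX standing in for wherever the stack is installed.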

aerorahul commented 3 years ago

Is this still an issue?

GeorgeVandenberghe-NOAA commented 3 years ago

no. It's fine now.


aerorahul commented 3 years ago

closing.

WenMeng-NOAA commented 3 years ago

It works in UPP building.