Open jedwards4b opened 10 months ago
Thanks for posting. Some follow-up questions:
As a suggestion, you can reference https://github.com/NCAR/aiml_gpu_ncar_envs for typical configuration settings to ensure TF or PyTorch is installed appropriately.
Thanks @dphow - I was going to ask for your input next, so you beat me to it. :)
It looks like the process can't be contained to the environment.yml file, so it won't be entirely reproducible from this repo, which is a bummer. That's not on us, though, but rather on the FANG companies.
I also see that this uses openmpi - is it just using TCP communications or is it trying to hook into something like Infiniband, which wouldn't work on Derecho?
I can post that question on the other repo if you prefer.
I would hope that it was useful to more than just cesm. Maybe call it mllib or mlenv? I don't have any specific version requests for pytorch and tensorflow at this time.
Following up on this, I tried to build a simple environment using:
conda create --name ml5.6 python=3.10 numpy pytorch tensorflow
This resulted in a ton of warning output in the runtime environment:
deg0077.hsn.de.hpc.ucar.edu 0: 2024-01-29 13:30:09.544100: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
deg0077.hsn.de.hpc.ucar.edu 0: To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
deg0077.hsn.de.hpc.ucar.edu 0: /glade/work/jedwards/conda-envs/ml5.6/lib/python3.10/site-packages/h5py/__init__.py:36: UserWarning: h5py is running against HDF5 1.12.2 when it was built against 1.14.3, this may cause problems
deg0077.hsn.de.hpc.ucar.edu 0: _warn(("h5py is running against HDF5 {0} when it was built against {1}, "
deg0077.hsn.de.hpc.ucar.edu 0: /glade/work/jedwards/conda-envs/ml5.6/lib/python3.10/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
deg0077.hsn.de.hpc.ucar.edu 0: setattr(self, word, getattr(machar, word).flat[0])
deg0077.hsn.de.hpc.ucar.edu 0: /glade/work/jedwards/conda-envs/ml5.6/lib/python3.10/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
deg0077.hsn.de.hpc.ucar.edu 0: return self._float_to_str(self.smallest_subnormal)
deg0077.hsn.de.hpc.ucar.edu 0: /glade/work/jedwards/conda-envs/ml5.6/lib/python3.10/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
deg0077.hsn.de.hpc.ucar.edu 0: setattr(self, word, getattr(machar, word).flat[0])
deg0077.hsn.de.hpc.ucar.edu 0: /glade/work/jedwards/conda-envs/ml5.6/lib/python3.10/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
deg0077.hsn.de.hpc.ucar.edu 0: return self._float_to_str(self.smallest_subnormal)
deg0077.hsn.de.hpc.ucar.edu 0: Warning! ***HDF5 library version mismatched error***
deg0077.hsn.de.hpc.ucar.edu 0: The HDF5 header files used to compile this application do not match
deg0077.hsn.de.hpc.ucar.edu 0: the version used by the HDF5 library to which this application is linked.
deg0077.hsn.de.hpc.ucar.edu 0: Data corruption or segmentation faults may occur if the application continues.
deg0077.hsn.de.hpc.ucar.edu 0: This can happen when an application was compiled by one version of HDF5 but
deg0077.hsn.de.hpc.ucar.edu 0: linked with a different version of static or shared HDF5 library.
deg0077.hsn.de.hpc.ucar.edu 0: You should recompile the application or check your shared library related
deg0077.hsn.de.hpc.ucar.edu 0: settings such as 'LD_LIBRARY_PATH'.
deg0077.hsn.de.hpc.ucar.edu 0: You can, at your own risk, disable this warning by setting the environment
deg0077.hsn.de.hpc.ucar.edu 0: variable 'HDF5_DISABLE_VERSION_CHECK' to a value of '1'.
deg0077.hsn.de.hpc.ucar.edu 0: Setting it to 2 or higher will suppress the warning messages totally.
deg0077.hsn.de.hpc.ucar.edu 0: Headers are 1.14.3, library is 1.12.2
deg0077.hsn.de.hpc.ucar.edu 0: SUMMARY OF THE HDF5 CONFIGURATION
deg0077.hsn.de.hpc.ucar.edu 0: =================================
deg0077.hsn.de.hpc.ucar.edu 0:
deg0077.hsn.de.hpc.ucar.edu 0: General Information:
deg0077.hsn.de.hpc.ucar.edu 0: -------------------
deg0077.hsn.de.hpc.ucar.edu 0: HDF5 Version: 1.12.2
deg0077.hsn.de.hpc.ucar.edu 0: Configured on: 2023-10-26
deg0077.hsn.de.hpc.ucar.edu 0: Configured by: Unix Makefiles
deg0077.hsn.de.hpc.ucar.edu 0: Host system: Linux-5.14.21-150400.24.18-default
deg0077.hsn.de.hpc.ucar.edu 0: Uname information: Linux
deg0077.hsn.de.hpc.ucar.edu 0: Byte sex: little-endian
deg0077.hsn.de.hpc.ucar.edu 0: Installation point: //////////////////////////////////////////////////////////////////////////////////////glade/u/apps/derecho/23.09/spack/opt/spack/hdf5/1.12.2/cray-mpich/8.1.27/oneapi/2023.2.1/avlh
deg0077.hsn.de.hpc.ucar.edu 0:
deg0077.hsn.de.hpc.ucar.edu 0: Compiling Options:
deg0077.hsn.de.hpc.ucar.edu 0: ------------------
deg0077.hsn.de.hpc.ucar.edu 0: Build Mode: Release
deg0077.hsn.de.hpc.ucar.edu 0: Debugging Symbols: OFF
deg0077.hsn.de.hpc.ucar.edu 0: Asserts: OFF
deg0077.hsn.de.hpc.ucar.edu 0: Profiling: OFF
deg0077.hsn.de.hpc.ucar.edu 0: Optimization Level: OFF
deg0077.hsn.de.hpc.ucar.edu 0:
deg0077.hsn.de.hpc.ucar.edu 0: Linking Options:
deg0077.hsn.de.hpc.ucar.edu 0: ----------------
deg0077.hsn.de.hpc.ucar.edu 0: Libraries:
deg0077.hsn.de.hpc.ucar.edu 0: Statically Linked Executables: OFF
deg0077.hsn.de.hpc.ucar.edu 0: LDFLAGS:
deg0077.hsn.de.hpc.ucar.edu 0: H5_LDFLAGS:
deg0077.hsn.de.hpc.ucar.edu 0: AM_LDFLAGS:
deg0077.hsn.de.hpc.ucar.edu 0: Extra libraries: m;dl;/opt/cray/pe/mpich/8.1.27/ofi/intel/2022.1/lib/libmpi_intel.so
deg0077.hsn.de.hpc.ucar.edu 0: Archiver: /usr/bin/ar
deg0077.hsn.de.hpc.ucar.edu 0: Ranlib: /usr/bin/ranlib
deg0077.hsn.de.hpc.ucar.edu 0:
deg0077.hsn.de.hpc.ucar.edu 0: Languages:
deg0077.hsn.de.hpc.ucar.edu 0: ----------
deg0077.hsn.de.hpc.ucar.edu 0: C: YES
deg0077.hsn.de.hpc.ucar.edu 0: C Compiler: /glade/u/apps/derecho/23.09/spack/lib/spack/env/oneapi/icx 2023.2.0
deg0077.hsn.de.hpc.ucar.edu 0: CPPFLAGS:
deg0077.hsn.de.hpc.ucar.edu 0: H5_CPPFLAGS:
deg0077.hsn.de.hpc.ucar.edu 0: AM_CPPFLAGS:
deg0077.hsn.de.hpc.ucar.edu 0: CFLAGS: -std=c99 -Wno-error=implicit-function-declaration
deg0077.hsn.de.hpc.ucar.edu 0: H5_CFLAGS:
deg0077.hsn.de.hpc.ucar.edu 0: AM_CFLAGS:
deg0077.hsn.de.hpc.ucar.edu 0: Shared C Library: YES
deg0077.hsn.de.hpc.ucar.edu 0: Static C Library: YES
deg0077.hsn.de.hpc.ucar.edu 0:
deg0077.hsn.de.hpc.ucar.edu 0: Fortran: ON
deg0077.hsn.de.hpc.ucar.edu 0: Fortran Compiler: /glade/u/apps/derecho/23.09/spack/lib/spack/env/oneapi/ifx 2023.2.0
deg0077.hsn.de.hpc.ucar.edu 0: Fortran Flags:
deg0077.hsn.de.hpc.ucar.edu 0: H5 Fortran Flags:
deg0077.hsn.de.hpc.ucar.edu 0: AM Fortran Flags:
deg0077.hsn.de.hpc.ucar.edu 0: Shared Fortran Library: YES
deg0077.hsn.de.hpc.ucar.edu 0: Static Fortran Library: YES
deg0077.hsn.de.hpc.ucar.edu 0:
deg0077.hsn.de.hpc.ucar.edu 0: C++: ON
deg0077.hsn.de.hpc.ucar.edu 0: C++ Compiler: /glade/u/apps/derecho/23.09/spack/lib/spack/env/oneapi/icpx 2023.2.0
deg0077.hsn.de.hpc.ucar.edu 0: C++ Flags:
deg0077.hsn.de.hpc.ucar.edu 0: H5 C++ Flags:
deg0077.hsn.de.hpc.ucar.edu 0: AM C++ Flags:
deg0077.hsn.de.hpc.ucar.edu 0: Shared C++ Library: YES
deg0077.hsn.de.hpc.ucar.edu 0: Static C++ Library: YES
deg0077.hsn.de.hpc.ucar.edu 0:
deg0077.hsn.de.hpc.ucar.edu 0: JAVA: OFF
deg0077.hsn.de.hpc.ucar.edu 0: JAVA Compiler:
deg0077.hsn.de.hpc.ucar.edu 0:
deg0077.hsn.de.hpc.ucar.edu 0: Features:
deg0077.hsn.de.hpc.ucar.edu 0: ---------
deg0077.hsn.de.hpc.ucar.edu 0: Parallel HDF5: ON
deg0077.hsn.de.hpc.ucar.edu 0: Parallel Filtered Dataset Writes: ON
deg0077.hsn.de.hpc.ucar.edu 0: Large Parallel I/O: ON
deg0077.hsn.de.hpc.ucar.edu 0: High-level library: ON
deg0077.hsn.de.hpc.ucar.edu 0: Dimension scales w/ new references:
deg0077.hsn.de.hpc.ucar.edu 0: Build HDF5 Tests: OFF
deg0077.hsn.de.hpc.ucar.edu 0: Build HDF5 Tools: ON
deg0077.hsn.de.hpc.ucar.edu 0: Build High-level HDF5 Tools: ON
deg0077.hsn.de.hpc.ucar.edu 0: Threadsafety: OFF
deg0077.hsn.de.hpc.ucar.edu 0: Default API mapping: v112
deg0077.hsn.de.hpc.ucar.edu 0: With deprecated public symbols: ON
deg0077.hsn.de.hpc.ucar.edu 0: I/O filters (external): DEFLATE DECODE ENCODE
deg0077.hsn.de.hpc.ucar.edu 0: MPE:
deg0077.hsn.de.hpc.ucar.edu 0: Direct VFD:
deg0077.hsn.de.hpc.ucar.edu 0: Mirror VFD:
deg0077.hsn.de.hpc.ucar.edu 0: (Read-Only) S3 VFD:
deg0077.hsn.de.hpc.ucar.edu 0: (Read-Only) HDFS VFD:
deg0077.hsn.de.hpc.ucar.edu 0: dmalloc:
deg0077.hsn.de.hpc.ucar.edu 0: Packages w/ extra debug output:
deg0077.hsn.de.hpc.ucar.edu 0: API Tracing: OFF
deg0077.hsn.de.hpc.ucar.edu 0: Using memory checker: OFF
deg0077.hsn.de.hpc.ucar.edu 0: Memory allocation sanity checks: OFF
deg0077.hsn.de.hpc.ucar.edu 0: Function Stack Tracing: OFF
deg0077.hsn.de.hpc.ucar.edu 0: Use file locking: best-effort
deg0077.hsn.de.hpc.ucar.edu 0: Strict File Format Checks: OFF
deg0077.hsn.de.hpc.ucar.edu 0: Optimization Instrumentation:
deg0077.hsn.de.hpc.ucar.edu 0: Bye...
after which it aborted. So we either need to match the HDF5 version in the build or see if it's possible to build without it.
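To make the mismatch easy to spot in a long job log, the version pair can be pulled out of the "Headers are ..., library is ..." line that HDF5 prints. This is only a minimal sketch: the function name `find_hdf5_mismatch` is made up for illustration, and it assumes the log contains that line in the same form as above.

```python
import re

# Pattern for the summary line HDF5 prints when headers and library disagree.
_MISMATCH = re.compile(r"Headers are (\d+(?:\.\d+)*), library is (\d+(?:\.\d+)*)")


def find_hdf5_mismatch(log_text):
    """Return (header_version, library_version), or None if no such line exists.

    Hypothetical helper for illustration; it only reads text and does not
    touch HDF5 or h5py.
    """
    match = _MISMATCH.search(log_text)
    if match is None:
        return None
    return match.group(1), match.group(2)


# Two lines copied from the log above, including the per-rank host prefix.
sample = (
    "deg0077.hsn.de.hpc.ucar.edu 0: Headers are 1.14.3, library is 1.12.2\n"
    "deg0077.hsn.de.hpc.ucar.edu 0: SUMMARY OF THE HDF5 CONFIGURATION\n"
)
print(find_hdf5_mismatch(sample))
```

The first value is the version the conda packages were compiled against and the second is the library actually found at runtime, which here is the Derecho spack build.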
Cross-posting this here:
The problem with making a general "ml" Python environment that we support is that there are other stakeholders we'd need to pull in to discuss what should and shouldn't be in an "ML/AI" environment. If it's clearly intended to be for CESM-ML coupling specifically, then it becomes easier to deploy. I'm fine doing either approach and was planning to pursue the former along with Daniel, but both he and I are on PTO for most of this week and we have the outage next week, so realistically getting something like that together wouldn't happen until afterward.
As for "building tensorflow from scratch", that's a real pain to do, especially if you want access to a range of versions in a timely manner. Assuming that this is motivated by the HDF5 version difference you shared in the ticket, there are two paths forward that avoid building TF:
mamba create --name ml5.6 python=3.10 numpy pytorch tensorflow hdf5=1.12
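The pin in that command can be derived from the library version reported in the log rather than typed by hand. A rough sketch, assuming the system library version is already known (1.12.2 above); `conda_pin` and `create_command` are hypothetical helpers that only build the command string and do not run it.

```python
def conda_pin(package, version, parts=2):
    """Build a conda spec such as 'hdf5=1.12' from a full version string."""
    prefix = ".".join(version.split(".")[:parts])
    return f"{package}={prefix}"


def create_command(env_name, python_version, packages, hdf5_library_version, tool="mamba"):
    """Assemble a create command that pins hdf5 to the given library's major.minor."""
    specs = [
        f"python={python_version}",
        *packages,
        conda_pin("hdf5", hdf5_library_version),
    ]
    return " ".join([tool, "create", "--name", env_name, *specs])


print(create_command("ml5.6", "3.10", ["numpy", "pytorch", "tensorflow"], "1.12.2"))
```

With these inputs the printed string is the same command as above. Pinning to major.minor leaves the solver free to pick any 1.12.x build.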
Finally - what does the 5.6 versioning mean, @jedwards4b ?
5.6 is the CIME branch this is going on; it's the CIME used in cesm2.1.x, and it's compatible with the CMIP6 runs, so there is plenty of training data out there.
I'm not at all attached to that name, and a more general ML environment probably shouldn't use it.
@WillyChap and I have recently implemented a CESM interface to Python for use with machine learning. We could not use the NPL to do this and had to set up our own conda environment. We would like to request that a new system-wide conda environment be provided that contains Python 3.10 or older (some of our components still use imp, which is deprecated and removed in Python 3.12), numpy, pytorch, and tensorflow, with these tools optimized for GPU usage if any such optimization is necessary. It cannot include netcdf unless we can figure out how to solve the CESM link issues that prevented us from using the NPL. Our implementation uses forpy to allow calling Python from Fortran.
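For components that still depend on imp, one possible migration is to load modules through importlib instead, which would loosen the Python version constraint. The sketch below is an assumption about usage: the thread does not say which imp calls are involved, so it shows a stand-in for `imp.load_source` only, and `example_component` is a made-up module name.

```python
import importlib.util
import sys
import tempfile
from pathlib import Path


def load_source(module_name, path):
    """Load a module from a file path using importlib (stand-in for imp.load_source)."""
    spec = importlib.util.spec_from_file_location(module_name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {module_name} from {path}")
    module = importlib.util.module_from_spec(spec)
    # Register before executing so the module can be found by name during import.
    sys.modules[module_name] = module
    spec.loader.exec_module(module)
    return module


# Write a throwaway module to a temporary directory and load it by path.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "example_component.py"
    source.write_text("ANSWER = 6 * 7\n")
    mod = load_source("example_component", source)
    print(mod.ANSWER)
```

Other imp functions, such as `find_module` and `load_module`, have their own importlib counterparts and would need separate handling.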