NCAR / ncar-conda

ncar-conda - YAML inventories for Conda environments on NCAR HPC systems

Request for ML conda environment #7

Status: Open. Opened by jedwards4b 10 months ago

jedwards4b commented 10 months ago

@WillyChap and I have recently implemented a CESM interface to Python for use with machine learning. We could not use npl for this and had to set up our own conda environment. We would like to request that a new system-wide conda environment be provided that contains Python 3.10 or older (some of our components still use the imp module, which is going away in newer Python versions), plus numpy, pytorch, and tensorflow, with these tools built for GPU usage if any such optimization is necessary. It cannot include netcdf unless we can figure out how to solve the CESM link issues that prevented us from using npl. Our implementation uses forpy to call Python from Fortran.
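
For concreteness, a minimal sketch of what such an environment spec might look like, written as one of this repo's YAML inventories; the cesm-ml name and the conda-forge channel are illustrative assumptions, not a tested recipe:

# illustrative sketch only: Python 3.10 ML stack, no netcdf
cat > cesm-ml.yml <<'EOF'
name: cesm-ml
channels:
  - conda-forge
dependencies:
  - python=3.10
  - numpy
  - pytorch
  - tensorflow
EOF
conda env create -f cesm-ml.yml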

vanderwb commented 10 months ago

Thanks for posting. Some follow-up questions:

  1. Do you have a preferred name? I could do CESM ML (cesm-ml), but happy to take suggestions.
  2. Do you have target versions or version ranges for pytorch and tensorflow, or do you just want whatever conda picks for Python 3.10 from conda-forge that supports NVIDIA GPUs?
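
For reference, a version range would just be an ordinary conda match spec, for instance (the ranges below are placeholders to show the syntax, not recommendations):

# hypothetical range pins; actual versions TBD
mamba create --name cesm-ml python=3.10 numpy "pytorch>=2.0,<2.3" "tensorflow>=2.12"
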
dphow commented 10 months ago

As a suggestion, you can reference https://github.com/NCAR/aiml_gpu_ncar_envs for typical configuration settings that ensure TF or PyTorch is installed appropriately.
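
One common pattern for getting CUDA-enabled conda-forge builds, sketched here as an assumption rather than a description of what that repo actually does, is to override the __cuda virtual package at solve time; the exact build-string selectors vary by package version and channel:

# sketch: request CUDA-enabled variants even if no GPU is visible on the build node
export CONDA_OVERRIDE_CUDA=11.8
mamba create --name cesm-ml --channel conda-forge --override-channels \
    python=3.10 numpy "pytorch=*=cuda*" "tensorflow=*=cuda*"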

vanderwb commented 10 months ago

Thanks @dphow - I was going to ask for your input next, so you beat me to it. :)

It looks like the process can't be contained to the environment.yml file, so it won't be entirely reproducible from this repo, which is a bummer. But that's not on us; it's on the FANG companies.

I also see that this uses openmpi - is it just using TCP communications or is it trying to hook into something like Infiniband, which wouldn't work on Derecho?

I can post that question on the other repo if you prefer.
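
For what it's worth, a quick way to check which transports the conda-provided Open MPI was built with (assuming that environment is activated so ompi_info is on PATH):

# list the byte-transfer-layer (btl) components compiled into this Open MPI
ompi_info | grep -i "MCA btl"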

jedwards4b commented 10 months ago

I would hope that it would be useful to more than just CESM. Maybe call it mllib or mlenv? I don't have any specific version requests for pytorch or tensorflow at this time.

jedwards4b commented 10 months ago

Following up on this, I tried to build a simple environment using:

conda create --name ml5.6 python=3.10 numpy pytorch tensorflow

This resulted in a ton of warning output in the runtime environment:

deg0077.hsn.de.hpc.ucar.edu 0: 2024-01-29 13:30:09.544100: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
deg0077.hsn.de.hpc.ucar.edu 0: To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
deg0077.hsn.de.hpc.ucar.edu 0: /glade/work/jedwards/conda-envs/ml5.6/lib/python3.10/site-packages/h5py/__init__.py:36: UserWarning: h5py is running against HDF5 1.12.2 when it was built against 1.14.3, this may cause problems
deg0077.hsn.de.hpc.ucar.edu 0:   _warn(("h5py is running against HDF5 {0} when it was built against {1}, "
deg0077.hsn.de.hpc.ucar.edu 0: /glade/work/jedwards/conda-envs/ml5.6/lib/python3.10/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
deg0077.hsn.de.hpc.ucar.edu 0:   setattr(self, word, getattr(machar, word).flat[0])
deg0077.hsn.de.hpc.ucar.edu 0: /glade/work/jedwards/conda-envs/ml5.6/lib/python3.10/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
deg0077.hsn.de.hpc.ucar.edu 0:   return self._float_to_str(self.smallest_subnormal)
deg0077.hsn.de.hpc.ucar.edu 0: /glade/work/jedwards/conda-envs/ml5.6/lib/python3.10/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
deg0077.hsn.de.hpc.ucar.edu 0:   setattr(self, word, getattr(machar, word).flat[0])
deg0077.hsn.de.hpc.ucar.edu 0: /glade/work/jedwards/conda-envs/ml5.6/lib/python3.10/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
deg0077.hsn.de.hpc.ucar.edu 0:   return self._float_to_str(self.smallest_subnormal)
deg0077.hsn.de.hpc.ucar.edu 0: Warning! ***HDF5 library version mismatched error***
deg0077.hsn.de.hpc.ucar.edu 0: The HDF5 header files used to compile this application do not match
deg0077.hsn.de.hpc.ucar.edu 0: the version used by the HDF5 library to which this application is linked.
deg0077.hsn.de.hpc.ucar.edu 0: Data corruption or segmentation faults may occur if the application continues.
deg0077.hsn.de.hpc.ucar.edu 0: This can happen when an application was compiled by one version of HDF5 but
deg0077.hsn.de.hpc.ucar.edu 0: linked with a different version of static or shared HDF5 library.
deg0077.hsn.de.hpc.ucar.edu 0: You should recompile the application or check your shared library related
deg0077.hsn.de.hpc.ucar.edu 0: settings such as 'LD_LIBRARY_PATH'.
deg0077.hsn.de.hpc.ucar.edu 0: You can, at your own risk, disable this warning by setting the environment
deg0077.hsn.de.hpc.ucar.edu 0: variable 'HDF5_DISABLE_VERSION_CHECK' to a value of '1'.
deg0077.hsn.de.hpc.ucar.edu 0: Setting it to 2 or higher will suppress the warning messages totally.
deg0077.hsn.de.hpc.ucar.edu 0: Headers are 1.14.3, library is 1.12.2
deg0077.hsn.de.hpc.ucar.edu 0:         SUMMARY OF THE HDF5 CONFIGURATION
deg0077.hsn.de.hpc.ucar.edu 0:         =================================
deg0077.hsn.de.hpc.ucar.edu 0: 
deg0077.hsn.de.hpc.ucar.edu 0: General Information:
deg0077.hsn.de.hpc.ucar.edu 0: -------------------
deg0077.hsn.de.hpc.ucar.edu 0:                    HDF5 Version: 1.12.2
deg0077.hsn.de.hpc.ucar.edu 0:                   Configured on: 2023-10-26
deg0077.hsn.de.hpc.ucar.edu 0:                   Configured by: Unix Makefiles
deg0077.hsn.de.hpc.ucar.edu 0:                     Host system: Linux-5.14.21-150400.24.18-default
deg0077.hsn.de.hpc.ucar.edu 0:               Uname information: Linux
deg0077.hsn.de.hpc.ucar.edu 0:                        Byte sex: little-endian
deg0077.hsn.de.hpc.ucar.edu 0:              Installation point: //////////////////////////////////////////////////////////////////////////////////////glade/u/apps/derecho/23.09/spack/opt/spack/hdf5/1.12.2/cray-mpich/8.1.27/oneapi/2023.2.1/avlh
deg0077.hsn.de.hpc.ucar.edu 0: 
deg0077.hsn.de.hpc.ucar.edu 0: Compiling Options:
deg0077.hsn.de.hpc.ucar.edu 0: ------------------
deg0077.hsn.de.hpc.ucar.edu 0:                      Build Mode: Release
deg0077.hsn.de.hpc.ucar.edu 0:               Debugging Symbols: OFF
deg0077.hsn.de.hpc.ucar.edu 0:                         Asserts: OFF
deg0077.hsn.de.hpc.ucar.edu 0:                       Profiling: OFF
deg0077.hsn.de.hpc.ucar.edu 0:              Optimization Level: OFF
deg0077.hsn.de.hpc.ucar.edu 0: 
deg0077.hsn.de.hpc.ucar.edu 0: Linking Options:
deg0077.hsn.de.hpc.ucar.edu 0: ----------------
deg0077.hsn.de.hpc.ucar.edu 0:                       Libraries: 
deg0077.hsn.de.hpc.ucar.edu 0:   Statically Linked Executables: OFF
deg0077.hsn.de.hpc.ucar.edu 0:                         LDFLAGS: 
deg0077.hsn.de.hpc.ucar.edu 0:                      H5_LDFLAGS: 
deg0077.hsn.de.hpc.ucar.edu 0:                      AM_LDFLAGS: 
deg0077.hsn.de.hpc.ucar.edu 0:                 Extra libraries: m;dl;/opt/cray/pe/mpich/8.1.27/ofi/intel/2022.1/lib/libmpi_intel.so
deg0077.hsn.de.hpc.ucar.edu 0:                        Archiver: /usr/bin/ar
deg0077.hsn.de.hpc.ucar.edu 0:                          Ranlib: /usr/bin/ranlib
deg0077.hsn.de.hpc.ucar.edu 0: 
deg0077.hsn.de.hpc.ucar.edu 0: Languages:
deg0077.hsn.de.hpc.ucar.edu 0: ----------
deg0077.hsn.de.hpc.ucar.edu 0:                               C: YES
deg0077.hsn.de.hpc.ucar.edu 0:                      C Compiler: /glade/u/apps/derecho/23.09/spack/lib/spack/env/oneapi/icx 2023.2.0
deg0077.hsn.de.hpc.ucar.edu 0:                        CPPFLAGS: 
deg0077.hsn.de.hpc.ucar.edu 0:                     H5_CPPFLAGS: 
deg0077.hsn.de.hpc.ucar.edu 0:                     AM_CPPFLAGS: 
deg0077.hsn.de.hpc.ucar.edu 0:                          CFLAGS:  -std=c99 -Wno-error=implicit-function-declaration
deg0077.hsn.de.hpc.ucar.edu 0:                       H5_CFLAGS: 
deg0077.hsn.de.hpc.ucar.edu 0:                       AM_CFLAGS: 
deg0077.hsn.de.hpc.ucar.edu 0:                Shared C Library: YES
deg0077.hsn.de.hpc.ucar.edu 0:                Static C Library: YES
deg0077.hsn.de.hpc.ucar.edu 0: 
deg0077.hsn.de.hpc.ucar.edu 0:                         Fortran: ON
deg0077.hsn.de.hpc.ucar.edu 0:                Fortran Compiler: /glade/u/apps/derecho/23.09/spack/lib/spack/env/oneapi/ifx 2023.2.0
deg0077.hsn.de.hpc.ucar.edu 0:                   Fortran Flags: 
deg0077.hsn.de.hpc.ucar.edu 0:                H5 Fortran Flags: 
deg0077.hsn.de.hpc.ucar.edu 0:                AM Fortran Flags: 
deg0077.hsn.de.hpc.ucar.edu 0:          Shared Fortran Library: YES
deg0077.hsn.de.hpc.ucar.edu 0:          Static Fortran Library: YES
deg0077.hsn.de.hpc.ucar.edu 0: 
deg0077.hsn.de.hpc.ucar.edu 0:                             C++: ON
deg0077.hsn.de.hpc.ucar.edu 0:                    C++ Compiler: /glade/u/apps/derecho/23.09/spack/lib/spack/env/oneapi/icpx 2023.2.0
deg0077.hsn.de.hpc.ucar.edu 0:                       C++ Flags:  
deg0077.hsn.de.hpc.ucar.edu 0:                    H5 C++ Flags: 
deg0077.hsn.de.hpc.ucar.edu 0:                    AM C++ Flags: 
deg0077.hsn.de.hpc.ucar.edu 0:              Shared C++ Library: YES
deg0077.hsn.de.hpc.ucar.edu 0:              Static C++ Library: YES
deg0077.hsn.de.hpc.ucar.edu 0: 
deg0077.hsn.de.hpc.ucar.edu 0:                             JAVA: OFF
deg0077.hsn.de.hpc.ucar.edu 0:                  JAVA Compiler:  
deg0077.hsn.de.hpc.ucar.edu 0: 
deg0077.hsn.de.hpc.ucar.edu 0: Features:
deg0077.hsn.de.hpc.ucar.edu 0: ---------
deg0077.hsn.de.hpc.ucar.edu 0:                      Parallel HDF5: ON
deg0077.hsn.de.hpc.ucar.edu 0:   Parallel Filtered Dataset Writes: ON
deg0077.hsn.de.hpc.ucar.edu 0:                 Large Parallel I/O: ON
deg0077.hsn.de.hpc.ucar.edu 0:                 High-level library: ON
deg0077.hsn.de.hpc.ucar.edu 0: Dimension scales w/ new references: 
deg0077.hsn.de.hpc.ucar.edu 0:                   Build HDF5 Tests: OFF
deg0077.hsn.de.hpc.ucar.edu 0:                   Build HDF5 Tools: ON
deg0077.hsn.de.hpc.ucar.edu 0:        Build High-level HDF5 Tools: ON
deg0077.hsn.de.hpc.ucar.edu 0:                       Threadsafety: OFF
deg0077.hsn.de.hpc.ucar.edu 0:                Default API mapping: v112
deg0077.hsn.de.hpc.ucar.edu 0:     With deprecated public symbols: ON
deg0077.hsn.de.hpc.ucar.edu 0:             I/O filters (external):  DEFLATE DECODE ENCODE
deg0077.hsn.de.hpc.ucar.edu 0:                                MPE: 
deg0077.hsn.de.hpc.ucar.edu 0:                         Direct VFD: 
deg0077.hsn.de.hpc.ucar.edu 0:                         Mirror VFD: 
deg0077.hsn.de.hpc.ucar.edu 0:                 (Read-Only) S3 VFD: 
deg0077.hsn.de.hpc.ucar.edu 0:               (Read-Only) HDFS VFD: 
deg0077.hsn.de.hpc.ucar.edu 0:                            dmalloc: 
deg0077.hsn.de.hpc.ucar.edu 0:     Packages w/ extra debug output: 
deg0077.hsn.de.hpc.ucar.edu 0:                        API Tracing: OFF
deg0077.hsn.de.hpc.ucar.edu 0:               Using memory checker: OFF
deg0077.hsn.de.hpc.ucar.edu 0:    Memory allocation sanity checks: OFF
deg0077.hsn.de.hpc.ucar.edu 0:             Function Stack Tracing: OFF
deg0077.hsn.de.hpc.ucar.edu 0:                   Use file locking: best-effort
deg0077.hsn.de.hpc.ucar.edu 0:          Strict File Format Checks: OFF
deg0077.hsn.de.hpc.ucar.edu 0:       Optimization Instrumentation: 
deg0077.hsn.de.hpc.ucar.edu 0: Bye...

after which it aborted. The 1.12.2 library is evidently the module-stack HDF5 that the CESM executable is linked against, while the conda environment's h5py was built against 1.14.3, so we either need to match the HDF5 version in the build or see if it's possible to build without it.
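
For reference, a quick way to see both sides of the mismatch from inside the environment (a sketch assuming h5py and ldd are available): h5py reports the HDF5 it was built against, and ldd shows which libhdf5 its compiled extension actually resolves to at run time.

# HDF5 version h5py was built against, plus related build info
python -c "import h5py; print(h5py.version.info)"
# libhdf5 actually resolved at run time for h5py's compiled extension
ldd "$(python -c 'import h5py.h5 as m; print(m.__file__)')" | grep -i hdf5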

vanderwb commented 10 months ago

Cross-posting this here:

The problem with making a general "ml" Python environment that we support is that there are other stakeholders we'd need to pull in to discuss what should and shouldn't be in an "ML/AI" environment. If it's clearly intended to be for CESM-ML coupling specifically, then it becomes easier to deploy. I'm fine doing either approach and was planning to pursue the former along with Daniel, but both he and I are on PTO for most of this week and we have the outage next week, so realistically getting something like that together wouldn't happen until afterward.

As for "building tensorflow from scratch", that's a real pain to do especially if you want access to a range of versions in a timely manner. Assuming that this is motivated by the HDF5 version difference you shared in the ticket, there are two paths forward that avoid building TF:

  1. netCDF, as of version 4.9.2, finally supports the 1.14 HDF5 API, so I'm fine with switching from 1.12 to 1.14 now in the module stack. I'll test that out and plan to use 1.14 in future deployments.
  2. In the meantime, you should be able to use HDF5 1.12 in your conda environment as follows:
mamba create --name ml5.6 python=3.10 numpy pytorch tensorflow hdf5=1.12

Finally, what does the 5.6 versioning mean, @jedwards4b?

jedwards4b commented 10 months ago

5.6 is the CIME branch this is going on; it's the CIME version used in CESM 2.1.x, and it's compatible with the CMIP6 runs, so there is plenty of training data out there.

I'm not at all attached to that name, and a more general ML environment probably shouldn't use it.