NaluCFD / Nalu

Nalu: a generalized, unstructured, massively parallel, low-Mach flow code designed to support a variety of open applications of interest, built on the Sierra Toolkit and the Trilinos Tpetra solver stack. The code base is released under the open-source 3-clause BSD license. See LICENSE for more information.
https://github.com/NaluCFD/Nalu

Investigate using Docker for Nalu #32

Closed: aprokop closed this issue 7 years ago

aprokop commented 8 years ago

Hmm, I cannot assign labels, or even self-assign for the issue.

aprokop commented 7 years ago

I am close to finishing building the TPLs in Docker. I have been ironing out bugs in Spack, as it is a tool with a lot of potential to simplify our TPL management. Spack is also part of some ECP projects.

I'm currently fixing the netcdf shared-library issue with Spack, but it's close. Once that's done I'll have two layers in Docker: base and TPLs. The next layer would be Trilinos, and then Nalu.
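A minimal sketch of that layering, assuming one Dockerfile per layer whose FROM line names the image built by the previous layer. Only Dockerfile_tpls is mentioned in this thread; the other file names and all image tags are hypothetical. By default it only prints the docker build commands.

```shell
#!/usr/bin/env bash
# Sketch: build the Nalu images layer by layer (base -> TPLs -> Trilinos -> Nalu).
# Assumption: each Dockerfile's FROM line names the tag built by the previous layer.
# Only Dockerfile_tpls appears in this thread; other file names and all tags are hypothetical.
set -euo pipefail

# Print the docker commands by default; set DRY_RUN=0 to actually build.
DRY_RUN="${DRY_RUN:-1}"

# Ordered "<Dockerfile> <image tag>" pairs; order matters because each layer builds on the last.
LAYERS=(
  "Dockerfile_base     nalu_base"
  "Dockerfile_tpls     nalu_tpls"
  "Dockerfile_trilinos nalu_trilinos"
  "Dockerfile_nalu     nalu"
)

# Echo the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

build_layers() {
  local entry file tag
  for entry in "${LAYERS[@]}"; do
    read -r file tag <<< "$entry"
    # Stop at the first failed layer, since every later layer depends on it.
    run docker build -t "$tag" -f "$file" . || return
  done
}
```

Calling build_layers with DRY_RUN=0 runs the builds. Keeping the layers separate means a Trilinos or Nalu rebuild does not have to redo the TPL layer.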

spdomin commented 7 years ago

Okay, let me know how it turns out. I am aware of Spack (as I was of Docker). The real test would be how it handles a Cray build. Do you have access to Cori?

aprokop commented 7 years ago

Do you have access to Cori?

I'll have to check. I used to have access to Edison, but haven't used it in a while.

jrood-nrel commented 7 years ago

Hello, my name is Jon. I have just begun working on a project with Nalu at NREL. My first order of business is to get Nalu building with Spack, and I'm glad to see I'm not the only one with that idea. I previously led the project to use Spack on the machines at NERSC. I am using the Peregrine machine at NREL at the moment and am trying to iron out the dependency issues I have run into so far. I would be curious whether you have notes on problems you have worked around, and which machine you are using.

aprokop commented 7 years ago

@jrood-nrel I have had a few troubles with Spack. I'm currently trying to install it in a Docker container with the latest Fedora image. The image is very minimal, and I just install a stock compiler in it, which is pretty new (gcc 6.2.1).

I'm trying to use the exact TPL versions specified on this page. Here is my current Spack script (the RUN commands are from the Dockerfile; you can ignore them):

RUN spack spec -I gcc       @4.8.5~binutils && \
    spack install gcc       @4.8.5~binutils
RUN spack compiler find /home/nalu/spack/opt/spack/linux-fedora24-x86_64/**/*

RUN spack spec -I openmpi           @1.8.8                              %gcc@4.8.5 && \
    spack install openmpi           @1.8.8                              %gcc@4.8.5
RUN spack spec -I zlib              @1.2.8                              %gcc@4.8.5 && \
    spack install zlib              @1.2.8                              %gcc@4.8.5
RUN spack spec -I libxml2           @2.9.2      -python                 %gcc@4.8.5  ^zlib@1.2.8 && \
    spack install libxml2           @2.9.2      -python                 %gcc@4.8.5  ^zlib@1.2.8
RUN spack spec -I boost             @1.55.0     +mpi                    %gcc@4.8.5  ^openmpi@1.8.8 && \
    spack install boost             @1.55.0     +mpi                    %gcc@4.8.5  ^openmpi@1.8.8

#RUN spack spec -I hdf5              @1.8.12     +mpi -fortran           %gcc@4.8.5  ^openmpi@1.8.8  ^zlib@1.2.8 && \
#    spack install hdf5              @1.8.12     +mpi -fortran           %gcc@4.8.5  ^openmpi@1.8.8  ^zlib@1.2.8
#RUN spack spec -I parallel-netcdf   @1.6.1      -fortran                %gcc@4.8.5  ^openmpi@1.8.8 && \
#    spack install parallel-netcdf   @1.6.1      -fortran                %gcc@4.8.5  ^openmpi@1.8.8

#RUN spack spec -I netcdf            @4.3.3.1    +mpi +parallel-netcdf -shared   %gcc@4.8.5  ^hdf5@1.8.12 ^parallel-netcdf@1.6.1 ^zlib@1.2.8 && \
#    spack install --no-checksum \
#                  netcdf            @4.3.3.1    +mpi +parallel-netcdf -shared   %gcc@4.8.5  ^hdf5@1.8.12 ^parallel-netcdf@1.6.1 ^zlib@1.2.8

#RUN spack spec -I cmake     @3.1.0                                      %gcc@4.8.5 && \
#    spack install --no-checksum \
#                  cmake     @3.1.0                                      %gcc@4.8.5
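The repeated spec-then-install pairs above can be folded into a small helper. This is a sketch, not part of the original Dockerfile: spec_and_install is a hypothetical name, and setting SPACK=echo prints the commands instead of running them. Because spack spec does not take --no-checksum, the helper passes that flag only to spack install.

```shell
#!/usr/bin/env bash
# Sketch: one call per package instead of a "spack spec -I ... && spack install ..." pair.
# spec_and_install is a hypothetical helper; SPACK=echo turns it into a dry run.
SPACK="${SPACK:-spack}"

spec_and_install() {
  local no_checksum=0
  # Accept --no-checksum as an optional first argument, for install only.
  if [ "${1:-}" = "--no-checksum" ]; then
    no_checksum=1
    shift
  fi
  # Show the concretized spec with install status (-I); stop if that fails.
  "$SPACK" spec -I "$@" || return
  if [ "$no_checksum" = 1 ]; then
    "$SPACK" install --no-checksum "$@"
  else
    "$SPACK" install "$@"
  fi
}
```

For example, spec_and_install --no-checksum cmake@3.1.0 %gcc@4.8.5 replaces the commented-out cmake pair.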

Some things to notice here:

  1. The current cmake package in Spack has to be patched to allow version 3.1.0 (see llnl/spack#2312).
  2. gcc 4.8.5 has to be built with the ~binutils variant; otherwise it fails to compile (see llnl/spack#2301).
  3. I create a file ~/.spack/packages.yaml with the following content to guide concretization:

         packages:
             parallel-netcdf:
                 variants: -fortran
             hdf5:
                 variants: -fortran

     This way I don't have to specify hdf5 and parallel-netcdf explicitly.
  4. I currently have a patched version of the netcdf package (see the netcdf_fix branch of aprokop/spack.git). It adds a parallel-netcdf variant.
  5. I currently have trouble installing netcdf: it fails because it cannot find -lpnetcdf. I think this is a shared/static library issue, but I have not been able to resolve it yet. I'll have to dig deeper.
  6. Some packages (cmake, netcdf) have to be installed with --no-checksum, because the Spack repo does not currently have the proper checksums for the versions we want.
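Item 3 can also be done non-interactively, for example from a Dockerfile RUN step, with a heredoc. A sketch: write_packages_yaml is a hypothetical helper name, and its directory argument defaults to ~/.spack as in the text.

```shell
#!/usr/bin/env bash
# Sketch: write the packages.yaml from item 3 without an editor.
# write_packages_yaml is hypothetical; the target directory defaults to ~/.spack.
write_packages_yaml() {
  local dir="${1:-$HOME/.spack}"
  mkdir -p "$dir"
  # Quoted 'EOF' keeps the YAML literal (no variable expansion).
  cat > "$dir/packages.yaml" <<'EOF'
packages:
    parallel-netcdf:
        variants: -fortran
    hdf5:
        variants: -fortran
EOF
}
```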
aprokop commented 7 years ago

OK, with some wizardry and hacking I was able to build the TPL Docker image. Now I'm trying to build Trilinos with it.

spdomin commented 7 years ago

Sounds good. Progress at NREL with Spack as well.

aprokop commented 7 years ago

Here is the list of TPL issues so far: llnl/spack#2312, llnl/spack#2342, llnl/spack#2301, llnl/spack#2360

jrood-nrel commented 7 years ago

Thanks. I am currently at the linking stage for Nalu. My approach is a little different however. I have started by using newer versions of TPLs and then moving backwards if I encounter problems like API changes, etc. I am using GCC 5.2, CMake 3.6.1, HDF5 1.8.16, Boost 1.62.0, netcdf 4.4.1, and maybe some other differences. I still need to get past the Nalu linking phase and then test the binaries. Then I will clean up my install commands, since they are cobbled together at the moment. I will report back when I have at least a tested build of Nalu using Spack. I am aiming to end up with essentially 'spack install binutils && spack load binutils && spack install nalu' on our Peregrine machine, with only a custom Trilinos package file and custom packages.yaml, compilers.yaml, and config.yaml files. I have created package files for yaml 0.5.3 and for Nalu and I can commit them to my fork of Spack if you would like them.
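The newest-first strategy can be sketched as a loop that tries a downstream build against successively older versions of one dependency and stops at the first one that works. This assumes an API break shows up as a failed spack install of the downstream package; install_with_fallback is a hypothetical helper, and the SuperLU versions in the usage line come from later in this thread.

```shell
#!/usr/bin/env bash
# Sketch: "start new, step back on failure" for a single dependency.
# install_with_fallback is hypothetical; SPACK can be overridden for dry runs or tests.
SPACK="${SPACK:-spack}"

# Usage: install_with_fallback <target spec> <dependency> <version newest> <older> ...
install_with_fallback() {
  local target="$1" dep="$2"
  shift 2
  local version
  for version in "$@"; do
    # Constrain the dependency with ^dep@version and try to build the target.
    if "$SPACK" install "$target" "^${dep}@${version}"; then
      echo "built $target with ${dep}@${version}"
      return 0
    fi
    echo "$target failed with ${dep}@${version}; stepping back" >&2
  done
  # No candidate version worked.
  return 1
}
```

For example, install_with_fallback trilinos superlu 5.2.1 4.3 would fall back to SuperLU 4.3 if the 5.2.1 API break makes the Trilinos build fail.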

aprokop commented 7 years ago

My approach is a little different however. I have started by using newer versions of TPLs and then moving backwards if I encounter problems like API changes, etc.

That is a viable, and I think faster, approach. I think most of the issues I encountered happened because I was trying to reproduce the standard environment exactly. I'm still dreading the Nalu compilation.

I have created package files for yaml 0.5.3 and for Nalu and I can commit them to my fork of Spack if you would like them.

Yes, that would be nice; I think we should submit a PR upstream. At the moment, I keep yaml-cpp external to Spack and run the following command in Dockerfile_tpls:

RUN export YAML_URL="https://github.com/jbeder/yaml-cpp" && \
    export YAML_HOME=${HOME}/yaml-cpp && \
    export YAML_SOURCE=${YAML_HOME}/source && \
    export YAML_BUILD=${YAML_HOME}/build && \
    export YAML_INSTALL=${HOME}/opt/yaml-cpp && \
    git clone --depth=1 ${YAML_URL} ${YAML_SOURCE} && \
    mkdir -p ${YAML_BUILD} && \
    cd ${YAML_BUILD} && \
    source ${HOME}/.bashrc && \
    spack load cmake && \
    spack load openmpi && \
    cmake -DCMAKE_CXX_COMPILER=mpicxx -DCMAKE_CXX_FLAGS=-std=c++11 -DCMAKE_INSTALL_PREFIX=${YAML_INSTALL} ${YAML_SOURCE} && \
    make && \
    make install && \
    rm -rf ${YAML_HOME}
jrood-nrel commented 7 years ago

I have committed my Nalu, yaml, and custom Trilinos package files as they are right now to my Spack repo. I can't claim they work at all, but they are what I am mainly developing at the moment.

jrood-nrel commented 7 years ago

I believe I have Nalu building with Spack, and some of the tests pass. Others look like they can't find their input files; my testing environment may be incomplete, since Nalu does not have a 'make install' process, so I am installing it manually and then modifying the test scripts. Still, I am assuming my binaries work, based on this:

Rtest Begin: 
..quad9HC..................... PASSED:  37.7  s
..steadyTaylorVortex.......... PASSED:  64.6  s
ERROR: Unable to open input exodus decomposed database files:
    ekmanSpiral.g.2.0
    ekmanSpiral.g.2.0
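When checking runs like this, a small filter can separate passing tests from setup errors. A sketch, assuming only the two line shapes visible above ("..<name>.... PASSED: <time> s" and lines starting with "ERROR:"); summarize_rtest is a hypothetical name, and other failure formats would need their own patterns.

```shell
#!/usr/bin/env bash
# Sketch: summarize Rtest-style output read on stdin.
# Assumes the line formats shown above; exits non-zero if any ERROR line appears.
summarize_rtest() {
  awk '
    /PASSED:/ { passed++; next }
    # "ERROR: " is 7 characters, so the message starts at column 8.
    /^ERROR:/ { errors++; print "error: " substr($0, 8) }
    END {
      printf "passed=%d errors=%d\n", passed, errors
      exit (errors > 0)
    }'
}
```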

My install script for our Peregrine machine currently looks like this, but I will encode most of it into the Nalu package file as dependencies and specs. I erased everything, ran this script, and ended up with the build that produced the test output above:

#!/bin/bash -l
set -e
#spack install binutils@2.27
#yes | spack module refresh
spack load binutils
#module list
spack install cmake@3.6.1
spack install libxml2@2.9.4
#spack install superlu@5.2.1 #need to add build capability of superlu 4.3 to spack
spack install hdf5+mpi+cxx@1.8.16 ^openmpi+tm@1.10.3
spack install boost+mpi@1.62.0 ^openmpi+tm@1.10.3
spack install netcdf@4.4.1 ^hdf5+mpi+cxx@1.8.16 ^openmpi+tm@1.10.3
spack install parallel-netcdf@1.7.0 ^openmpi+tm@1.10.3
spack install yaml-cpp@0.5.3 ^openmpi+tm@1.10.3
spack install trilinos@master~hypre~mumps~suite-sparse~superlu-dist~metis+superlu+boost+hdf5@12.8.1 ^boost+mpi@1.62.0 ^hdf5+mpi+cxx@1.8.16 ^openmpi+tm@1.10.3
spack install nalu ^trilinos@master~hypre~mumps~suite-sparse~superlu-dist~metis+superlu+boost+hdf5@12.8.1 ^boost+mpi@1.62.0 ^hdf5+mpi+cxx@1.8.16 ^openmpi+tm@1.10.3
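The repeated dependency constraints in this script could be kept in variables so each choice is stated once. A sketch using the same specs as above (the Trilinos spec copied exactly as written, and the binutils loading step left out); install_stack is a hypothetical function, and SPACK=echo prints the commands instead of running them.

```shell
#!/usr/bin/env bash
# Sketch: the Peregrine install script with shared constraints factored into variables.
# Specs are copied from the script above; install_stack is a hypothetical name.
SPACK="${SPACK:-spack}"

MPI='^openmpi+tm@1.10.3'
HDF5='hdf5+mpi+cxx@1.8.16'
BOOST='boost+mpi@1.62.0'
TRILINOS='trilinos@master~hypre~mumps~suite-sparse~superlu-dist~metis+superlu+boost+hdf5@12.8.1'

install_stack() {
  # Each step stops the function on failure, like set -e in the original script.
  "$SPACK" install cmake@3.6.1 || return
  "$SPACK" install libxml2@2.9.4 || return
  "$SPACK" install "$HDF5" "$MPI" || return
  "$SPACK" install "$BOOST" "$MPI" || return
  "$SPACK" install netcdf@4.4.1 "^$HDF5" "$MPI" || return
  "$SPACK" install parallel-netcdf@1.7.0 "$MPI" || return
  "$SPACK" install yaml-cpp@0.5.3 "$MPI" || return
  "$SPACK" install "$TRILINOS" "^$BOOST" "^$HDF5" "$MPI" || return
  "$SPACK" install nalu "^$TRILINOS" "^$BOOST" "^$HDF5" "$MPI"
}
```

Changing the MPI or HDF5 choice then means editing one variable rather than every line.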

Things I still need to do: add the ability to build SuperLU 4.3 to Spack, since 5.2.1 breaks the API, and generalize the yaml-cpp, Trilinos, and Nalu package files a bit better. Then I will organize everything I have done to make it usable at NREL first, and after that I can try to help generalize it for other facilities.

My Spack repo is fully up to date with my current yaml-cpp, trilinos, and nalu package files if you want to reference them.

jrood-nrel commented 7 years ago

Quick note: I have a pull request to add my SuperLU package file to Spack; it builds version 4.3, which Trilinos requires. Next I will submit my yaml-cpp package file. After that I will only need a custom Trilinos package file and a Nalu package file. I haven't decided whether to try to get a Nalu package file approved, since it depends on such a customized Trilinos build.

aprokop commented 7 years ago

@jrood-nrel Good job, your SuperLU (llnl/spack#2390) and yaml-cpp (llnl/spack#2399) packages have been merged into Spack master.

In the meantime, I've been able to compile Trilinos and Nalu using the older versions of the packages. I've been able to do a shared build with static versions of the TPLs (which required some updates to Spack packages to support "-fPIC"). I've also looked into shared versions of the TPLs, but at the moment that is stuck on the fact that parallel-netcdf does not support a shared build, and because of that a shared netcdf also fails to build. I'm going to update my hdf5 and zlib Spack PRs now that the pic_flag compiler option in Spack has been merged, and hopefully finish dealing with Spack that way.
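To see which kind of library a TPL install actually provides (relevant to the -lpnetcdf and shared-build problems), a quick check of the install prefix can help. A sketch assuming Linux naming (lib<name>.so and lib<name>.a under lib or lib64); library_kind is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: report "shared", "static", or "missing" for lib<name> under an install prefix.
# Assumes Linux naming; library_kind is a hypothetical helper.
library_kind() {
  local prefix="$1" name="$2" dir
  # Prefer a shared library if one exists in either lib directory.
  for dir in "$prefix/lib" "$prefix/lib64"; do
    if [ -e "$dir/lib${name}.so" ]; then
      echo shared
      return 0
    fi
  done
  # Otherwise look for a static archive.
  for dir in "$prefix/lib" "$prefix/lib64"; do
    if [ -e "$dir/lib${name}.a" ]; then
      echo static
      return 0
    fi
  done
  echo missing
  return 1
}
```

For example, library_kind "$(spack location -i parallel-netcdf)" pnetcdf reports what the parallel-netcdf install left behind.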

The current versions of the Docker containers are available online. Run docker pull aprokop/nalu_tpls to get the container. I have not yet added the configuration scripts, but will do so at the first opportunity.

I'm not sure where to put the configuration scripts. I feel it may be beneficial to have Nalu/docker or something similar, in case people want to rebuild the images.

spdomin commented 7 years ago

@NaluCFD/core, let's talk about the Docker/Spack path forward in a stand-up tomorrow.

jrood-nrel commented 7 years ago

I joined the stand-up late, and it sounded like the time was taken up by other topics, so I didn't want to make the call run long. Maybe we can discuss this at another stand-up. Sorry about that.

Anyway, I have some documentation at https://github.com/jrood-nrel/NaluSpack that describes my process for installing Nalu with Spack on NREL's Peregrine machine. These instructions should be quite relevant to other sites as well. When using the official Spack repo, the only files that need to be copied into Spack are our own custom nalu and nalu-trilinos package files. All of Nalu's other dependency requirements (at least for NREL, to begin with) have been pushed into the official Spack repo. The most difficult part going forward, I feel, will be getting the Spack compilers.yaml and packages.yaml configuration files right for your particular machine. Once that is taken care of, though, I think the ability to build through a package manager is very useful: for people new to the project; for project members who find themselves with access to new machines, i.e. blank slates; for quickly testing dependency options that may require rebuilding; and for quickly testing newer versions of dependencies so that the software stack doesn't get stale.

Next, I think it would be great to have a 'make install' option for Nalu for portability and standardization in its future distribution. However, if adding one would be a large effort given how the code is organized, the workaround was not much effort: I checked out the Nalu repo, copied my Spack-generated binaries into the build directory, and ran the tests that way:

Rtest Begin:
..periodic3dElemNp1........... PASSED: 6.1 s
..periodic3dElemNp4........... PASSED: 2.5 s
..periodic3dElemNp8........... PASSED: 1.6 s
..periodic3dEdgeNp1........... PASSED: 2.4 s
..periodic3dEdgeNp4........... PASSED: 1.4 s
..periodic3dEdgeNp8........... PASSED: 1.1 s
..quad9HC..................... PASSED: 36.4 s
..steadyTaylorVortex.......... PASSED: 64.6 s
...

Next I will try building with the Intel compiler. Marc and I are also working on using the Spack build to hopefully simplify some of the scripts underneath the CDash tests.

After that I will move on to other things, but I would always be very interested in helping to automate building the Nalu software stack at other sites.

spdomin commented 7 years ago

@jrood-nrel, if you are up to it, let's have you drive a parking lot discussion on using Spack for a sample build. Let's make sure that you are able to share your screen...

aprokop commented 7 years ago

@spdomin Good idea, I'd like to participate in that.

jrood-nrel commented 7 years ago

@spdomin @aprokop Sure, I think it would be helpful to give a live tutorial on using Spack in general first and then I can have a video prepared of the entire process I use to build Nalu where I can cut out the build times. Maybe 20 minutes in total?

spdomin commented 7 years ago

I may not have 20 minutes at the next stand-up. Perhaps a dedicated meeting might be in order? We can chat about it at the next stand-up.

jrood-nrel commented 7 years ago

No stress. I have something prepared whenever it makes sense to go over it and we can talk about it at the stand-up.

aprokop commented 7 years ago

I had a brief chat with a few local folks who manage Titan about their experiences with Spack. In summary: it's complicated, because different machines have different custom Cray environments. They have some code that has not been pushed to mainline yet. The good news is that we have local resources (at least at ORNL) to help if we want to run on Titan/Summit using Spack.

aprokop commented 7 years ago

My docker scripts and Spack configs are now available here.

michaelasprague commented 7 years ago

@aprokop @jrood-nrel Can we close this issue, given both your efforts and the current status of the Spack implementation?

jrood-nrel commented 7 years ago

That's fine with me.