geo-fluid-dynamics / phaseflow-fenics

Phaseflow simulates the convection-coupled melting and solidification of phase-change materials.
MIT License

Test on JURECA in time for 14.08.2017 computing application/proposal #29

Closed agzimmerman closed 7 years ago

agzimmerman commented 7 years ago

The deadline for a compute time proposal is 14 August 2017, 17:00 CEST.

It should be easy enough to define a 3D problem that we want to solve, one which would require a very large number of degrees of freedom to resolve the phase-change interface.

The hard part is getting FEniCS running on our JURECA test account. For the purposes of the proposal, Docker might be enough; but as of now I don't know if JURECA supports Docker. If they do and if we just use that for the next couple of weeks, then we have until November (when the compute time would begin) to actually build FEniCS.
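For context, this is roughly how I run FEniCS through Docker on my laptop, assuming the standard image that the FEniCS project publishes on quay.io and that the repository is the current working directory:

# Mount the current directory into the container so the tests can be run in place.
docker run -ti -v $(pwd):/home/fenics/shared quay.io/fenicsproject/stable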

agzimmerman commented 7 years ago

Note that I haven't actually applied the "important" label to this, because we might choose to focus on the other aspects of the project for now, and to get into the next round of compute time proposals. Maybe I can begin this discussion tomorrow in person with @julekowalski.

Edit: To clarify, I personally like the idea of focusing hard on getting this running on JURECA right now, even though that will distract a bit from preparing for the conferences in September.

agzimmerman commented 7 years ago

Today I e-mailed Ketan for help (per Wendy's direction), and he directed me to e-mail the software support team at sc@fz-juelich.de, which I did.

As of now, I've managed to clone the necessary fenics repos and to install (with pip) the Python packages; but cmake is failing to find Boost and Eigen3 when trying to build dolfin. These packages might simply not yet exist on the system, but I am pausing this for now.
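Assuming JURECA uses the Lmod module system (so that module spider is available), a quick way to check whether those packages exist on the system would be:

# Search all stages/toolchains for matching modules.
module spider Boost
module spider Eigen

If a package is found, module spider also reports which other modules have to be loaded before it.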

The build instructions from the FEniCS developers are here: https://fenicsproject.org/download/

Info about JURECA is here: http://www.fz-juelich.de/ias/jsc/EN/Expertise/Supercomputers/JURECA/UserInfo/UserInfo_node.html

agzimmerman commented 7 years ago

A backup plan:

Only apply for shared memory, i.e. a single node with many shared-memory processors. Then I can just show scaling up to four processors on my desktop.
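A rough sketch of what that desktop scaling study could look like, assuming we drive the parallelism with mpirun and time a representative driver script (the script name is a placeholder):

# Time the same benchmark on 1 through 4 processes.
for n in 1 2 3 4; do
    time mpirun -n $n python benchmark.py
done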

agzimmerman commented 7 years ago

I might sit at FZJ for a couple of days next week. I'm discussing this with Ketan and Sean. The only technicality is needing a guest pass to enter FZJ.

julekowalski commented 7 years ago

sounds good - go for it if possible .. J

agzimmerman commented 7 years ago

I have made quite a bit of progress with building FEniCS on JURECA.

Presently I have hit a roadblock with installing "swig" (the Simplified Wrapper and Interface Generator, a tool for interfacing Python and C++, among other languages); but I have an open ticket with SC support, and I expect to hear from them on Monday or Tuesday.

agzimmerman commented 7 years ago

SWIG is installed for Python 2.7.13 but not 3+. (Edit: The SC support team installed SWIG for Python 3 per my request, and it can now be loaded with "module load SWIG/.3.0.12-Python-3.6.1")

I fumbled around a bit more, and somehow got CMake to configure everything for Dolfin, including SWIG (for Python 2), with something like the following:

module load GCC
module load CMake
module load Boost
module load Eigen/3.3.3
module load Python/3.6.1
module load SWIG/.3.0.12-Python-2.7.13
module load Python/3.6.1
module load ParMETIS/4.0.3
module load Intel/2016.4.258-GCC-5.4.0
module load ParaStationMPI/5.1.9-1
module load Python/3.6.1
cmake -DEIGEN3_INCLUDE_DIR=/usr/local/software/jureca/Stages/2017a/software/Eigen/3.3.3-GCC-5.4.0/include ..

And now I've installed to a directory on my user account. -DCMAKE_INSTALL_PREFIX didn't behave as expected; I had to use "make DESTDIR= install" instead. One of the FEniCS developers is currently exploring this error with me.
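For the record, the workaround looks something like the following (the destination directory here is just an example). Note that with DESTDIR, the files land under $DESTDIR/$CMAKE_INSTALL_PREFIX, which is part of why the resulting paths were surprising:

# Stage the install under a user-writable directory.
make DESTDIR=$HOME/Installed install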

Next, while building mshr, it was failing to find the dolfin include dir. There is no -DDOLFIN_INCLUDE_DIR option, but I found this, which did the trick: https://stackoverflow.com/questions/25849571/adding-include-directories-to-cmake-when-calling-it-from-command-line So I ended up using:

cmake -DGMP_LIBRARIES=/usr/local/software/jureca/Stages/2017a/software/GMP/6.1.2-GCCcore-5.4.0/lib -DMPFR_LIBRARIES=/usr/local/software/jureca/Stages/2017a/software/MPFR/3.1.5-GCC-5.4.0/lib -DMPFR_INCLUDE_DIR=/usr/local/software/jureca/Stages/2017a/software/MPFR/3.1.5-GCC-5.4.0/include -DDOLFIN_DIR=/homeb/paj1726/paj17261/Installed/python3.6/site-packages/dolfin/share/dolfin/cmake -DCMAKE_CXX_FLAGS=-isystem\ /homeb/paj1726/paj17261/Installed/python3.6/site-packages/dolfin/include ..

Presently mshr is failing to link, and I've asked the installation channel of the FEniCS Slack team for help. cmake gives the warning

WARNING: Target "mshr" requests linking to directory "/usr/local/software/jureca/Stages/2017a/software/GMP/6.1.2-GCCcore-5.4.0/lib". Targets may link only to libraries. CMake is dropping the item.

(also for MPFR) and then linking fails during "make install".
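In hindsight, the fix is to point GMP_LIBRARIES and MPFR_LIBRARIES at the shared-library files themselves rather than at their directories, as in the working invocation quoted later in this thread (using the $EBROOT variables that the EasyBuild modules set):

# Pass library files, not directories, so CMake can link against them.
cmake -DGMP_LIBRARIES=$EBROOTGMP/lib/libgmp.so -DMPFR_LIBRARIES=$EBROOTMPFR/lib/libmpfr.so ..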

agzimmerman commented 7 years ago

I have been stuck on trying to include and link Boost properly for a while now. I've asked for help from the FEniCS Slack team, on the installation channel. I've also asked SC Support to simply build fenics for me on JURECA.

Now I'm also trying to start over using my own installation of Anaconda and conda install. See https://fenicsproject.org/qa/8679/install-fenics-without-root-privileges and https://anaconda.org/conda-forge/fenics
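A minimal sketch of the conda route, assuming the conda-forge package works on this system:

# Install FEniCS from conda-forge into its own environment.
conda create --name fenics -c conda-forge fenics
source activate fenics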

agzimmerman commented 7 years ago

SC support was able to build FEniCS and successfully test phaseflow! It may take some time for them to build the module and make it available to all users, so I will try to follow the build procedure they provided.

Here's the latest message from them:

Hi Alex,

the following two tests failed:

tests/test_ghia1982.py::test_ghia1982_steady_lid_driven_cavity_linearized
tests/test_stefan_problem.py::test_pci_refinement

everything else passed.

Cheers,

Benedikt Steinbusch
Juelich Supercomputing Support Team

agzimmerman commented 7 years ago

Here are the e-mails that cover the successful build procedure on JURECA:

Dear Alexander Zimmerman,

I will ask again to have FEniCS installed centrally via our EasyBuild mechanism. It might take a while though, as our software manager is currently busy preparing the software stack for the new KNL extension of Jureca.

In the mean time I can try to help you with your own installation of FEniCS. I have now gotten to the point where I can "import fenics" in a Python REPL and it does not crash. I installed the packages in the order described here: https://fenicsproject.org/download/

I stuck with Python 2.7, since it seems there is no version of Boost that is compatible with Python 3 on Jureca(?)

Anyway, the list of modules I loaded looks like this:

Currently Loaded Modules:

1) GCCcore/.5.4.0 (H)
2) binutils/.2.28 (H)
3) StdEnv (H)
4) ncurses/.6.0 (H)
5) libevent/.2.1.8 (H)
6) tmux/2.3
7) icc/.2017.2.174-GCC-5.4.0 (H)
8) ifort/.2017.2.174-GCC-5.4.0 (H)
9) Intel/2017.2.174-GCC-5.4.0
10) pscom/.Default (H)
11) ParaStationMPI/5.1.9-1
12) imkl/2017.2.174
13) bzip2/.1.0.6 (H)
14) libreadline/.7.0 (H)
15) Tcl/.8.6.6 (H)
16) SQLite/.3.17.0 (H)
17) expat/.2.2.0 (H)
18) libpng/.1.6.28 (H)
19) freetype/.2.7.1 (H)
20) fontconfig/.2.12.1 (H)
21) X11/20170129
22) Tk/.8.6.6 (H)
23) GMP/6.1.2
24) XZ/.5.2.3 (H)
25) libxml2/.2.9.4 (H)
26) libxslt/.1.1.29 (H)
27) libffi/.3.2.1 (H)
28) libyaml/.0.1.7 (H)
29) Java/1.8.0_121
30) PostgreSQL/9.6.2
31) protobuf/.3.3.0 (H)
32) gflags/.2.2.0 (H)
33) Python/2.7.13
34) SciPy-Stack/2017a-Python-2.7.13
35) Boost/1.63.0-Python-2.7.13
36) CMake/3.7.2
37) Eigen/3.3.3
38) pkg-config/.0.29.1 (H)
39) zlib/.1.2.11 (H)
40) PCRE/.8.40 (H)
41) SWIG/.3.0.12-Python-2.7.13 (H)
42) MPFR/3.1.5

All of the pure Python packages that are installed simply via pip worked without a problem for me. I did also have to install the Python ply package via pip. Then for the two cmake packages, my cmake invocations looked like this:

CC=icc CXX=icpc EIGEN3_ROOT=$EBROOTEIGEN/include cmake -DCMAKE_INSTALL_PREFIX=$HOME/opt/dolfin -DDOLFIN_USE_PYTHON3=OFF ..

and

CC=icc CXX=icpc EIGEN3_ROOT=$EBROOTEIGEN/include cmake -DCMAKE_INSTALL_PREFIX=$HOME/opt/mshr -DGMP_LIBRARIES=$EBROOTGMP/lib/libgmp.so -DGMP_INCLUDE_DIR=$EBROOTGMP/include -DMPFR_LIBRARIES=$EBROOTMPFR/lib/libmpfr.so -DMPFR_INCLUDE_DIR=$EBROOTMPFR/include -DDOLFIN_DIR=$HOME/opt/dolfin/share/dolfin/cmake ..

The install prefix can be changed to your taste, of course.

Once these two are installed, I had to extend the PYTHONPATH to find the Python packages included with dolfin and mshr:

export PYTHONPATH=$HOME/opt/mshr/lib/python2.7/site-packages:$HOME/opt/dolfin/lib/python2.7/site-packages:$PYTHONPATH

and the LD_LIBRARY_PATH to find the mshr and dolfin libraries:

export LD_LIBRARY_PATH=$HOME/opt/dolfin/lib:$HOME/opt/mshr/lib:$LD_LIBRARY_PATH

And that way I can import the fenics package in Python. I will try to see if it can be used beyond that point and get back to you.

and

Dear Alexander Zimmerman,

I just tested my installation of FEniCS and could successfully run the first tutorial program here:

https://fenicsproject.org/pub/tutorial/html/._ftut1004.html

including the plots. Some things I discovered along the way:

  • The file $HOME/opt/dolfin/share/dolfin/dolfin.conf that is installed with dolfin sets up your environment to use dolfin (at least LD_LIBRARY_PATH, PYTHONPATH, etc.)
  • If you compiled dolfin with CC=icc and CXX=icpc, you also have to set those when using fenics, i.e. probably just export them to the environment permanently.

I hope this helps you with getting FEniCS to work for now. I will get back to you when I know more about a central installation.

agzimmerman commented 7 years ago

I was able to reproduce Benedikt's build procedure, and the same tests passed for me.

There appears to be a bug with dolfin's CMakeLists.txt. If I try to install anywhere other than $HOME/opt, everything gets messed up.

Also, the tests seem to be taking quite a lot longer to run here than within the Docker machine on my laptop. The first thing that comes to mind is that PETSc was not found: PETSc is on JURECA, but I failed to configure dolfin to use it (CMake tried running some test that failed, so it didn't use it). UMFPACK, which I think is a widely used sparse LU solver, is also probably quite important, but it does not appear to be installed at all.
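If the PETSc module on JURECA follows the same EasyBuild conventions as the packages above (setting an $EBROOTPETSC variable), then a sketch of pointing dolfin's CMake at it would be:

# PETSC_DIR is how dolfin's CMake locates PETSc; enabling it explicitly
# makes any failure visible in the configure output instead of a silent skip.
module load PETSc
PETSC_DIR=$EBROOTPETSC cmake -DDOLFIN_ENABLE_PETSC=ON ..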

There is at least one other package that will be very important for us: HDF5.

I asked Benedikt how we should proceed with configuring fenics with these other libraries.

agzimmerman commented 7 years ago

Unfortunately we built 2017.2.0.dev0, and until now I have only been using 2016.2.0 (from the Docker image I originally started with). This might be why two tests aren't passing.

The failed tests can be listed with

$ python -m pytest --cache-show

cache/lastfailed contains:
{u'tests/test_ghia1982.py::test_ghia1982_steady_lid_driven_cavity_linearized': True,
 u'tests/test_stefan_problem.py::test_pci_refinement': True}
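pytest's cache also makes it convenient to re-run only those failures:

# Re-run just the tests recorded in cache/lastfailed.
python -m pytest --last-failed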

agzimmerman commented 7 years ago

I was having too much trouble trying to build dolfin-2016.2.0 on JURECA. I saw that the latest stable version of fenics/dolfin is 2017.1.0; so I tested phaseflow-fenics on that (with Docker on my laptop), which worked (i.e. all tests passed). Now I'm trying to build 2017.1.0 on JURECA.

I updated phaseflow-fenics's continuous integration process to always use the latest stable version of fenics, which is now reflected in this repository's master branch as of pull request #37.

agzimmerman commented 7 years ago

fenics 2017.1.0 built and installed without error, but Python fails to import fenics. The error occurs when running dolfin/cpp/la.py, which was evidently generated by SWIG.

There was a similar issue documented here: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=863829

I've asked for help on the fenics-project Slack team's installation channel. Jan Blechta replied, but so far has just said:

AttributeError: 'module' object has no attribute 'cpp'

This is usually hard to debug. What can help is doing some digging with the Python debugger pdb to figure out what goes wrong. Also python -v or python -vv might be useful to debug imports.

python -v made it clear that plenty of other things were being imported from cpp just fine.

Digging in with pdb made it clear exactly where the error was occurring:

-bash-4.2$ python -m pdb cpp_import_error.py
> /homeb/paj1726/paj17261/build_dolfin_2017.1.0/debug_cpp_import_error/cpp_import_error.py(1)<module>()
-> import fenics
(Pdb) continue
Traceback (most recent call last):
  File "/usr/local/software/jureca/Stages/2017a/software/SciPy-Stack/2017a-intel-para-2017a-Python-2.7.13/lib/python2.7/pdb.py", line 1314, in main
    pdb._runscript(mainpyfile)
  File "/usr/local/software/jureca/Stages/2017a/software/SciPy-Stack/2017a-intel-para-2017a-Python-2.7.13/lib/python2.7/pdb.py", line 1233, in _runscript
    self.run(statement)
  File "/usr/local/software/jureca/Stages/2017a/software/SciPy-Stack/2017a-intel-para-2017a-Python-2.7.13/lib/python2.7/bdb.py", line 400, in run
    exec cmd in globals, locals
  File "<string>", line 1, in <module>
  File "cpp_import_error.py", line 1, in <module>
    import fenics
  File "/homeb/paj1726/paj17261/opt/dolfin/lib/python2.7/site-packages/fenics/__init__.py", line 7, in <module>
    from dolfin import *
  File "/homeb/paj1726/paj17261/opt/dolfin/lib/python2.7/site-packages/dolfin/__init__.py", line 17, in <module>
    from . import cpp
  File "/homeb/paj1726/paj17261/opt/dolfin/lib/python2.7/site-packages/dolfin/cpp/__init__.py", line 43, in <module>
    exec("from . import %s" % module_name)
  File "<string>", line 1, in <module>
  File "/homeb/paj1726/paj17261/opt/dolfin/lib/python2.7/site-packages/dolfin/cpp/la.py", line 232, in <module>
    class LinearAlgebraObject(dolfin.cpp.common.Variable):
AttributeError: 'module' object has no attribute 'cpp'
Uncaught exception. Entering post mortem debugging
Running 'cont' or 'step' will restart the program
> /homeb/paj1726/paj17261/opt/dolfin/lib/python2.7/site-packages/dolfin/cpp/la.py(232)<module>()
-> class LinearAlgebraObject(dolfin.cpp.common.Variable):

Also I confirmed that dolfin/cpp/la.py was generated using SWIG 3.0.12 with Python 2.7.13. To verify that the proper Python version was used, I had to step into the la.py script with pdb and print _swig_python_version_info before it was deleted.

agzimmerman commented 7 years ago

I didn't mean to close this...

agzimmerman commented 7 years ago

Benedikt built fenics-2017.1.0 (the latest stable version) and installed it in my user home directory; and all but one test passes!

The failed test is "test_ghia1982_steady_lid_driven_cavity_linearized":

bash-4.2$ python -m pytest --cache-show
================================================ test session starts ================================================
platform linux2 -- Python 2.7.13, pytest-3.2.0, py-1.4.34, pluggy-0.4.0
rootdir: /homeb/paj1726/paj17261/fenics/phaseflow-fenics, inifile:
cachedir: /homeb/paj1726/paj17261/fenics/phaseflow-fenics/.cache
--------------------------------------------------- cache values ----------------------------------------------------
cache/lastfailed contains:
{u'tests/test_ghia1982.py::test_ghia1982_steady_lid_driven_cavity_linearized': True}

The other two steady tests pass, and the adaptive space case also uses the linearized form. It is peculiar that this one test fails. Before closing this issue, I should figure out what the relevant differences are between the Docker image's build and our build on JURECA. But for now, I am going to set this aside to focus on writing the proposal, since we have the functionality we need.

agzimmerman commented 7 years ago

Actually I'll open a new issue for investigating the failed test, since it's no longer tied to the compute time proposal.

agzimmerman commented 7 years ago

So far I'm not observing any OpenMP scaling, and tests fail when using MPI.

agzimmerman commented 7 years ago

From the FEniCS Book:

Shared memory parallel computing. Multithreaded assembly for finite element matrices and vectors on shared memory machines is supported using OpenMP. It is activated by setting the number of threads to use via the parameter system. For example, the code parameters["num_threads"] = 6; instructs DOLFIN to use six threads in the assembly process.

Edit: I can't find this "num_threads" parameter, e.g.:

>>> fenics.parameters["num_threads"]

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/dolfin/cpp/common.py", line 2555, in __getitem__
    raise KeyError("'%s'"%key)
KeyError: "'num_threads'"

and fenics.info(fenics.parameters, verbose=True) doesn't show any such parameter.
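A crash-free way to probe for it, assuming the Parameters class's has_parameter method is exposed to Python:

# Prints True/False instead of raising KeyError; if False, the parameter
# presumably no longer exists in this version of DOLFIN.
python -c 'import fenics; print(fenics.parameters.has_parameter("num_threads"))'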

agzimmerman commented 7 years ago

According to an answer from November 2013, multi-threaded matrix assembly does not work with MPI, and it does not use the cache well (so there is first a slowdown, which has to be compensated by using many threads).

agzimmerman commented 7 years ago

Today I'm calling "no go" on completing this in time for this round of proposals.

This endeavor led to many useful updates to Phaseflow, including pull requests #37, #40, #43, and #50. Travis now automatically tests Phaseflow both in serial and with MPI ("mpirun -n 2"). We can now write solutions to HDF5/XDMF (which was necessary for parallel output) and also write solutions only at specific times, since at large scale it will be too costly to output most of the time steps. I'm probably forgetting some other nice features that were captured in the recent pull requests.
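Concretely, the Travis configuration now runs the test suite twice, along the lines of:

# Serial run, then a two-process MPI run of the same suite.
python -m pytest
mpirun -n 2 python -m pytest

(The exact invocations live in the repository's .travis.yml.)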

I opened a new issue #51 to investigate the lack of OpenMP speedup.

agzimmerman commented 7 years ago

From Benedikt on August 9:

all done! You should have a working installation of FEniCS in your home directory. To use it, you can follow these steps:

1) Load the necessary modules:

$ module restore fenics_deps

2) Load the dolfin configuration

$ source $HOME/opt/dolfin/share/dolfin/dolfin.conf

And that should be it. I tried running your tests and get 10 out of 11 to pass. Please give it a try and let me know if it does not work for you.

agzimmerman commented 7 years ago

And on August 17:

I hope you were able to get some useful results for your compute time proposal. In the mean time our software manager has installed a version of FEniCS. On JURECA, software is installed in stages, which are "frozen" and declared stable twice a year. The FEniCS modules are installed in the unstable Devel stage for the moment, to allow you to test them and provide feedback on whether they are useful to you the way they are. If everything works well, there is a chance of having them move into the next stable stage which should come out around November.

To get access to the Devel stage, you make the module system aware of the alternative stages first

module use /usr/local/software/jureca/OtherStages

and then load a specific stage

module load Stages/Devel-2017a

Now you load a toolchain (compiler and MPI library) and the DOLFIN module, which should pull in all of the components and dependencies of FEniCS:

module load Intel ParaStationMPI DOLFIN

Currently, there are two installations of FEniCS available, both FEniCS version 2016.2.0, one sitting on top of Python 2.7.13 (this is the default), the other using Python 3.6.1. The Python 2.7.13 version passes all but three of your tests, maybe again due to the version of FEniCS? The Python 3.6.1 version does not seem to work at all right now, maybe you have a better idea of what went wrong with this installation?

I would suggest you try out these installations of FEniCS before the new stage comes out around November and see if all of the features you need are enabled. It also should be possible to upgrade the installation to a more recent version of FEniCS by then, if you are otherwise satisfied with it.

In case you have any feedback on the installation, please post it as a reply to this ticket. The software manager is watching this as well, so he will be informed. We might not always be able to respond immediately, but it should be possible to get the FEniCS installation into good shape by November.

agzimmerman commented 7 years ago

Looks like I stopped documenting this issue, but there have been some changes that I'm documenting now.

The local user installation from August 9 has HDF5 misconfigured. The global installation from August 17 is the wrong version of fenics (it should be 2017.1.0).
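A quick way to confirm which version a given installation actually provides (dolfin carries a version string):

# Should print 2017.1.0 for the version we want.
python -c 'import dolfin; print(dolfin.__version__)'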

agzimmerman commented 7 years ago

Sebastian set up another global installation for me with version 2017.1.0. To load:

module --force purge
module use /usr/local/software/jureca/OtherStages
module load Stages/Devel-2017a
module load intel-para
module load DOLFIN/2017.1.0-Python-2.7.13

Unfortunately fenics fails to import, with a familiar error:

$ python
Python 2.7.13 (default, Apr 19 2017, 17:29:43)
[GCC Intel(R) C++ gcc 5.4 mode] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import fenics
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/software/jureca/Stages/Devel-2017a/software/DOLFIN/2017.1.0-intel-para-2017a-Python-2.7.13/lib/python2.7/site-packages/fenics/__init__.py", line 7, in <module>
    from dolfin import *
  File "/usr/local/software/jureca/Stages/Devel-2017a/software/DOLFIN/2017.1.0-intel-para-2017a-Python-2.7.13/lib/python2.7/site-packages/dolfin/__init__.py", line 17, in <module>
    from . import cpp
  File "/usr/local/software/jureca/Stages/Devel-2017a/software/DOLFIN/2017.1.0-intel-para-2017a-Python-2.7.13/lib/python2.7/site-packages/dolfin/cpp/__init__.py", line 43, in <module>
    exec("from . import %s" % module_name)
  File "<string>", line 1, in <module>
  File "/usr/local/software/jureca/Stages/Devel-2017a/software/DOLFIN/2017.1.0-intel-para-2017a-Python-2.7.13/lib/python2.7/site-packages/dolfin/cpp/la.py", line 232, in <module>
    class LinearAlgebraObject(dolfin.cpp.common.Variable):
AttributeError: 'module' object has no attribute 'cpp'