Closed: raquelalegre closed this issue 8 years ago.
James found the project code: eCSE0506. The project doesn't appear in the list of projects in SAFE, though, so we can't request a new login to try the code. James is chasing Chris Johnson from EPCC about this.
We've all got new accounts for the project now and can get to an ARCHER login node, e.g.:
ssh raquel@login.archer.ac.uk
I'm trying to run HJCFIT on a login node before going any further. It doesn't quite work yet. This is what we've done so far:
Clone the development branch of HJCFIT's repo and create build dir:
git clone -b develop https://github.com/DCPROGS/HJCFIT.git
cd HJCFIT
mkdir build
cd build
Load latest cmake:
module load cmake/3.2.3
Load swig:
module load swig
Load python3 (we need Python 3.5+ so we can use the `@` matrix multiplication operator instead of `np.dot`):
module load anaconda-compute/2.2.0-python3
Locally install behave for the current user:
pip install --user behave
export PATH=$HOME/.local/bin/:$PATH
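As a quick aside on why Python 3 is needed: for 2-D arrays the `@` operator (Python 3.5+, NumPy 1.10+) computes the same matrix product as `np.dot`. A minimal illustration:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# `@` performs matrix multiplication, equivalent to np.dot for 2-D arrays
assert (A @ B == np.dot(A, B)).all()
```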
Swap in the GNU programming environment (for a C++11-capable compiler):
module unload PrgEnv-cray
module load PrgEnv-gnu/5.2.56
Clean build folder:
git clean -xdf
rm -rf external
Run cmake:
cmake ..
-- [NumPy] NPY_ARRAY_* macros exist = FALSE
-- [NumPy] PyArray_ENABLEFLAGS exists = FALSE
Run make:
make
[ 69%] Building CXX object likelihood/tests/CMakeFiles/test_asymptotes.dir/asymptotes.cc.o
Linking CXX executable ../test_asymptotes
/usr/bin/ld: attempted static link of dynamic object `../liblikelihood.so'
collect2: error: ld returned 1 exit status
make[2]: *** [likelihood/test_asymptotes] Error 1
make[1]: *** [likelihood/tests/CMakeFiles/test_asymptotes.dir/all] Error 2
make: *** [all] Error 2
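The `attempted static link of dynamic object` error is the Cray toolchain's static-by-default linking refusing to link against the shared `liblikelihood.so`. On Cray systems this is normally controlled by an environment variable, so I'm assuming the fix looks something like this (set before running cmake):

```shell
# ask the Cray compiler wrappers to link dynamically instead of statically
export CRAYPE_LINK_TYPE=dynamic
```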
For configuring a job to run the tests on ARCHER's compute nodes, we can look into what Jens did to run Zacros on ARCHER. I'm assuming it's this kind of stuff, @jenshnielsen?
The code now runs on an ARCHER login node. The required modules and environment configuration have been added in a new repo branch, see PR #82. This will need to be run before attempting to compile the code with cmake on the login nodes.
Once the code is compiled, we should be able to execute it as part of an ARCHER job, which is what we will test next. Everything seems to work on the login nodes, but that's no guarantee it will on the compute nodes as well.
Running the tests fails in the jobs and it's unclear why; I'm working on it at the moment. Jens showed me there's a way to run jobs interactively on ARCHER, which is much faster than submitting them to a queue. I'll add this to the documentation in the wiki, but just so I keep track of it, I'll paste it here:
qsub -q short -IVl select=1,walltime=0:5:0 -A ecse0506
- `-q short`: there is a short queue that runs 9am-5pm and accepts interactive jobs that need fewer than 8 nodes and less than 20 minutes to run.
- `-I` indicates the job is interactive.
- `-V` exports the user's environment (I think it runs `~/.bashrc`).
- `-l` is followed by the resource list:
  - `select=1` indicates one node will be used.
  - `walltime=0:10:0` indicates 10 minutes of time available for our job.
- `-A ecse0506`: the project code after `-A` indicates the budget the time/resources allocated should be charged to.

Once the command is run, we'll be dropped into the home directory of one of the MOM nodes. This will let us run `aprun` commands and test HJCFIT without having to wait for jobs to go through the standard queue.
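Putting that together, an interactive test session would look something like this sketch (the test binary path is the one used further down in this thread):

```shell
# request 1 node for 5 minutes on the short queue, interactively
qsub -q short -IVl select=1,walltime=0:5:0 -A ecse0506
# ...once the prompt comes back on a MOM node:
aprun -n 1 /work/ecse0506/ecse0506/raquel/HJCFIT/build/likelihood/test_qmatrix
```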
I could see what the problem mentioned earlier was by logging into the MOM nodes via the short interactive queue and running `cmake` with `aprun`. It turns out the version of `cmake` available on the MOM nodes and login nodes is 3.2.3, but the compute nodes where I was trying to run `cmake` are not meant for that (they have a lighter OS that only has cmake 2.6).
The right way of doing this is to build the HJCFIT code in my $WORK folder, then have the job run the commands for the individual tests with `aprun`, this way:
# C++ tests
aprun -n 1 /work/ecse0506/ecse0506/raquel/HJCFIT/build/likelihood/test_qmatrix
aprun -n 1 /work/ecse0506/ecse0506/raquel/HJCFIT/build/documentation/doc_cxx_approx_survivor
# Python tests (won't work on the compute nodes because home is not visible; we'll also need a venv)
aprun -n 1 /home/ecse0506/ecse0506/raquel/.local/bin/behave "/work/ecse0506/ecse0506/raquel/HJCFIT/likelihood/python/features/approx_survivor.feature" "--junit" "--junit-directory" "/work/ecse0506/ecse0506/raquel/HJCFIT/build/test-results/" "-q"
aprun -n 1 /work/y07/y07/cse/anaconda/2.2.0-python3/bin/python "/home/ecse0506/ecse0506/raquel/HJCFIT/documentation/code/approx_survivor.py"
Note this might need reviewing to decide which python module to use on the compute nodes. Behave is installed in my HOME folder, which is not visible from the compute nodes either; it should be in WORK, or maybe loaded from the job.
If this works, then we'll have to change CMake so that it uses this procedure instead.
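For reference, a minimal PBS job script wrapping those `aprun` commands might look like the following sketch (the `#PBS` directives are standard PBS; the job name and walltime are assumptions, while the paths and project code are the ones above):

```shell
#!/bin/bash --login
#PBS -N hjcfit_tests
#PBS -l select=1,walltime=0:20:0
#PBS -A ecse0506

# run from the directory the job was submitted from
cd $PBS_O_WORKDIR

# C++ tests, one process each
aprun -n 1 /work/ecse0506/ecse0506/raquel/HJCFIT/build/likelihood/test_qmatrix
aprun -n 1 /work/ecse0506/ecse0506/raquel/HJCFIT/build/documentation/doc_cxx_approx_survivor
```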
It works for the C++ tests; I still need to check the Python ones.
Here's ARCHER's documentation on how to work with Python on the compute nodes.
To work on the login nodes we have used Anaconda, but this is not allowed for the compute nodes, since Anaconda installs everything in the /home filesystem, which is not visible from the compute nodes. We need to use a native python distribution instead. (It is not entirely impossible to use Anaconda on the compute nodes, but it's not recommended and not optimised.)
To load it: `module load python-compute`. This will load several packages (to see which are available, type `module load pc-` and hit tab), but it doesn't include behave, which we need for the tests.
To add packages to the native distribution, we'll need to create a virtual environment and pip install them.
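A sketch of what that could look like, assuming python-compute ships virtualenv and using a folder under $WORK so the compute nodes can see it (the env name and location are my assumptions):

```shell
module load python-compute
# create the env on /work, not /home
virtualenv $WORK/hjcfit-venv
source $WORK/hjcfit-venv/bin/activate
pip install behave
```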
I'm going to try that now and see what happens.
Finally managed to make the python stuff work in ARCHER.
I have put together a bash script to create a virtual environment with Anaconda, since that is the only way python tasks will work on the compute nodes. The script sets up a couple of environment variables that are needed:

- `$CONDA_ENVS_PATH` needs to be created manually and pointed at a folder in $WORK instead of its default under $HOME, so that it's visible from the compute nodes. This path is passed to `conda create` using the `-p` flag instead of `-n`.
- `pip install --user behave` puts `behave` in the right place because the script sets `$PYTHONUSERBASE` to the virtual environment's path.

Once this is done, we have to build and install HJCFIT as always, making sure the python packages are installed in the virtual env, i.e. `$CONDA_ENV_PATH/lib/python2.7/site-packages`.
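Putting those steps together, the script presumably does something along these lines (the env name and exact folder are assumptions; the variables and flags are the ones described above):

```shell
# keep conda environments on /work so the compute nodes can see them
export CONDA_ENVS_PATH=$WORK/conda-envs
mkdir -p $CONDA_ENVS_PATH

# create the env at an explicit path (-p) rather than by name (-n)
conda create -y -p $CONDA_ENVS_PATH/hjcfit python=2.7 numpy scipy

# make `pip install --user` target the env instead of $HOME
export PYTHONUSERBASE=$CONDA_ENVS_PATH/hjcfit
pip install --user behave
```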
Now we can run the python tests as part of jobs. There is a sample job file here that runs all the C++ and python tests on the compute nodes.
Note we can't use the Anaconda module I was using for the earlier tests on the login nodes, because the MOM nodes don't have it, so we have to use anaconda-compute/2.2.0-python2 instead. Also, ARCHER's documentation claims the anaconda modules are not optimised for the compute nodes, so we'll have to keep an eye on performance and maybe switch to the python-compute modules instead. Note we discarded that option because installing scipy with them is a bit of a nightmare.
I'll put this stuff in the wiki.
I don't really understand why we need to set $PYTHONUSERBASE. If pip is installed into the conda env, then surely pip install behave should install into that env? Otherwise it sounds like a bug in conda.
It was trying to install it in the main conda installation and complaining about permissions. I didn't find it very logical either. I can check again to make sure that's actually what happens.