Closed: raquelalegre closed this issue 8 years ago.
James found the project code: eCSE0506. The project doesn't appear in the list of projects in SAFE, though, so we can't request a new login to try the code. James is chasing Chris Johnson from EPCC about this.
We've all got new accounts for the project now and can get to an ARCHER login node, e.g.:
ssh raquel@login.archer.ac.uk
I'm trying to run HJCFIT on a login node before going any further. It doesn't quite work yet. This is what we've done so far:
Clone the development branch of HJCFIT's repo and create build dir:
git clone -b develop https://github.com/DCPROGS/HJCFIT.git
cd HJCFIT
mkdir build
cd build
Load latest cmake:
module load cmake/3.2.3
Load swig:
module load swig
Load python3 (we need Python 3.5+ so we can use the `@` matrix multiplication operator instead of `np.dot`):
module load anaconda-compute/2.2.0-python3
Locally install behave for the current user:
pip install --user behave
export PATH=$HOME/.local/bin/:$PATH
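As a quick aside on why Python 3 is needed: for 2-D arrays the `@` operator (Python 3.5+, NumPy 1.10+) computes the same matrix product as `np.dot`. A minimal illustration:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# `@` performs matrix multiplication, equivalent to np.dot for 2-D arrays
assert (A @ B == np.dot(A, B)).all()
```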
Swap in the GNU programming environment (for a C++11-capable compiler):
module unload PrgEnv-cray
module load PrgEnv-gnu/5.2.56
Clean build folder:
git clean -xdf
rm -rf external
Run cmake:
cmake ..
-- [NumPy] NPY_ARRAY_* macros exist = FALSE
-- [NumPy] PyArray_ENABLEFLAGS exists = FALSE
Run make:
make
[ 69%] Building CXX object likelihood/tests/CMakeFiles/test_asymptotes.dir/asymptotes.cc.o
Linking CXX executable ../test_asymptotes
/usr/bin/ld: attempted static link of dynamic object `../liblikelihood.so'
collect2: error: ld returned 1 exit status
make[2]: *** [likelihood/test_asymptotes] Error 1
make[1]: *** [likelihood/tests/CMakeFiles/test_asymptotes.dir/all] Error 2
make: *** [all] Error 2
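The `attempted static link of dynamic object` error is the Cray toolchain's static-by-default linking refusing to link against the shared `liblikelihood.so`. On Cray systems this is normally controlled by an environment variable, so I'm assuming the fix looks something like this (set before running cmake):

```shell
# ask the Cray compiler wrappers to link dynamically instead of statically
export CRAYPE_LINK_TYPE=dynamic
```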
For configuring a job to run the tests on ARCHER's compute nodes, we can look into what Jens did to run Zacros on ARCHER. I'm assuming it's this kind of stuff, @jenshnielsen?
The code now runs on an ARCHER login node. The required modules and environment configuration have been added in a new repo branch, see PR #82. This will need to be run before attempting to compile the code with cmake on the login nodes.
Once the code is compiled, we should be able to execute it as part of an ARCHER job, which is what we will test next. Everything seems to work on the login nodes, but that's no guarantee it will on the compute nodes as well.
Running the tests fails in the jobs and it's unclear why; I'm working on it at the moment. Jens showed me there's a way to run jobs interactively on ARCHER, which is much faster than submitting them to a queue. I'll add this to the documentation in the wiki, but just so I keep track of it, I'll paste it here:
qsub -q short -IVl select=1,walltime=0:5:0 -A ecse0506
- `-q short`: there is a short queue that runs 9am-5pm and accepts interactive jobs that need fewer than 8 nodes and less than 20 minutes to run.
- `-I` indicates the job is interactive.
- `-V` exports the user's environment (I think it runs `~/.bashrc`).
- `-l` is followed by the resource list:
  - `select=1` indicates one node will be used.
  - `walltime=0:10:0` indicates 10 minutes of time available for our job.
- `-A ecse0506`: the project code after `-A` indicates the budget the time/resources allocated should be charged to.

Once the command is run, we'll be dropped into the home directory of one of the MOM nodes. This will let us run `aprun` commands and test HJCFIT without having to wait for jobs to go through the standard queue.
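Putting that together, an interactive test session would look something like this sketch (the test binary path is the one used further down in this thread):

```shell
# request 1 node for 5 minutes on the short queue, interactively
qsub -q short -IVl select=1,walltime=0:5:0 -A ecse0506
# ...once the prompt comes back on a MOM node:
aprun -n 1 /work/ecse0506/ecse0506/raquel/HJCFIT/build/likelihood/test_qmatrix
```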
I could see what the problem mentioned earlier was by logging into the MOM nodes via the short interactive queue and running `cmake` with `aprun`. It turns out the version of `cmake` available on the MOM nodes and login nodes is 3.2.3, but the compute nodes where I was trying to run `cmake` are not meant for that (they have a lighter OS that only has cmake 2.6).
The right way of doing this is to build the HJCFIT code in my $WORK folder, then have the job run the commands for the individual tests with `aprun`, this way:
# C++ tests
aprun -n 1 /work/ecse0506/ecse0506/raquel/HJCFIT/build/likelihood/test_qmatrix
aprun -n 1 /work/ecse0506/ecse0506/raquel/HJCFIT/build/documentation/doc_cxx_approx_survivor
# Python tests (won't work on the compute nodes because home is not visible; we'll also need a venv)
aprun -n 1 /home/ecse0506/ecse0506/raquel/.local/bin/behave "/work/ecse0506/ecse0506/raquel/HJCFIT/likelihood/python/features/approx_survivor.feature" "--junit" "--junit-directory" "/work/ecse0506/ecse0506/raquel/HJCFIT/build/test-results/" "-q"
aprun -n 1 /work/y07/y07/cse/anaconda/2.2.0-python3/bin/python "/home/ecse0506/ecse0506/raquel/HJCFIT/documentation/code/approx_survivor.py"
Note this might need reviewing to decide which python module to use on the compute nodes. Behave is installed in my HOME folder, which is not visible from the compute nodes either; it should be in WORK, or maybe loaded from the job.
If this works, then we'll have to change CMake so that it uses this procedure instead.
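For reference, a minimal PBS job script wrapping those `aprun` commands might look like the following sketch (the `#PBS` directives are standard PBS; the job name and walltime are assumptions, while the paths and project code are the ones above):

```shell
#!/bin/bash --login
#PBS -N hjcfit_tests
#PBS -l select=1,walltime=0:20:0
#PBS -A ecse0506

# run from the directory the job was submitted from
cd $PBS_O_WORKDIR

# C++ tests, one process each
aprun -n 1 /work/ecse0506/ecse0506/raquel/HJCFIT/build/likelihood/test_qmatrix
aprun -n 1 /work/ecse0506/ecse0506/raquel/HJCFIT/build/documentation/doc_cxx_approx_survivor
```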
It works for the C++ tests; I still need to check the Python ones.
Here's ARCHER's documentation on how to work with Python on the compute nodes.
To work on the login nodes we have used Anaconda, but this is not allowed for the compute nodes, since Anaconda installs everything in the /home filesystem, which is not visible from the compute nodes. We need to use a native python distribution instead. (It is not entirely impossible to use Anaconda on the compute nodes, but it's not recommended and not optimised.)
To load it: `module load python-compute`. This will load several packages (to see which are available, type `module load pc-` and hit tab), but it doesn't include behave, which we need for the tests.
To add packages to the native distribution, we'll need to create a virtual environment and pip install them.
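A sketch of what that could look like, assuming python-compute ships virtualenv and using a folder under $WORK so the compute nodes can see it (the env name and location are my assumptions):

```shell
module load python-compute
# create the env on /work, not /home
virtualenv $WORK/hjcfit-venv
source $WORK/hjcfit-venv/bin/activate
pip install behave
```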
I'm going to try that now and see what happens.
Finally managed to make the python stuff work in ARCHER.
I have put together a bash script to create a virtual environment with Anaconda, since that is the only way python tasks will work on the compute nodes. The script sets up a couple of environment variables that are needed:

- `$CONDA_ENVS_PATH` needs to be created manually and pointed at a folder in $WORK instead of its default under $HOME, so that it's visible from the compute nodes. This path is passed to `conda create` using the `-p` flag instead of `-n`.
- `pip install --user behave` puts `behave` in the right place because the script sets `$PYTHONUSERBASE` to the virtual environment's path.

Once this is done, we have to build and install HJCFIT as always, making sure the python packages are installed in the virtual env, i.e. `$CONDA_ENV_PATH/lib/python2.7/site-packages`.
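Putting those steps together, the script presumably does something along these lines (the env name and exact folder are assumptions; the variables and flags are the ones described above):

```shell
# keep conda environments on /work so the compute nodes can see them
export CONDA_ENVS_PATH=$WORK/conda-envs
mkdir -p $CONDA_ENVS_PATH

# create the env at an explicit path (-p) rather than by name (-n)
conda create -y -p $CONDA_ENVS_PATH/hjcfit python=2.7 numpy scipy

# make `pip install --user` target the env instead of $HOME
export PYTHONUSERBASE=$CONDA_ENVS_PATH/hjcfit
pip install --user behave
```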
Now we can run the python tests as part of jobs. There is a sample job file here that runs all the C++ and python tests on the compute nodes.
Note we can't use the Anaconda module I was using for the earlier tests on the login nodes, because the MOM nodes don't have it, so we have to use anaconda-compute/2.2.0-python2 instead. Also, ARCHER's documentation claims the anaconda modules are not optimised for the compute nodes, so we'll have to keep an eye on performance and maybe switch to the python-compute modules instead. Note we discarded that option because installing scipy with them is a bit of a nightmare.
I'll put this stuff in the wiki.
I don't really understand why we need to set $PYTHONUSERBASE. If pip is installed into the conda env, then surely pip install behave should install into that env? Otherwise it sounds like a bug in conda.
It was trying to install it in the main conda installation and complaining about permissions. I didn't find it very logical either. I can check again to make sure that's actually what happens.