erdc / proteus

A computational methods and simulation toolkit
http://proteustoolkit.org
MIT License

Installation on copper - errors #338

Closed adimako closed 8 years ago

adimako commented 8 years ago

@cekees @tridelat as discussed, we get the following error while building scipy on copper. At some point it says to recompile using the -fPIC option, which we have done, but we still get the same error:

[scipy] scipy/integrate/quadpack.h:804: warning: call to function 'dqawce_' without a real prototype
[scipy] scipy/integrate/_quadpack.h:60: note: 'dqawce' was declared here
[scipy] scipy/integrate/quadpack.h: In function 'quadpack_qawse':
[scipy] scipy/integrate/quadpack.h:884: warning: call to function 'dqawse_' without a real prototype
[scipy] scipy/integrate/_quadpack.h:59: note: 'dqawse' was declared here
[scipy] scipy/integrate/quadpack.h:891: warning: call to function 'dqawse_' without a real prototype
[scipy] scipy/integrate/_quadpack.h:59: note: 'dqawse' was declared here
[scipy] /usr/bin/gfortran -Wall -fPIC -shared -Wl,-rpath=/lustre/usr/local/u/adimako/.hashdist/bld/python/vgl7ugq3lahf/lib -Wl,-rpath=/opt/acml/5.3.1/gfortran64/lib -L/lustre/usr/local/u/adimako/.hashdist/bld/python/vgl7ugq3lahf/lib/python2.7/config -lpython2.7 -lpthread -ldl -lutil -lm -Xlinker -export-dynamic build/temp.linux-x86_64-2.7/scipy/integrate/_quadpackmodule.o -Lbuild/temp.linux-x86_64-2.7 -lquadpack -llinpack_lite -lmach -lgfortran -o build/lib.linux-x86_64-2.7/scipy/integrate/_quadpack.so
[scipy] /usr/lib64/gcc/x86_64-suse-linux/4.3/../../../../x86_64-suse-linux/bin/ld: /usr/lib64/gcc/x86_64-suse-linux/4.3/libgfortran.a(stop.o): relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a shared object; recompile with -fPIC
[scipy] /usr/lib64/gcc/x86_64-suse-linux/4.3/libgfortran.a: could not read symbols: Bad value
[scipy] collect2: ld returned 1 exit status
[scipy] /usr/lib64/gcc/x86_64-suse-linux/4.3/../../../../x86_64-suse-linux/bin/ld: /usr/lib64/gcc/x86_64-suse-linux/4.3/libgfortran.a(stop.o): relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a shared object; recompile with -fPIC
[scipy] /usr/lib64/gcc/x86_64-suse-linux/4.3/libgfortran.a: could not read symbols: Bad value
[scipy] collect2: ld returned 1 exit status
[scipy] error: Command "/usr/bin/gfortran -Wall -fPIC -shared -Wl,-rpath=/lustre/usr/local/u/adimako/.hashdist/bld/python/vgl7ugq3lahf/lib -Wl,-rpath=/opt/acml/5.3.1/gfortran64/lib -L/lustre/usr/local/u/adimako/.hashdist/bld/python/vgl7ugq3lahf/lib/python2.7/config -lpython2.7 -lpthread -ldl -lutil -lm -Xlinker -export-dynamic build/temp.linux-x86_64-2.7/scipy/integrate/_quadpackmodule.o -Lbuild/temp.linux-x86_64-2.7 -lquadpack -llinpack_lite -lmach -lgfortran -o build/lib.linux-x86_64-2.7/scipy/integrate/_quadpack.so" failed with exit status 1
[scipy|ERROR] Command '[u'/bin/bash', '_hashdist/build.sh']' returned non-zero exit status 1
[scipy|ERROR] command failed (code=1); raising
make: *** [/u/adimako/proteus/garnet.gnu/artifact.json] Error 127

cekees commented 8 years ago

What is the commit ID of the stack?

adimako commented 8 years ago

PROTEUS : /u/adimako/proteus
PROTEUS_ARCH : garnet.gnu
PROTEUS_PREFIX : /u/adimako/proteus/garnet.gnu
PROTEUS_VERSION : f9f37d0de2834eb038171fb47d0a854de443f9c9
HASHDIST_VERSION : 71d335be9ee04e3cc9a9df92a9348a2d8e3ed607
HASHSTACK_VERSION: ed55f4e10f07eb0b85fa6a0d15f4d0e5104902c0

cekees commented 8 years ago

I think you need to pull the stable/copper branches of both proteus and hashdist unless you've done a merge locally. The latest commits on those branches are:
Proteus: https://github.com/erdc-cm/proteus/commit/4bd41f16fda9783ead3ece830ad4ea459be82990
hashstack: https://github.com/hashdist/hashstack/commit/4e8a64e519678664a917b0bda4132cf18437f593

adimako commented 8 years ago

@cekees this solved things, thanks. The packages are now installed. However, during compilation of proteus I get the following errors:

cd stack && /u/adimako/proteus/hashdist/bin/hit develop  -v -f -k error default.yaml /u/adimako/proteus/garnet.gnu
launcher:Unable to launch '/lustre/usr/local/u/adimako/proteus/garnet.gnu/bin/../../../../../lustre/usr/local/u/adimako/.hashdist/bld/python/3oefiwa4r63i/bin/python2.7' (No such file or directory)
make: *** [/u/adimako/proteus/garnet.gnu/artifact.json] Error 127

I have seen this before, and I think it is caused by broken links in $PROTEUS_ARCH. Listing the files in the garnet.gnu/bin folder:

drwxr----- 2 adimako 0089JR40  4096 Feb 18 11:51 .
drwxr----- 6 adimako 0089JR40  4096 Feb 18 11:51 ..
lrwxrwxrwx 1 adimako 0089JR40    84 Feb 18 11:51 2to3 -> ../../../../../lustre/usr/local/u/adimako/.hashdist/bld/python/3oefiwa4r63i/bin/2to3
lrwxrwxrwx 1 adimako 0089JR40    84 Feb 18 11:51 idle -> ../../../../../lustre/usr/local/u/adimako/.hashdist/bld/python/3oefiwa4r63i/bin/idle
-rwxr-xr-x 1 adimako 0089JR40 10552 Feb 18 11:51 launcher
lrwxrwxrwx 1 adimako 0089JR40    85 Feb 18 11:51 pydoc -> ../../../../../lustre/usr/local/u/adimako/.hashdist/bld/python/3oefiwa4r63i/bin/pydoc
lrwxrwxrwx 1 adimako 0089JR40     7 Feb 18 11:51 python -> python2
lrwxrwxrwx 1 adimako 0089JR40     9 Feb 18 11:51 python2 -> python2.7
lrwxrwxrwx 1 adimako 0089JR40     8 Feb 18 11:51 python2.7 -> launcher
lrwxrwxrwx 1 adimako 0089JR40    96 Feb 18 11:51 python2.7-config -> ../../../../../lustre/usr/local/u/adimako/.hashdist/bld/python/3oefiwa4r63i/bin/python2.7-config
-rw-r----- 1 adimako 0089JR40    89 Feb 18 11:51 python2.7.link
lrwxrwxrwx 1 adimako 0089JR40    16 Feb 18 11:51 python2-config -> python2.7-config
lrwxrwxrwx 1 adimako 0089JR40    14 Feb 18 11:51 python-config -> python2-config
lrwxrwxrwx 1 adimako 0089JR40    88 Feb 18 11:51 smtpd.py -> ../../../../../lustre/usr/local/u/adimako/.hashdist/bld/python/3oefiwa4r63i/bin/smtpd.py

Most of these links are broken because the paths do not exist. I could fix them, but I am not sure how launcher works. I will look at the README file in hashdist.

adimako commented 8 years ago

I think it is because the / folder contains a link, /u, to the folder /usr/local/u, and this causes the confusion. I have seen similar issues on hydra as well. For now I will hardcode the links manually, but we should find a more consistent solution to this.
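To illustrate the suspected mechanism, here is a minimal Python sketch. It assumes (the thread does not confirm this) that the relative links were computed lexically from the symlinked path rather than the physical one. The directory names (`user`, `arch`) and the function name are hypothetical stand-ins built in a temporary directory, not the real layout on copper.

```python
import os
import tempfile


def demo_relative_link_through_symlink():
    """Show how a lexically computed relative link breaks behind a symlinked directory."""
    root = os.path.realpath(tempfile.mkdtemp())

    # Physical layout, standing in for /lustre/usr/local/u/<user>/...
    real_u = os.path.join(root, "lustre", "usr", "local", "u")
    target = os.path.join(real_u, "user", ".hashdist", "bld", "python", "bin", "python2.7")
    real_bin = os.path.join(real_u, "user", "proteus", "arch", "bin")
    os.makedirs(os.path.dirname(target))
    os.makedirs(real_bin)
    open(target, "w").close()

    # Shortcut link, standing in for /u -> /lustre/usr/local/u
    os.symlink(real_u, os.path.join(root, "u"))
    logical_bin = os.path.join(root, "u", "user", "proteus", "arch", "bin")

    # os.path.relpath is purely lexical: it counts the components of the
    # path it is given and never looks at the filesystem.
    naive = os.path.relpath(target, start=logical_bin)
    # Resolving the start directory first yields a link that survives.
    robust = os.path.relpath(target, start=os.path.realpath(logical_bin))

    naive_link = os.path.join(logical_bin, "python-naive")
    robust_link = os.path.join(logical_bin, "python-robust")
    os.symlink(naive, naive_link)
    os.symlink(robust, robust_link)

    # os.path.exists follows the link, so it reports whether the link resolves.
    return naive, os.path.exists(naive_link), robust, os.path.exists(robust_link)


if __name__ == "__main__":
    naive, naive_ok, robust, robust_ok = demo_relative_link_through_symlink()
    print("naive :", naive, "resolves:", naive_ok)
    print("robust:", robust, "resolves:", robust_ok)
```

The naive link climbs five levels, matching the five components of the logical path, but the link physically lives eight levels deep, so it lands in the wrong directory. This is consistent with the `../../../../../lustre/...` targets in the listing above.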

adimako commented 8 years ago

If I remove the garnet.gnu folder, I get the following error from the hit command:

[ERROR] [Errno 17] File exists in silent_absolute_symlink('/lustre/usr/local/u/adimako/.hashdist/bld/python/3oefiwa4r63i/bin/2to3', u'/u/adimako/proteus/garnet.gnu/bin/2to3')
[profile|ERROR] hit command failed: [Errno 17] [Errno 17] File exists in silent_absolute_symlink('/lustre/usr/local/u/adimako/.hashdist/bld/python/3oefiwa4r63i/bin/2to3', u'/u/adimako/proteus/garnet.gnu/bin/2to3')

cekees commented 8 years ago

Meaning you did make distclean and then make develop and you now get that error? What is the output of echo $HOME?

adimako commented 8 years ago

Yep, setting $HOME to the actual path, not the symbolic link, works. I think hashdist picks up that something is wrong: File exists in silent_absolute_symlink('/lustre/usr/local/u/adimako/.hashdist/bld/python/3oefiwa4r63i/bin/2to3', u'/u/adimako/proteus/garnet.gnu/bin/2to3') but then decides to go with a path anyway.

Is there a way to set this manually, e.g. by using an env variable that bypasses the automated procedure?

adimako commented 8 years ago

Or maybe it does not pick it up. But it does show two different paths for the same home folder, and this may cause the issue.
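A quick way to check whether the home folder is reached through a symbolic link is to compare $HOME with its resolved path. The sketch below is only a diagnostic under that assumption; the helper name `physical_home` is hypothetical and is not part of hashdist or proteus.

```python
import os


def physical_home(environ=None):
    """Return (HOME as set, HOME with all symbolic links resolved)."""
    environ = os.environ if environ is None else environ
    home = environ.get("HOME", os.path.expanduser("~"))
    return home, os.path.realpath(home)


if __name__ == "__main__":
    logical, physical = physical_home()
    if logical != physical:
        # Suggest the export line that would make both paths agree.
        print("HOME goes through a symbolic link; consider:")
        print("export HOME=" + physical)
    else:
        print("HOME is already a physical path: " + physical)
```

If the two paths differ, exporting the resolved one before building is the workaround that was reported to work in this thread.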

cekees commented 8 years ago

Just to make sure I'm clear: You did something like export HOME=/lustre/usr/local/u/adimako and the build completed successfully? @zhang-alvin was having the same error yesterday on another cluster, but I'm not sure how he resolved it. On his machine we found that inside hashdist we were getting a contradiction in this bit of code

    try:
        os.symlink(os.path.abspath(src), dst)
    except OSError:
        if not os.path.exists(dst):
            raise

OSError was being raised with a code equivalent to "File Exists" but os.path.exists(dst) returned false.
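One plausible explanation, offered as an assumption rather than a confirmed diagnosis, is a dangling symbolic link at dst. `os.symlink` fails with EEXIST because the link itself is there, while `os.path.exists` follows the link and reports False because its target is missing; `os.path.lexists` checks the link itself. A minimal sketch with hypothetical file names:

```python
import errno
import os
import tempfile


def demo_dangling_symlink():
    """Reproduce 'File exists' from os.symlink while os.path.exists says False."""
    workdir = tempfile.mkdtemp()
    dst = os.path.join(workdir, "2to3")

    # First link points at a target that does not exist, so it dangles.
    os.symlink(os.path.join(workdir, "missing-target"), dst)

    try:
        # A second attempt at the same destination fails: the link is present.
        os.symlink(os.path.join(workdir, "other-target"), dst)
        code = None
    except OSError as exc:
        code = exc.errno

    # exists() follows the link; lexists() inspects the link itself.
    return code == errno.EEXIST, os.path.exists(dst), os.path.lexists(dst)


if __name__ == "__main__":
    print(demo_dangling_symlink())
```

Under this assumption, stale links left in the bin folder by an earlier build would trigger exactly the re-raise seen above, and clearing them (or testing with `os.path.lexists`) would avoid it.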

adimako commented 8 years ago

Yep, I included the export HOME line you quoted in .bashrc and the compilation proceeded. Now I am trying to find out where the acml library is. (I have loaded the module but cannot find the library in LD_LIBRARY_PATH.)

cekees commented 8 years ago

Try module help acml or env | grep ACML after you load the module. I believe my LD_LIBRARY_PATH is set to the correct path in ~/.cekees/.cshrc, which you should be able to read.

adimako commented 8 years ago

Now I get an illegal instruction error in the partition test. I think it has to do with the first line, #!/usr/bin/env python. Run directly from the command line, this works normally and starts Python in console mode.

adimako commented 8 years ago

(acml lib problem solved)

adimako commented 8 years ago

Another thing to note is that my default login shell is bash, not csh/tcsh.

cekees commented 8 years ago

Are you running an interactive job on the back end? Basically on these HPC machines no proteus tests will run on the login nodes because the mpi subsystem is disabled. See the qi alias in my .cshrc for how to run an interactive job.

adimako commented 8 years ago

@cekees I see. Maybe it is worth submitting the cases directly to the cluster to see how it goes.

cekees commented 8 years ago

I think these issues have been resolved, right?

adimako commented 8 years ago

@cekees Yes, you can close it for now.