coin-or / CHiPPS-ALPS

This is the Abstract Library for Parallel Search (ALPS), the abstract base layer of the COIN-OR High Performance Parallel Search framework.
Eclipse Public License 1.0

Help building with MPI #35

Open krislock opened 4 months ago

krislock commented 4 months ago

Hello,

My apologies for troubling you with this. I've been trying to compile and run the ALPS example Knap with MPI.

Here are my steps:

  1. wget https://raw.githubusercontent.com/coin-or/coinbrew/master/coinbrew; chmod u+x coinbrew
  2. ./coinbrew --tests none build Alps@master --enable-static --disable-shared --with-mpi-cflags="$(pkg-config --cflags ompi)" --with-mpi-lflags="$(pkg-config --libs ompi)" MPICC=mpicc MPICXX=mpiCC

This results in the error:

configure:18697: checking for library MPI with separate link and compile checks
configure:18823: g++ -c -O2 -DNDEBUG  -I/opt/metis/el8/contrib/openmpi/openmpi-4.1.5-gcc-11.4.0-cuda-11.8/include/openmpi -I/opt/metis/el8/contrib/openmpi/openmpi-4.1.5-gcc-11.4.0-cuda-11.8/include/openmpi/opal/mca/event/libevent2022/libevent -I/opt/metis/el8/contrib/openmpi/openmpi-4.1.5-gcc-11.4.0-cuda-11.8/include/openmpi/opal/mca/event/libevent2022/libevent/include -pthread  conftest.cpp >&5
conftest.cpp:31:10: error: #include expects "FILENAME" or <FILENAME>
   31 | #include "#include "mpi.h""
      |          ^~~~~~~~~~~~~~

I then changed #include "#include "mpi.h"" in the Alps/configure file to #include "mpi.h" and tried again:

  1. rm -fr build/ dist/
  2. ./coinbrew --tests none build Alps@master --enable-static --disable-shared --with-mpi-cflags="$(pkg-config --cflags ompi)" --with-mpi-lflags="$(pkg-config --libs ompi)" MPICC=mpicc MPICXX=mpiCC

Now it successfully builds and installs.

Next, I tried to compile the Knap example.

  1. export LD_LIBRARY_PATH=/home/krislock/coin-or/dist/lib:$LD_LIBRARY_PATH
  2. cd build/Alps/master/examples/Knap/
  3. make

This results in the error:

for file in KnapMain.o KnapModel.o KnapNodeDesc.o KnapParams.o KnapSolution.o KnapTreeNode.o; do bla="$bla `echo $file`"; done; \
g++  -O2 -DNDEBUG  -o knap $bla `PKG_CONFIG_PATH=/home/krislock/coin-or/dist/lib/pkgconfig:/opt/metis/el8/contrib/openmpi/openmpi-4.1.5-gcc-11.4.0-cuda-11.8/lib/pkgconfig pkgconf --libs alps --static`  
KnapMain.o: In function `main.cold':
KnapMain.cpp:(.text.unlikely+0x1c): undefined reference to `vtable for AlpsKnowledgeBrokerSerial'
KnapMain.o: In function `main':
KnapMain.cpp:(.text.startup+0x3c): undefined reference to `vtable for AlpsKnowledgeBrokerSerial'
KnapMain.cpp:(.text.startup+0x46): undefined reference to `AlpsKnowledgeBrokerSerial::initializeSearch(int, char**, AlpsModel&, bool)'
KnapMain.cpp:(.text.startup+0xff): undefined reference to `AlpsKnowledgeBrokerSerial::rootSearch(AlpsTreeNode*)'
KnapMain.cpp:(.text.startup+0x3e0): undefined reference to `vtable for AlpsKnowledgeBrokerSerial'
collect2: error: ld returned 1 exit status
make: *** [Makefile:93: knap] Error 1

The undefined references to AlpsKnowledgeBrokerSerial imply that COIN_HAS_MPI is not set to 1, so KnapMain.cpp was compiled for the serial broker. However, dist/include/coin-or/AlpsConfig.h contains #define ALPS_HAS_MPI 1. So I renamed COIN_HAS_MPI to ALPS_HAS_MPI everywhere in the Alps/examples/Knap/KnapMain.cpp file and tried to compile again.

  1. make clean; make
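
The macro mismatch described above can be illustrated with a small sketch. It assumes that KnapMain.cpp picks its broker class with a preprocessor guard on COIN_HAS_MPI (inferred from the linker errors, not checked against the file). KNAP_USE_MPI and selectedBroker() are hypothetical names introduced here, and ALPS_HAS_MPI is defined by hand where the real build would get it from AlpsConfig.h. Accepting either macro name is one possible alternative to renaming it by hand:

```cpp
#include <string>

// Assumption: in a real build this comes from AlpsConfig.h.
// It is defined by hand here so the sketch is self-contained.
#define ALPS_HAS_MPI 1

// Treat either macro name as "MPI is available".
// KNAP_USE_MPI is a hypothetical name, not part of ALPS.
#if defined(ALPS_HAS_MPI) || defined(COIN_HAS_MPI)
#  define KNAP_USE_MPI 1
#else
#  define KNAP_USE_MPI 0
#endif

// Returns the name of the broker class the guarded code would select.
// Hypothetical helper, used only to make the selection observable.
inline std::string selectedBroker() {
#if KNAP_USE_MPI
    return "AlpsKnowledgeBrokerMPI";
#else
    return "AlpsKnowledgeBrokerSerial";
#endif
}
```

With only COIN_HAS_MPI tested and only ALPS_HAS_MPI defined, a guard like this would fall through to the serial branch, which would match the undefined references to AlpsKnowledgeBrokerSerial seen above.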

Now it compiles without error. However, when I run the executable, I get a segmentation fault.

[krislock@metis Knap]$ mpirun -np 2 ./knap -param knap.par 
[metis:681199:0:681199] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x38)
[metis:681198:0:681198] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x38)
==== backtrace (tid: 681199) ====
 0 0x000000000004eb50 killpg()  ???:0
 1 0x000000000042e948 AlpsSubTree::AlpsSubTree()  ???:0
 2 0x0000000000413d9e AlpsKnowledgeBroker::AlpsKnowledgeBroker()  ???:0
 3 0x0000000000407c14 main()  ???:0
 4 0x000000000003ad85 __libc_start_main()  ???:0
 5 0x000000000040827e _start()  ???:0
=================================
==== backtrace (tid: 681198) ====
 0 0x000000000004eb50 killpg()  ???:0
 1 0x000000000042e948 AlpsSubTree::AlpsSubTree()  ???:0
 2 0x0000000000413d9e AlpsKnowledgeBroker::AlpsKnowledgeBroker()  ???:0
 3 0x0000000000407c14 main()  ???:0
 4 0x000000000003ad85 __libc_start_main()  ???:0
 5 0x000000000040827e _start()  ???:0
=================================
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node metis exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Any help you can give would be greatly appreciated!

Nathan

krislock commented 4 months ago

I was able to get past the segmentation fault by changing AlpsKnowledgeBroker() to AlpsKnowledgeBroker(model) in Alps/src/AlpsKnowledgeBrokerMPI.h as follows:

AlpsKnowledgeBrokerMPI(int argc,
                       char* argv[],
                       AlpsModel& model,
                       bool showBanner = true)
    :
    AlpsKnowledgeBroker(model)
{
    init();
    initializeSearch(argc, argv, model, showBanner);
}

Note that this was using Alps@2.0 which does not have #include "#include "mpi.h"" in Alps/configure.
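
A possible explanation for why the constructor change helps, not verified against the ALPS source: if the default base-class constructor leaves the model unset, anything built during construction that reaches through it would dereference a null pointer, which would be consistent with the fault at the small address 0x38. The sketch below uses hypothetical stand-in classes (Model, Broker, BrokerMPI, BrokerMPIUnset), not the real ALPS types, to show the difference between the two initializer lists:

```cpp
#include <string>

// Hypothetical stand-ins for illustration; these are NOT the ALPS classes.
struct Model {
    std::string name = "knapsack";
};

class Broker {
public:
    // Default constructor: no model is attached.
    Broker() : model_(nullptr) {}
    // Constructor that records the model before any derived code runs.
    explicit Broker(Model& model) : model_(&model) {}
    virtual ~Broker() = default;

    // True if a model was attached during construction.
    bool hasModel() const { return model_ != nullptr; }

protected:
    Model* model_;
};

// Mirrors the patched form: the model is forwarded to the base class.
class BrokerMPI : public Broker {
public:
    explicit BrokerMPI(Model& model) : Broker(model) {}
};

// Mirrors the original form: the base class is default-constructed,
// so the model argument never reaches it.
class BrokerMPIUnset : public Broker {
public:
    explicit BrokerMPIUnset(Model&) : Broker() {}
};
```

Constructing both variants from the same Model shows that only the forwarding version ends up with a non-null model pointer in the base class.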

I built Alps as follows:

  1. export MPIINCDIR=<directory containing mpi.h>
  2. export MPILIB="$(pkg-config --libs ompi)"
  3. export MPICC=mpicc
  4. export MPICXX=mpiCC
  5. ./coinbrew fetch Alps@2.0
  6. Change Alps/src/AlpsKnowledgeBrokerMPI.h as mentioned above.
  7. ./coinbrew --tests none build Alps --enable-static --disable-shared

tkralphs commented 3 months ago

Sorry for the delay and for these issues; I have not used Alps with MPI in some time. I did try to get BLIS running with MPI fairly recently and eventually succeeded, but only with an older version, I believe. There is some discussion at https://github.com/coin-or/CHiPPS-BLIS/discussions/10. Anyway, it seems you got it working for now. I can dig into this further if you are still playing with it. Depending on what exactly you're doing, I can recommend a specific version that may work out of the box. The different versions are a bit confusing.