ComputationalRadiationPhysics / picongpu

Performance-Portable Particle-in-Cell Simulations for the Exascale Era :sparkles:
https://picongpu.readthedocs.io

Having trouble configuring case001 #1984

Closed 8i161029 closed 7 years ago

8i161029 commented 7 years ago

I just followed install.md to build PIConGPU. When configuring case001, the terminal showed:

cmake command: cmake -DCUDA_ARCH=sm_20 -DCMAKE_INSTALL_PREFIX=/home/lai/paramSets/case001 -DPIC_EXTENSION_PATH=/home/lai/paramSets/case001 /home/lai/src/picongpu
-- Found CUDA: /usr/local/cuda (found suitable version "8.0", minimum required is "5.5")
-- Compiling as C++11...
-- Debug version
-- Boost version: 1.57.0
-- Found the following Boost libraries:
--   program_options
--   regex
--   filesystem
--   system
--   thread
--   math_tr1
-- Found CUDA: /usr/local/cuda (found suitable version "8.0", minimum required is "5.0")
-- Boost version: 1.57.0
-- Found CUDA: /usr/local/cuda (found version "8.0")
-- Could NOT find NVML (missing: NVML_LIBRARY)
-- Boost version: 1.57.0
-- Found the following Boost libraries:
--   program_options
-- Found 'adios_config': /usr/bin/adios_config
-- The directory provided by 'adios_config -d' does not exist:
-- Could NOT find ADIOS (missing: ADIOS_LIBRARIES ADIOS_INCLUDE_DIRS) (Required is at least version "1.10.0")
-- libSplash supports PARALLEL output
-- Could NOT find Freetype (missing: FREETYPE_LIBRARY FREETYPE_INCLUDE_DIRS)
CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the CMake files:
PNGwriter_LIBRARIES
    linked by target "picongpu" in directory /home/lai/src/picongpu/src/picongpu

-- Configuring incomplete, errors occurred!
See also "/home/lai/build/CMakeFiles/CMakeOutput.log".
See also "/home/lai/build/CMakeFiles/CMakeError.log".

I have installed CUDA, ADIOS, and PNGwriter. I don't know why they cannot be found. Please help me finish it.

ax3l commented 7 years ago

Thank you for your report! :sparkles:

To reproduce the issue can you please provide the following additional information about your system:

ax3l commented 7 years ago

As a first guess into the blue, it looks to me like Freetype (a library for writing text in libpng, which we don't necessarily need) was available when PNGwriter was compiled but is no longer available while building PIConGPU.

8i161029 commented 7 years ago

My system is Ubuntu 16.04, the cmake version is 3.5.1, and the g++ version is 4.9.3. I tried to install Freetype and configured again. Here is what the terminal showed:

cmake command: cmake -DCUDA_ARCH=sm_20 -DCMAKE_INSTALL_PREFIX=/home/lai/paramSets/case001 -DPIC_EXTENSION_PATH=/home/lai/paramSets/case001 /home/lai/src/picongpu
-- Found CUDA: /usr/local/cuda (found suitable version "8.0", minimum required is "5.5")
-- Compiling as C++11...
-- Debug version
-- Boost version: 1.57.0
-- Found the following Boost libraries:
--   program_options
--   regex
--   filesystem
--   system
--   thread
--   math_tr1
-- Found CUDA: /usr/local/cuda (found suitable version "8.0", minimum required is "5.0")
-- Boost version: 1.57.0
-- Found CUDA: /usr/local/cuda (found version "8.0")
-- Could NOT find NVML (missing: NVML_LIBRARY)
-- Boost version: 1.57.0
-- Found the following Boost libraries:
--   program_options
-- Found 'adios_config': /usr/bin/adios_config
-- The directory provided by 'adios_config -d' does not exist:
-- Could NOT find ADIOS (missing: ADIOS_LIBRARIES ADIOS_INCLUDE_DIRS) (Required is at least version "1.10.0")
-- libSplash supports PARALLEL output
CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the CMake files:
PNGwriter_LIBRARIES
    linked by target "picongpu" in directory /home/lai/src/picongpu/src/picongpu

-- Configuring incomplete, errors occurred!
See also "/home/lai/build/CMakeFiles/CMakeOutput.log".
See also "/home/lai/build/CMakeFiles/CMakeError.log".

Is the NVML a part of CUDA toolkit? And how to completely install ADIOS and PNGwriter? Sorry about many questions. I am a beginner in linux.

ax3l commented 7 years ago

Thanks for the update!

Is the NVML a part of CUDA toolkit?

Yes, it should be since CUDA 8.0, but it is optional. The actual error you get starts with the line CMake Error: ... for png.

And how to completely install ADIOS [...]?

ADIOS is optional as well. You can use it instead of libSplash (HDF5), but it's also not the problem you face here.

I tried to install Freetype and configured again. And how to completely install [...] PNGwriter?

Where did you install Freetype? In $HOME/src/freetype? It's not necessary, but if you do install it, make sure to point to it properly via environment variables:

export PNGWRITER_ROOT=$HOME/src/pngwriter
export FREETYPE_DIR=$HOME/src/freetype
export LD_LIBRARY_PATH=$PNGWRITER_ROOT/lib:$FREETYPE_DIR/lib:$LD_LIBRARY_PATH

Last question:

Which exact PIConGPU version did you use? Did you apply modifications?

ax3l commented 7 years ago

It looks to me like you are running an old version of PIConGPU (which one?) that produces a confusing error message for an outdated version of PNGwriter (instead of saying it can't be used or is outdated).

The problem might be that your PNGwriter version is simply outdated (which version of PNGwriter do you use?).

8i161029 commented 7 years ago

I got PNGwriter and PIConGPU by the commands in install.md:

git clone https://github.com/pngwriter/pngwriter.git
git clone https://github.com/ComputationalRadiationPhysics/picongpu.git $PICHOME/src/picongpu

Did I install an old version?

As for Freetype, when I rebooted Ubuntu the launcher and menu disappeared. I tried to open CompizConfig Settings Manager with $ ccsm, but there was

ImportError: /usr/lib/x86_64-linux-gnu/libharfbuzz.so.0: undefined symbol: FT_Reference_Face

I found that FT_Reference_Face is a part of Freetype. So I uninstalled Freetype and ran $ ccsm again. Then the problem was solved. So I don't have Freetype now.

ax3l commented 7 years ago

Wonderful! Yes, we don't need freetype so you are good to go now! :)

Feel free to open further issues if you encounter any other troubles.

8i161029 commented 7 years ago

Sorry... the problem that was solved above is the launcher and menu disappearing on the desktop. I still have the same problem configuring case001.

ax3l commented 7 years ago

Ah, you are running on a plain desktop environment with a single GPU?

Usually, running an X server and CUDA at the same time is not the greatest combination, since both require a lot of GPU memory (PIConGPU in particular takes all that is free), and in old X server/CUDA combinations the maximum runtime of kernels is limited by the X server.

You could do the following things:

What kind of GPU are you using? The default examples without modification might be a bit memory hungry (since we work on Tesla Server GPUs). So you might want to reduce simulation size and number of particles per cell a little.

ax3l commented 7 years ago

Since you have uninstalled Freetype now, remove your install of PNGwriter and install it again to solve that issue.

ax3l commented 7 years ago

@8i161029 did you make progress by re-installing PNGwriter with your new environment?

ax3l commented 7 years ago

Otherwise, just remove PNGwriter and continue without it; writing HDF5 will be good enough.

8i161029 commented 7 years ago

I removed the following directories : /$HOME/lib/pngwriter , /$HOME/src/pngwriter, and /$HOME/pngwriter Then configured again, but the terminal still showed the same message. Did I remove all the pngwriter? Or how to completely remove?

8i161029 commented 7 years ago

And during installing pngwriter, terminal showed that could not found freetype. Is freetype necessary to install pngwriter? Last week I installed freetype and then my desktop disappeared, so I uninstalled it. If it is necessary, how to install it without problem? Thanks!

ax3l commented 7 years ago

I removed the following directories : /$HOME/lib/pngwriter and /$HOME/pngwriter and configured again, but the terminal still showed the same message. Did I remove all the pngwriter? Or how to completely remove?

If you followed our guide, $HOME/lib/pngwriter is the only install. Your PIConGPU configure call should now show pngwriter as not found (but that is not an error) and you should be able to continue. Can you paste its output here?

And during installing pngwriter, terminal showed that could not found freetype. Is freetype necessary to install pngwriter? Thanks!

No, it's not necessary.

8i161029 commented 7 years ago

Here is the output:

cmake command: cmake -DCUDA_ARCH=sm_20 -DCMAKE_INSTALL_PREFIX=/home/lai/paramSets/case001 -DPIC_EXTENSION_PATH=/home/lai/paramSets/case001 /home/lai/src/picongpu
-- Found CUDA: /usr/local/cuda (found suitable version "8.0", minimum required is "5.5")
-- Compiling as C++11...
-- Debug version
-- Boost version: 1.57.0
-- Found the following Boost libraries:
--   program_options
--   regex
--   filesystem
--   system
--   thread
--   math_tr1
-- Found CUDA: /usr/local/cuda (found suitable version "8.0", minimum required is "5.0")
-- Boost version: 1.57.0
-- Found CUDA: /usr/local/cuda (found version "8.0")
-- Could NOT find NVML (missing: NVML_LIBRARY)
-- Boost version: 1.57.0
-- Found the following Boost libraries:
--   program_options
-- Found 'adios_config': /usr/bin/adios_config
-- The directory provided by 'adios_config -d' does not exist:
-- Could NOT find ADIOS (missing: ADIOS_LIBRARIES ADIOS_INCLUDE_DIRS) (Required is at least version "1.10.0")
-- libSplash supports PARALLEL output
-- Could NOT find Freetype (missing: FREETYPE_LIBRARY FREETYPE_INCLUDE_DIRS)
CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the CMake files:
PNGwriter_LIBRARIES
    linked by target "picongpu" in directory /home/lai/src/picongpu/src/picongpu

-- Configuring incomplete, errors occurred!
See also "/home/lai/build/CMakeFiles/CMakeOutput.log".
See also "/home/lai/build/CMakeFiles/CMakeError.log".

ax3l commented 7 years ago

Can you please post the output of:

echo $PNGWRITER_ROOT
ls $PNGWRITER_ROOT
ls $PNGWRITER_ROOT/lib*

echo $CMAKE_PREFIX_PATH

8i161029 commented 7 years ago

Here is the output:

echo $PNGWRITER_ROOT
/home/lai/lib/pngwriter

ls $PNGWRITER_ROOT
CHANGELOG.md examples lib make.include.linux src CMakeLists.txt fonts Makefile make.include.osx tests doc include make.include README.md

ls $PNGWRITER_ROOT/lib*
libpngwriter.a libpngwriter.so

echo $CMAKE_PREFIX_PATH


Output of the command "echo $CMAKE_PREFIX_PATH" is empty. Thanks.

ax3l commented 7 years ago

The install in $PNGWRITER_ROOT is broken, as it also contains the sources.

ls $PNGWRITER_ROOT should only show lib/ and include/.

Please remove the install

rm -rf $PNGWRITER_ROOT

and if you try to install PNGwriter again, check the source out in $HOME/src/pngwriter, build it in $HOME/build, and install it to $PNGWRITER_ROOT, which is $HOME/lib/pngwriter.
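
The layout check described here can be sketched as a small shell function. This is only an illustration: the name check_install_prefix is made up for this thread, and accepting lib64 next to lib and include is an assumption, not something PNGwriter or PIConGPU prescribe.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report every entry in an install prefix that is not
# lib/, lib64/ (assumption) or include/. Returns 0 only for a clean prefix.
check_install_prefix() {
    local prefix="$1" entry name rc=0
    if [ ! -d "$prefix" ]; then
        echo "missing: $prefix"
        return 2
    fi
    for entry in "$prefix"/*; do
        [ -e "$entry" ] || continue      # empty directory: the glob matched nothing
        name="$(basename "$entry")"
        case "$name" in
            lib|lib64|include) ;;        # what a clean install should contain
            *) echo "unexpected: $name"; rc=1 ;;
        esac
    done
    return "$rc"
}

# Example:
#   check_install_prefix "$PNGWRITER_ROOT"
```

Run against the listing posted above, this would flag src, examples, CMakeLists.txt and the other source files as unexpected.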

8i161029 commented 7 years ago

I ran "rm -rf $PNGWRITER_ROOT" and configured, but it failed. So I installed PNGwriter again. Here is the log:

lai@lai-G56JR:~/build$ cmake -DCMAKE_INSTALL_PREFIX=$HOME/lib/pngwriter ~/src/pngwriter
-- Found ZLIB: /usr/lib/x86_64-linux-gnu/libz.so (found version "1.2.11") 
-- Found PNG: /usr/lib/x86_64-linux-gnu/libpng.so (found suitable version "1.2.54", minimum required is "1.2.9") 
-- Could NOT find Freetype (missing:  FREETYPE_LIBRARY FREETYPE_INCLUDE_DIRS) 
-- Configuring done
-- Generating done
-- Build files have been written to: /home/lai/build
lai@lai-G56JR:~/build$ make install
Scanning dependencies of target pngwriter
[  6%] Building CXX object CMakeFiles/pngwriter.dir/src/pngwriter.cc.o
[ 13%] Linking CXX shared library libpngwriter.so
[ 20%] Built target pngwriter
Scanning dependencies of target pngwriter_static
[ 26%] Building CXX object CMakeFiles/pngwriter_static.dir/src/pngwriter.cc.o
[ 33%] Linking CXX static library libpngwriter.a
[ 33%] Built target pngwriter_static
Scanning dependencies of target pngtest
[ 40%] Building CXX object CMakeFiles/pngtest.dir/examples/pngtest.cc.o
[ 46%] Linking CXX executable pngtest
[ 46%] Built target pngtest
Scanning dependencies of target lyapunov
[ 53%] Building CXX object CMakeFiles/lyapunov.dir/examples/lyapunov.cc.o
[ 60%] Linking CXX executable lyapunov
[ 60%] Built target lyapunov
Scanning dependencies of target diamond
[ 66%] Building CXX object CMakeFiles/diamond.dir/tests/diamond.cc.o
[ 73%] Linking CXX executable diamond
[ 73%] Built target diamond
Scanning dependencies of target blackwhite
[ 80%] Building CXX object CMakeFiles/blackwhite.dir/tests/blackwhite.cc.o
[ 86%] Linking CXX executable blackwhite
[ 86%] Built target blackwhite
Scanning dependencies of target readwrite
[ 93%] Building CXX object CMakeFiles/readwrite.dir/tests/readwrite.cc.o
[100%] Linking CXX executable readwrite
[100%] Built target readwrite
Linking CXX shared library CMakeFiles/CMakeRelink.dir/libpngwriter.so
Install the project...
-- Install configuration: ""
-- Installing: /home/lai/lib/pngwriter/lib/libpngwriter.so
-- Installing: /home/lai/lib/pngwriter/lib/libpngwriter.a
-- Installing: /home/lai/lib/pngwriter/include/pngwriter.h

Then I configured again, but it still had an error. I build PNGwriter in $HOME/build, and also configure case001 in $HOME/build. Is this a problem I do both things in the same directory? Did I do anything wrong? Teach me please. Thank you.

ax3l commented 7 years ago

Is this a problem I do both things in the same directory?

Yes it might be. Did you clean this temporary build directory in between? Just run

lai@lai-G56JR:~/build$ rm -rf ../build/*

before running pic-configure
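
One way to keep stale CMake files of one project from leaking into another is to give each project its own, freshly emptied build directory. A minimal sketch, assuming a hypothetical directory such as $HOME/build/pngwriter (this thread uses a single $HOME/build); fresh_build_dir is a made-up helper name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: (re)create an empty build directory.
fresh_build_dir() {
    local dir="$1"
    # Refuse an empty argument instead of calling rm -rf without a target.
    [ -n "$dir" ] || { echo "usage: fresh_build_dir DIR" >&2; return 1; }
    rm -rf "$dir"     # also removes hidden files, unlike "rm -rf DIR/*"
    mkdir -p "$dir"
}

# Example (the build path is an assumption, adjust to your setup):
#   fresh_build_dir "$HOME/build/pngwriter" && cd "$HOME/build/pngwriter"
#   cmake -DCMAKE_INSTALL_PREFIX=$HOME/lib/pngwriter ~/src/pngwriter
```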

8i161029 commented 7 years ago

Oh no, I never cleaned the build directory before I installed these optional libraries... I only deleted CMakeCache.txt when I met "the source ... does not match the source ... used to generate cache". I may have to remove these libraries and reinstall them. So before I do anything in the build directory, I should clean it. I ran ~/build$ rm -rf ../build/* and configured. Here is the output:

cmake command: cmake -DCUDA_ARCH=sm_20 -DCMAKE_INSTALL_PREFIX=/home/lai/paramSets/case001 -DPIC_EXTENSION_PATH=/home/lai/paramSets/case001 /home/lai/src/picongpu
-- The C compiler identification is GNU 4.9.3
-- The CXX compiler identification is GNU 4.9.3
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found CUDA: /usr/local/cuda (found suitable version "8.0", minimum required is "5.5")
-- Compiling as C++11...
-- Debug version
-- Found MPI_C: /usr/local/openmpi/lib/libmpi.so
-- Found MPI_CXX: /usr/local/openmpi/lib/libmpi.so
-- Found ZLIB: /usr/local/lib/libz.so (found version "1.2.11")
-- Boost version: 1.57.0
-- Found the following Boost libraries:
--   program_options
--   regex
--   filesystem
--   system
--   thread
--   math_tr1
-- Try OpenMP C flag = [-fopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Success
-- Try OpenMP CXX flag = [-fopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Success
-- Found OpenMP: -fopenmp
-- Found CUDA: /usr/local/cuda (found suitable version "8.0", minimum required is "5.0")
-- Boost version: 1.57.0
-- Using mallocMC from thirdParty/ directory
-- Boost version: 1.57.0
-- Found mallocMC: /home/lai/src/picongpu/thirdParty/mallocMC/src (found suitable version "2.2.0", minimum required is "2.2.0")
-- Found CUDA: /usr/local/cuda (found version "8.0")
-- Could NOT find NVML (missing: NVML_LIBRARY)
-- Boost version: 1.57.0
-- Found the following Boost libraries:
--   program_options
-- Found 'adios_config': /usr/bin/adios_config
-- The directory provided by 'adios_config -d' does not exist:
-- Could NOT find ADIOS (missing: ADIOS_LIBRARIES ADIOS_INCLUDE_DIRS) (Required is at least version "1.10.0")
-- Found HDF5: /home/lai/lib/hdf5/lib/libhdf5.so;/usr/local/lib/libz.so;/usr/lib/x86_64-linux-gnu/libdl.so;/usr/lib/x86_64-linux-gnu/libm.so (found version "1.8.14")
-- libSplash supports PARALLEL output
-- Found Splash: /home/lai/lib/splash/lib/libsplash.a;/home/lai/lib/hdf5/lib/libhdf5.so;/usr/local/lib/libz.so;/usr/lib/x86_64-linux-gnu/libdl.so;/usr/lib/x86_64-linux-gnu/libm.so;/usr/local/openmpi/lib/libmpi.so;/usr/local/openmpi/lib/libmpi.so (found suitable version "1.6.0", minimum required is "1.6.0")
-- Found PNG: /usr/lib/x86_64-linux-gnu/libpng.so (found suitable version "1.2.54", minimum required is "1.2.9")
-- Could NOT find Freetype (missing: FREETYPE_LIBRARY FREETYPE_INCLUDE_DIRS)
-- Found PNGwriter: PNGwriter_LIBRARIES-NOTFOUND;/usr/lib/x86_64-linux-gnu/libpng.so;/usr/local/lib/libz.so (found suitable version "0.5.6", minimum required is "0.5.6")
CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the CMake files:
PNGwriter_LIBRARIES
    linked by target "picongpu" in directory /home/lai/src/picongpu/src/picongpu
-- Configuring incomplete, errors occurred!
See also "/home/lai/build/CMakeFiles/CMakeOutput.log".
See also "/home/lai/build/CMakeFiles/CMakeError.log".

Please teach me the next step I need to do. Should I install the libraries again? Thank you!

ax3l commented 7 years ago

That's the right way to build two independent projects with cmake (removing all temporary build artifacts), but it still did not solve your problem.

Let us circumvent it the hard way now: edit $PICSRC/src/picongpu/CMakeLists.txt and comment out this line by prefixing a #:

# find PNGwriter installation
# find_package(PNGwriter 0.5.6)

You will then not be able to use the png plugin for previews, but HDF5 or ADIOS output is more useful anyway.

Update: a cleaner way of disabling a "wrongly found" plugin would be to pass -c"-DCMAKE_DISABLE_FIND_PACKAGE_PNGwriter=TRUE" as an option to your configure call:

# PIConGPU <0.3.0
$PICSRC/configure -c"-DCMAKE_DISABLE_FIND_PACKAGE_PNGwriter=TRUE" ~/paramSets/case001

# PIConGPU >= 0.3.0
pic-configure -c"-DCMAKE_DISABLE_FIND_PACKAGE_PNGwriter=TRUE" ~/paramSets/case001

8i161029 commented 7 years ago

I entered $PICSRC/configure -c"-DCMAKE_DISABLE_FIND_PACKAGE_PNGwriter=TRUE" ~/paramSets/case001 and here is the output:

cmake command: cmake -DCUDA_ARCH=sm_20 -DCMAKE_INSTALL_PREFIX=/home/lai/paramSets/case001 -DPIC_EXTENSION_PATH=/home/lai/paramSets/case001 -DCMAKE_DISABLE_FIND_PACKAGE_PNGwriter=TRUE /home/lai/src/picongpu
-- The C compiler identification is GNU 4.9.3
-- The CXX compiler identification is GNU 4.9.3
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found CUDA: /usr/local/cuda (found suitable version "8.0", minimum required is "5.5")
-- Compiling as C++11...
-- Debug version
-- Found MPI_C: /usr/local/openmpi/lib/libmpi.so
-- Found MPI_CXX: /usr/local/openmpi/lib/libmpi.so
-- Found ZLIB: /usr/local/lib/libz.so (found version "1.2.11")
-- Boost version: 1.57.0
-- Found the following Boost libraries:
--   program_options
--   regex
--   filesystem
--   system
--   thread
--   math_tr1
-- Try OpenMP C flag = [-fopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Success
-- Try OpenMP CXX flag = [-fopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Success
-- Found OpenMP: -fopenmp
-- Found CUDA: /usr/local/cuda (found suitable version "8.0", minimum required is "5.0")
-- Boost version: 1.57.0
-- Using mallocMC from thirdParty/ directory
-- Boost version: 1.57.0
-- Found mallocMC: /home/lai/src/picongpu/thirdParty/mallocMC/src (found suitable version "2.2.0", minimum required is "2.2.0")
-- Found CUDA: /usr/local/cuda (found version "8.0")
-- Could NOT find NVML (missing: NVML_LIBRARY)
-- Boost version: 1.57.0
-- Found the following Boost libraries:
--   program_options
-- Found 'adios_config': /usr/bin/adios_config
-- The directory provided by 'adios_config -d' does not exist:
-- Could NOT find ADIOS (missing: ADIOS_LIBRARIES ADIOS_INCLUDE_DIRS) (Required is at least version "1.10.0")
-- Found HDF5: /home/lai/lib/hdf5/lib/libhdf5.so;/usr/local/lib/libz.so;/usr/lib/x86_64-linux-gnu/libdl.so;/usr/lib/x86_64-linux-gnu/libm.so (found version "1.8.14")
-- libSplash supports PARALLEL output
-- Found Splash: /home/lai/lib/splash/lib/libsplash.a;/home/lai/lib/hdf5/lib/libhdf5.so;/usr/local/lib/libz.so;/usr/lib/x86_64-linux-gnu/libdl.so;/usr/lib/x86_64-linux-gnu/libm.so;/usr/local/openmpi/lib/libmpi.so;/usr/local/openmpi/lib/libmpi.so (found suitable version "1.6.0", minimum required is "1.6.0")
-- Configuring done
-- Generating done
-- Build files have been written to: /home/lai/build

I thought the previous problem was solved. Then I entered make, but there was another error. Here is the log. It is a little bit long:

[ 8%] Building NVCC (Device) object build_picongpu/CMakeFiles/picongpu.dir/picongpu_generated_main.cu.o nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). /home/lai/boost_1_57_0/boost/smart_ptr/detail/sp_counted_base_gcc_x86.hpp(75): warning: variable "tmp" was set but never used /home/lai/boost_1_570/boost/mpl/map/aux/item.hpp(53): warning: "boost::mpl::aux::type_wrapper operator/(const boost::mpl::m_item<Key, T, Base> &, boost::mpl::aux::type_wrapper )" declares a non-template function -- add <> to refer to a template instance /home/lai/boost_1_570/boost/mpl/map/aux/item.hpp(54): warning: "boost::mpl::aux::type_wrapper<boost::mpl::pair<Key, T>> operator|(const boost::mpl::m_item<Key, T, Base> &, boost::mpl::next::type )" declares a non-template function -- add <> to refer to a template instance /home/lai/boost_1_570/boost/mpl/map/aux/item.hpp(55): warning: "char (&operator||(const boost::mpl::m_item<Key, T, Base> &, boost::mpl::aux::type_wrapper ))[boost::mpl::next::type::value]" declares a non-template function -- add <> to refer to a template instance /home/lai/boost_1_570/boost/mpl/map/aux/item.hpp(69): warning: "boost::mpl::aux::typewrapper<mpl::void_> operator/(const boost::mpl::m_mask<Key, Base> &, boost::mpl::aux::type_wrapper )" declares a non-template function -- add <> to refer to a template instance /home/lai/boost_1_570/boost/mpl/map/aux/item.hpp(70): warning: "boost::mpl::aux::typewrapper<mpl::void_> operator|(const boost::mpl::m_mask<Key, Base> &, boost::mpl::x_order_impl<Base, Key>::type )" declares a non-template function -- add <> to refer to a template instance /home/lai/boost_1_57_0/boost/smart_ptr/detail/sp_counted_base_gcc_x86.hpp(75): 
warning: variable "tmp" was set but never used /home/lai/boost_1_570/boost/mpl/map/aux/item.hpp(53): warning: "boost::mpl::aux::type_wrapper operator/(const boost::mpl::m_item<Key, T, Base> &, boost::mpl::aux::type_wrapper )" declares a non-template function -- add <> to refer to a template instance /home/lai/boost_1_570/boost/mpl/map/aux/item.hpp(54): warning: "boost::mpl::aux::type_wrapper<boost::mpl::pair<Key, T>> operator|(const boost::mpl::m_item<Key, T, Base> &, boost::mpl::next::type )" declares a non-template function -- add <> to refer to a template instance /home/lai/boost_1_570/boost/mpl/map/aux/item.hpp(55): warning: "char (&operator||(const boost::mpl::m_item<Key, T, Base> &, boost::mpl::aux::type_wrapper ))[boost::mpl::next::type::value]" declares a non-template function -- add <> to refer to a template instance /home/lai/boost_1_570/boost/mpl/map/aux/item.hpp(69): warning: "boost::mpl::aux::typewrapper<mpl::void_> operator/(const boost::mpl::m_mask<Key, Base> &, boost::mpl::aux::type_wrapper )" declares a non-template function -- add <> to refer to a template instance /home/lai/boost_1_570/boost/mpl/map/aux/item.hpp(70): warning: "boost::mpl::aux::typewrapper<mpl::void_> operator|(const boost::mpl::m_mask<Key, Base> &, boost::mpl::x_order_impl<Base, Key>::type )" declares a non-template function -- add <> to refer to a template instance /home/lai/boost_1_570/boost/mpl/map/aux/item.hpp:53:109: warning: friend declaration ‘boost::mpl::aux::type_wrapper boost::mpl::operator/(const boost::mpl::m_item<Key, T, Base>&, boost::mpl::aux::type_wrapper)’ declares a non-template function [-Wnon-template-friend] BOOST_MPL_AUX_MAP_OVERLOAD( aux::type_wrapper, VALUE_BY_KEY, m_item, aux::type_wrapper ); ^ /home/lai/boost_1_570/boost/mpl/map/aux/item.hpp:53:109: note: (if this is not what you intended, make sure the function template has already been declared and add <> after the function name here) /home/lai/boost_1_570/boost/mpl/map/aux/item.hpp:54:90: warning: 
friend declaration ‘boost::mpl::aux::type_wrapper<boost::mpl::pair<T1, T2> > boost::mpl::operator|(const boost::mpl::m_item<Key, T, Base>&, boost::mpl::m_item<Key, T, Base>::order)’ declares a non-template function [-Wnon-template-friend] BOOST_MPL_AUX_MAP_OVERLOAD( aux::type_wrapper, ITEM_BY_ORDER, m_item, order ); ^ /home/lai/boost_1_570/boost/mpl/map/aux/item.hpp:55:85: warning: friend declaration ‘char (& boost::mpl::operator||(const boost::mpl::m_item<Key, T, Base>&, boost::mpl::aux::type_wrapper))[boost::mpl::m_item<Key, T, Base>::order:: value]’ declares a non-template function [-Wnon-template-friend] BOOST_MPL_AUX_MAP_OVERLOAD( ordertag, ORDER_BY_KEY, m_item, aux::type_wrapper ); ^ /home/lai/boost_1_570/boost/mpl/map/aux/item.hpp:69:121: warning: friend declaration ‘boost::mpl::aux::typewrapper<mpl::void_> boost::mpl::operator/(const boost::mpl::m_mask<Key, Base>&, boost::mpl::aux::type_wrapper)’ declares a non-template function [-Wnon-template-friend] BOOST_MPL_AUX_MAP_OVERLOAD( aux::typewrapper<void>, VALUE_BY_KEY, m_mask, aux::type_wrapper ); ^ /home/lai/boost_1_570/boost/mpl/map/aux/item.hpp:70:94: warning: friend declaration ‘boost::mpl::aux::typewrapper<mpl::void_> boost::mpl::operator|(const boost::mpl::m_mask<Key, Base>&, boost::mpl::m_mask<Key, Base>::keyorder)’ declares a non-template function [-Wnon-template-friend] BOOST_MPL_AUX_MAP_OVERLOAD( aux::typewrapper<void>, ITEM_BY_ORDER, m_mask, keyorder ); ^ Scanning dependencies of target picongpu [ 16%] Building CXX object build_picongpu/CMakeFiles/picongpu.dir/particlePatches.cpp.o [ 25%] Building CXX object build_picongpu/CMakeFiles/picongpu.dir/patchReader.cpp.o [ 33%] Building CXX object build_picongpu/CMakeFiles/picongpu.dir/ArgsParser.cpp.o [ 41%] Building CXX object build_picongpu/CMakeFiles/picongpu.dir/stringHelpers.cpp.o [ 50%] Linking CXX executable picongpu /usr/bin/ld: warning: libmpi.so.12, needed by /home/lai/lib/hdf5/lib/libhdf5.so, may conflict with libmpi.so.20 
/home/lai/lib/splash/lib/libsplash.a(SerialDataCollector.cpp.o): In function `MPI::Op::Init(void (*)(void const*, void*, int, MPI::Datatype const&), bool)':
SerialDataCollector.cpp:(.text._ZN3MPI2Op4InitEPFvPKvPviRKNS_8DatatypeEEb[_ZN3MPI2Op4InitEPFvPKvPviRKNS_8DatatypeEEb]+0x16): undefined reference to `ompi_mpi_cxx_op_intercept'
/home/lai/lib/splash/lib/libsplash.a(SerialDataCollector.cpp.o): In function `MPI::Intracomm::Create_graph(int, int const*, int const*, bool) const':
SerialDataCollector.cpp:(.text._ZNK3MPI9Intracomm12Create_graphEiPKiS2_b[_ZNK3MPI9Intracomm12Create_graphEiPKiS2_b]+0x38): undefined reference to `MPI::Comm::Comm()'
/home/lai/lib/splash/lib/libsplash.a(SerialDataCollector.cpp.o): In function `MPI::Graphcomm::Clone() const':
SerialDataCollector.cpp:(.text._ZNK3MPI9Graphcomm5CloneEv[_ZNK3MPI9Graphcomm5CloneEv]+0x35): undefined reference to `MPI::Comm::Comm()'
/home/lai/lib/splash/lib/libsplash.a(SerialDataCollector.cpp.o): In function `MPI::Cartcomm::Clone() const':
SerialDataCollector.cpp:(.text._ZNK3MPI8Cartcomm5CloneEv[_ZNK3MPI8Cartcomm5CloneEv]+0x35): undefined reference to `MPI::Comm::Comm()'
/home/lai/lib/splash/lib/libsplash.a(SerialDataCollector.cpp.o): In function `MPI::Intercomm::Merge(bool) const':
SerialDataCollector.cpp:(.text._ZNK3MPI9Intercomm5MergeEb[_ZNK3MPI9Intercomm5MergeEb]+0x35): undefined reference to `MPI::Comm::Comm()'
/home/lai/lib/splash/lib/libsplash.a(SerialDataCollector.cpp.o): In function `MPI::Intracomm::Create(MPI::Group const&) const':
SerialDataCollector.cpp:(.text._ZNK3MPI9Intracomm6CreateERKNS_5GroupE[_ZNK3MPI9Intracomm6CreateERKNS_5GroupE]+0x37): undefined reference to `MPI::Comm::Comm()'
/home/lai/lib/splash/lib/libsplash.a(SerialDataCollector.cpp.o):SerialDataCollector.cpp:(.text._ZNK3MPI9Intracomm5SplitEii[_ZNK3MPI9Intracomm5SplitEii]+0x36): more undefined references to `MPI::Comm::Comm()' follow
/home/lai/lib/splash/lib/libsplash.a(SerialDataCollector.cpp.o):(.rodata._ZTVN3MPI8DatatypeE[_ZTVN3MPI8DatatypeE]+0x78): undefined reference to `MPI::Datatype::Free()'
/home/lai/lib/splash/lib/libsplash.a(SerialDataCollector.cpp.o):(.rodata._ZTVN3MPI3WinE[_ZTVN3MPI3WinE]+0x48): undefined reference to `MPI::Win::Free()'
collect2: error: ld returned 1 exit status
build_picongpu/CMakeFiles/picongpu.dir/build.make:215: recipe for target 'build_picongpu/picongpu' failed
make[2]: *** [build_picongpu/picongpu] Error 1
CMakeFiles/Makefile2:90: recipe for target 'build_picongpu/CMakeFiles/picongpu.dir/all' failed
make[1]: *** [build_picongpu/CMakeFiles/picongpu.dir/all] Error 2
Makefile:127: recipe for target 'all' failed
make: *** [all] Error 2

I think the error may start at [ 50%] Linking executable picongpu. Did I install anything wrongly? Or did I not install something necessary? Thank you for helping me a lot.

ax3l commented 7 years ago

OK, we are getting there ;)

It looks to me like something is wrong with your HDF5 install. Did you take care to pass --enable-parallel --enable-shared to the HDF5 configure?

On Ubuntu, just remove $HOME/lib/hdf5 (and unset the environment hints in bash for HDF5), run apt-get install libopenmpi-dev libhdf5-openmpi-dev, and compile & install libSplash again.
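
The linker warning about libmpi.so.12 versus libmpi.so.20 in the log suggests that HDF5 and the rest of the build may have been linked against different MPI installs. A rough way to check this, assuming a Linux system where ldd is available, is to list which libmpi sonames a library pulls in. list_mpi_sonames is a made-up helper name, and the path in the usage comment is simply the one from the log above:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read ldd-style output on stdin and print the
# distinct libmpi sonames it mentions, one per line.
list_mpi_sonames() {
    grep -o 'libmpi\.so\.[0-9][0-9]*' | sort -u
}

# Example:
#   ldd "$HOME/lib/hdf5/lib/libhdf5.so" | list_mpi_sonames
```

If the soname printed here differs from the MPI library that CMake reports under Found MPI_C, the two were probably built against different MPI versions.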

8i161029 commented 7 years ago

Thank you! I think I installed successfully. Here is the log:

-- Found CUDA: /usr/local/cuda (found suitable version "8.0", minimum required is "5.5")
-- Compiling as C++11...
-- Debug version
-- Boost version: 1.57.0
-- Found the following Boost libraries:
--   program_options
--   regex
--   filesystem
--   system
--   thread
--   math_tr1
-- Found CUDA: /usr/local/cuda (found suitable version "8.0", minimum required is "5.0")
-- Boost version: 1.57.0
-- Found CUDA: /usr/local/cuda (found version "8.0")
-- Could NOT find NVML (missing: NVML_LIBRARY)
-- Boost version: 1.57.0
-- Found the following Boost libraries:
--   program_options
-- Found 'adios_config': /usr/bin/adios_config
-- The directory provided by 'adios_config -d' does not exist:
-- Could NOT find ADIOS (missing: ADIOS_LIBRARIES ADIOS_INCLUDE_DIRS) (Required is at least version "1.10.0")
-- libSplash supports PARALLEL output
-- Configuring done
-- Generating done
-- Build files have been written to: /home/lai/build
[ 8%] Linking CXX executable picongpu
/usr/bin/ld: warning: libmpi.so.12, needed by /usr/lib/x86_64-linux-gnu/hdf5/openmpi/lib/libhdf5.so, may conflict with libmpi.so.20
[ 50%] Built target picongpu
[ 58%] Linking CXX executable cuda_memtest
[ 83%] Built target cuda_memtest
[100%] Built target mpiInfo
Install the project...
-- Install configuration: ""
-- Installing: /home/lai/paramSets/case001/bin/picongpu
-- Set runtime path of "/home/lai/paramSets/case001/bin/picongpu" to ""
-- Up-to-date: /home/lai/paramSets/case001/bin
-- Installing: /home/lai/paramSets/case001/bin/cuda_memtest.sh
-- Installing: /home/lai/paramSets/case001/bin/cuda_memtest
-- Set runtime path of "/home/lai/paramSets/case001/bin/cuda_memtest" to ""
-- Installing: /home/lai/paramSets/case001/bin/mpiInfo
-- Set runtime path of "/home/lai/paramSets/case001/bin/mpiInfo" to ""

I moved on to the next step, "Run". I entered tbg -s qsub -c submit/0016gpus.cfg -t submit/hypnos/k20_profile.tpl $PICHOME/runs/testBatch01 and here is the output:

The given tpl file "submit/hypnos/k20_profile.tpl" does not exist (-t|--tpl).

Maybe I am missing some files. How can I solve this? I noticed the note "(Again, do NOT use your home $HOME/runs - change this path!)" in the install guide. My $PICHOME is $HOME. Do I need to change it? Thank you!

ax3l commented 7 years ago

Great, the install worked out! :sparkles:

Just continue reading the manual in-order: https://picongpu.readthedocs.io/en/dev/usage/tbg.html#usage

What you need is the right runtime configuration file for -c on your desktop (I guess you have one GPU; otherwise adjust an existing file to use one GPU) and the right template for -t, which in your case is either mpirun or mpiexec, running simply in your bash.

It then looks something like this on a desktop:

tbg -s bash -c submit/0001gpu.cfg -t submit/bash/bash_mpirun.tpl

PrometheusPi commented 7 years ago

Hi @8i161029 , I suspect you ran tbg not from within /home/lai/paramSets/case001. Is this correct? In /home/lai/paramSets/case001 there should be a submit directory which includes *.cfg files and directories with *.tpl files.

Furthermore, I assume you are not running PIConGPU on the hypnos cluster at HZDR in Germany. Thus hypnos/k20_profile.tpl will not create a correct submit file for your cluster, only for the hypnos cluster at HZDR.

[EDIT:] only valid on compute clusters [EDIT END] In order to create a submit file for your cluster, please adjust a *.tpl file accordingly. For details please see: https://picongpu.readthedocs.io/en/dev/usage/tbg.html You can find several examples for various clusters for both PBS and slurm in the submit directory. It is probably best to adjust a similar .tpl file to your needs.

[EDIT:] this is better described by @ax3l above [EDIT END] To avoid generating an initial *.tpl file for the first runs on your cluster, you could also use the bash template bash_mpiexec.tpl. This creates a simulation directory and a shell script. By allocating a node (with GPUs) interactively on your cluster, you can then just execute the script.
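
Since tbg reported that the template given with -t does not exist, it can help to check both paths relative to the current directory before submitting. The following is only a minimal sketch; check_tbg_inputs is a hypothetical helper name and not part of tbg or PIConGPU.

```shell
# Hypothetical helper (not part of tbg): verify that the files passed to
# -c and -t exist relative to the current working directory.
check_tbg_inputs() {
    cfg="$1"
    tpl="$2"
    missing=0
    for f in "$cfg" "$tpl"; do
        if [ ! -f "$f" ]; then
            echo "missing: $f (current directory: $PWD)" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example use from inside the parameter set directory:
#   cd $HOME/paramSets/case001
#   check_tbg_inputs submit/0001gpus.cfg submit/bash/bash_mpirun.tpl
```

If the check fails, listing the submit directory with ls shows which template directories are actually present.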

8i161029 commented 7 years ago

Do I need to modify this command: tbg -s qsub -c submit/0016gpus.cfg -t submit/hypnos/k20_profile.tpl $PICHOME/runs/testBatch01? But I can see 0016gpus.cfg and k20_profile.tpl under $HOME/paramSets/case001/submit. Here is the output of ls $HOME/paramSets/case001/submit:

0001gpus.cfg bash joker-tud openib.conf taurus-tud 0008gpus.cfg cuda.filter judge-fzj pizdaint-cscs titan-ornl 0016gpus.cfg davinci-rice keeneland-gt scorep.filter 0032gpus.cfg hypnos-hzdr lawrencium-lbnl submitAction.sh

output of ls $HOME/paramSets/case001/submit/bash:

bash_mpiexec.tpl bash_mpirun.tpl

and ls $HOME/paramSets/case001/submit/hypnos-hzdr

fermi_profile.tpl k20_vampir_profile.tpl picongpu.profile.example k20_autoWait_profile.tpl k20_wait_profile.tpl k20_profile.tpl k80_profile.tpl

I tried ~/paramSets/case001$ tbg -s bash -c submit/0001gpus.cfg -t submit/bash/bash_mpirun.tpl $HOME/runs/testBatch01

Here is the output:

tbg/submit.start: line 32: ( TBG_tasks + TBG_gpusPerNode -1 ) / TBG_gpusPerNode: division by 0 (error token is "TBG_gpusPerNode")
Running program...
tbg/submit.start: line 41: /home/lai/picongpu.profile: No such file or directory
Data for JOB [2846,1] offset 0
======================== JOB MAP ========================
Data for node: lai-G56JR Num slots: 4 Max slots: 0 Num procs: 1
Process OMPI jobid: [2846,1] App: 0 Process rank: 0

mpiInfo: error while loading shared libraries: libmpi.so.20: cannot open shared object file: No such file or directory
ERROR: Invalid option:2
cuda_memtest crash: see file /home/lai/runs/testBatch01/simOutput/cuda_memtestlai-G56JR.err


Primary job terminated normally, but 1 process returned a non-zero exit code.. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was: Process name: [[2846,1],0] Exit code: 1


Primary job terminated normally, but 1 process returned a non-zero exit code.. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was: Process name: [[2816,1],0] Exit code: 1

It looks like I did something wrong. Could you please advise? Thank you!

ax3l commented 7 years ago

Great, don't worry. Just take the template file submit/bash/bash_mpirun.tpl and comment out line 41 by prefixing it with a #. That line tries to load your environment by "sourcing" $HOME/picongpu.profile (similar to what a .bashrc file is used for, just manually and application-specific).

The picongpu.profile file is described in the manual and we usually use it to load a software environment on HPC systems. In your desktop case, just remove the requirement for a $HOME/picongpu.profile file (as described in the paragraph above) or create an empty one with:

touch $HOME/picongpu.profile
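
A third option, not taken from the shipped template and shown here only as a sketch, is to source the profile only when it exists. source_if_present is a hypothetical helper name.

```shell
# Hypothetical helper: source a profile file only when it exists, so that a
# missing $HOME/picongpu.profile does not produce "No such file or directory".
source_if_present() {
    if [ -f "$1" ]; then
        . "$1"
    else
        echo "note: $1 not found, continuing without it" >&2
    fi
}

# Example use:
#   source_if_present "$HOME/picongpu.profile"
```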

8i161029 commented 7 years ago

I added a # in front of . ~/picongpu.profile in submit/bash/bash_mpirun.tpl. Then I deleted $HOME/runs/testBatch01 and tried again, but got the same result. I found

mpiInfo: error while loading shared libraries: libmpi.so.20: cannot open shared object file: No such file or directory ERROR: Invalid option:2 cuda_memtest crash: see file /home/lai/runs/testBatch01/simOutput/cuda_memtestlai-G56JR.err

in the output. What is the problem? Did I comment out the wrong line? Thank you.

ax3l commented 7 years ago

Let's reduce the template even more, but first do a simple test.

What does a

cd $HOME/paramSets/case001
ldd ./bin/picongpu
mpirun -n 1 ./bin/picongpu -s 10 -d 1 1 1 -g 64 64 64

output for you?

8i161029 commented 7 years ago

output of ldd ./bin/picongpu:

linux-vdso.so.1 => (0x00007ffce8b1e000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007efee3b12000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007efee390e000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007efee3705000)
libmpi.so.20 => not found
libz.so.1 => /usr/local/lib/libz.so.1 (0x00007efee34e9000)
libboost_program_options.so.1.57.0 => not found
libboost_regex.so.1.57.0 => not found
libboost_filesystem.so.1.57.0 => not found
libboost_system.so.1.57.0 => not found
libboost_math_tr1.so.1.57.0 => not found
libhdf5_openmpi.so.10 => /usr/lib/x86_64-linux-gnu/libhdf5_openmpi.so.10 (0x00007efee3038000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007efee2d2f000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007efee29ad000)
libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007efee278a000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007efee2574000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007efee21ab000)
/lib64/ld-linux-x86-64.so.2 (0x0000562256015000)
libsz.so.2 => /usr/lib/x86_64-linux-gnu/libsz.so.2 (0x00007efee1fa7000)
libmpi.so.12 => /usr/lib/libmpi.so.12 (0x00007efee1cd1000)
libaec.so.0 => /usr/lib/x86_64-linux-gnu/libaec.so.0 (0x00007efee1ac8000)
libibverbs.so.1 => /usr/lib/libibverbs.so.1 (0x00007efee18b9000)
libopen-rte.so.12 => /usr/lib/libopen-rte.so.12 (0x00007efee163f000)
libopen-pal.so.13 => /usr/lib/libopen-pal.so.13 (0x00007efee13a1000)
libhwloc.so.5 => /usr/lib/x86_64-linux-gnu/libhwloc.so.5 (0x00007efee1167000)
libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007efee0f64000)
libnuma.so.1 => /usr/lib/x86_64-linux-gnu/libnuma.so.1 (0x00007efee0d58000)
libltdl.so.7 => /usr/lib/x86_64-linux-gnu/libltdl.so.7 (0x00007efee0b4e000)

and mpirun -n 1 ./bin/picongpu -s 10 -d 1 1 1 -g 64 64 64:

./bin/picongpu: error while loading shared libraries: libmpi.so.20: cannot open shared object file: No such file or directory

Primary job terminated normally, but 1 process returned a non-zero exit code.. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[27383,1],0] Exit code: 127

ax3l commented 7 years ago

Looks like your MPI library was not found!

Can you search for it via

locate libmpi.so
find /usr/lib -name "libmpi.so*"

?

Also, I prepared a simplified .tpl file for you.

ax3l commented 7 years ago

Actually, it looks like you compiled against two different MPI installations:

output of ldd ./bin/picongpu:
...
libmpi.so.20 => not found
...
libhdf5_openmpi.so.10 => /usr/lib/x86_64-linux-gnu/libhdf5_openmpi.so.10 (0x00007efee3038000)
...
libmpi.so.12 => /usr/lib/libmpi.so.12 (0x00007efee1cd1000)
...

And above during compile:

/usr/bin/ld: warning: libmpi.so.12, needed by /usr/lib/x86_64-linux-gnu/hdf5/openmpi/lib/libhdf5.so, may conflict with libmpi.so.20
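
This kind of mismatch can be spotted mechanically from the ldd output. The sketch below reads ldd output on stdin, prints every libmpi entry, and returns a non-zero status when more than one libmpi soname appears or when one is unresolved. check_mpi_links is a hypothetical helper name, not a PIConGPU tool.

```shell
# Hypothetical helper: read `ldd` output on stdin, print the libmpi entries,
# and fail if more than one distinct libmpi soname is linked or if any
# libmpi entry is reported as "not found".
check_mpi_links() {
    awk '
        $1 ~ /^libmpi\.so/ {
            print
            seen[$1] = 1
            if ($0 ~ /not found/) unresolved = 1
        }
        END {
            n = 0
            for (k in seen) n++
            if (n > 1 || unresolved) exit 1
        }
    '
}

# Example use:
#   ldd ./bin/picongpu | check_mpi_links || echo "MPI linkage looks inconsistent"
```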

8i161029 commented 7 years ago

Oh no, maybe I installed MPI incorrectly... Output of locate libmpi.so:

/etc/alternatives/libmpi.so
/home/lai/openmpi-2.1.0/ompi/.libs/libmpi.so
/home/lai/openmpi-2.1.0/ompi/.libs/libmpi.so.20
/home/lai/openmpi-2.1.0/ompi/.libs/libmpi.so.20.10.0
/home/lai/openmpi-2.1.0/ompi/.libs/libmpi.so.20.10.0T
/usr/lib/libmpi.so
/usr/lib/libmpi.so.12
/usr/lib/libmpi.so.12.0.2
/usr/lib/openmpi/lib/libmpi.so
/usr/lib/openmpi/lib/libmpi.so.12.0.2
/usr/local/openmpi/lib/libmpi.so
/usr/local/openmpi/lib/libmpi.so.20
/usr/local/openmpi/lib/libmpi.so.20.10.0

find /usr/lib -name "libmpi.so*":

/usr/lib/libmpi.so.12.0.2
/usr/lib/libmpi.so
/usr/lib/openmpi/lib/libmpi.so.12.0.2
/usr/lib/openmpi/lib/libmpi.so
/usr/lib/libmpi.so.12

ax3l commented 7 years ago

Ah yes, just get rid of the additional MPI installation in your $HOME :-)

rm -rf $HOME/openmpi-2.1.0
# and to be sure, rebuild also the software that depends on MPI
rm -rf $HOME/lib/hdf5
rm -rf $HOME/lib/splash

then

8i161029 commented 7 years ago

I did following things:

tbg/submit.start: line 32: ( TBG_tasks + TBG_gpusPerNode -1 ) / TBG_gpusPerNode: division by 0 (error token is "TBG_gpusPerNode")
Running program...
Data for JOB [18798,1] offset 0
======================== JOB MAP ========================
Data for node: lai-G56JR Num slots: 4 Max slots: 0 Num procs: 1
Process OMPI jobid: [18798,1] App: 0 Process rank: 0

mpiInfo: error while loading shared libraries: libmpi.so.20: cannot open shared object file: No such file or directory
ERROR: Invalid option:2
cuda_memtest crash: see file /home/lai/runs/testBatch01/simOutput/cuda_memtestlai-G56JR.err

Primary job terminated normally, but 1 process returned a non-zero exit code.. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was: Process name: [[18798,1],0] Exit code: 1


Primary job terminated normally, but 1 process returned a non-zero exit code.. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was: Process name: [[19089,1],0] Exit code: 1

I entered locate libmpi.so:

/etc/alternatives/libmpi.so
/usr/lib/libmpi.so
/usr/lib/libmpi.so.12
/usr/lib/libmpi.so.12.0.2
/usr/lib/openmpi/lib/libmpi.so
/usr/lib/openmpi/lib/libmpi.so.12.0.2
/usr/local/openmpi/lib/libmpi.so
/usr/local/openmpi/lib/libmpi.so.20
/usr/local/openmpi/lib/libmpi.so.20.10.0

and find /usr/lib -name "libmpi.so*":

/usr/lib/libmpi.so.12.0.2
/usr/lib/libmpi.so
/usr/lib/openmpi/lib/libmpi.so.12.0.2
/usr/lib/openmpi/lib/libmpi.so
/usr/lib/libmpi.so.12

I see there are still some Open MPI files in /usr/local/openmpi and /usr/lib. Should I delete them? But I notice that libmpi.so.20 is part of Open MPI. What should I do? Thank you.

ax3l commented 7 years ago

Somehow you have Open MPI 1.x, MPICH, and Open MPI 2 installed alongside each other.

Also, don't install HDF5 from source; use the package from apt-get install and only build libSplash on top of both.

8i161029 commented 7 years ago

Do I still have two versions of openmpi after I used rm -rf $HOME/openmpi-2.1.0? And how to install HDF5 from apt-get install? I tried sudo apt-get install HDF5 but unable to locate. Also what does it mean "build libSplash on top of both"? Sorry for many questions and thank you for helping me.

ax3l commented 7 years ago

Do I still have two versions of openmpi after I used rm -rf $HOME/openmpi-2.1.0?

Yes, you somehow have an MPI in /usr/lib/ and in /usr/lib/openmpi/lib/

And how to install HDF5 from apt-get install? I tried sudo apt-get install HDF5 but unable to locate.

See above: sudo apt-get install libopenmpi-dev libhdf5-openmpi-dev (Ubuntu package index)

Also what does it mean "build libSplash on top of both"?

Use apt-get to install MPI and HDF5 with MPI (parallel) support, as in the line above, and only build libSplash (which depends on both) from source.

Sorry for many questions and thank you for helping me.

No problem, you are welcome!

ax3l commented 7 years ago

@8i161029 I won't be online for a week to help you further. My idea would be to try to get rid of the MPI install in /usr/lib/libmpi.so*. Maybe check which MPI packages you have installed with apt list --installed | grep mpi. Is there more than one?

If you cannot ensure that the same MPI version is always used on your system, I would recommend installing nvidia docker to get a clean environment to work with. I have drafted an example in this thread and can guide you through it in 10 days from now in case you have any trouble.

Alternatively, the package manager spack can install and manage all PIConGPU dependencies for you. Either way, both approaches require you to learn how to use docker or spack.

8i161029 commented 7 years ago

Thank you for your patient help! I deeply appreciate that. I will try the method you recommended. Thank you very much!

ax3l commented 7 years ago

Can you post the output of apt list --installed | grep mpi please?

8i161029 commented 7 years ago

ok, here is the output:

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
compiz/xenial-updates,xenial-updates,now 1:0.9.12.2+16.04.20160823-0ubuntu1 all [installed]
compiz-core/xenial-updates,now 1:0.9.12.2+16.04.20160823-0ubuntu1 amd64 [installed]
compiz-gnome/xenial-updates,now 1:0.9.12.2+16.04.20160823-0ubuntu1 amd64 [installed]
compiz-plugins-default/xenial-updates,now 1:0.9.12.2+16.04.20160823-0ubuntu1 amd64 [installed]
compizconfig-settings-manager/xenial-updates,xenial-updates,now 1:0.9.12.2+16.04.20160823-0ubuntu1 all [installed]
libcompizconfig0/xenial-updates,now 1:0.9.12.2+16.04.20160823-0ubuntu1 amd64 [installed]
libexempi3/xenial,now 2.2.2-2 amd64 [installed]
libhdf5-openmpi-10/xenial,now 1.8.16+docs-4ubuntu1 amd64 [installed,automatic]
libhdf5-openmpi-dev/xenial,now 1.8.16+docs-4ubuntu1 amd64 [installed]
libmpich-dev/xenial,now 3.2-6build1 amd64 [installed,automatic]
libmpich12/xenial,now 3.2-6build1 amd64 [installed,automatic]
libopenmpi-dev/xenial,now 1.10.2-8ubuntu1 amd64 [installed]
libopenmpi1.10/xenial,now 1.10.2-8ubuntu1 amd64 [installed,automatic]
mpich/xenial,now 3.2-6build1 amd64 [installed]
openmpi-bin/xenial,now 1.10.2-8ubuntu1 amd64 [installed]
openmpi-common/xenial,xenial,now 1.10.2-8ubuntu1 all [installed,automatic]
python-compizconfig/xenial-updates,now 1:0.9.12.2+16.04.20160823-0ubuntu1 amd64 [installed,automatic]

ax3l commented 7 years ago

OK, just uninstall MPICH. This is the second MPI installation you are seeing, but you only need one:

sudo apt-get remove libmpich-dev libmpich12 mpich

Then build libSplash and PIConGPU again.
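
Because grep mpi also matches unrelated packages such as compiz, a narrower filter makes it easier to confirm afterwards that only one MPI implementation remains. The following is a rough sketch only; list_mpi_impls is a hypothetical helper name, and it looks at package names (the part before the first slash) in apt list --installed output.

```shell
# Hypothetical helper: read `apt list --installed` output on stdin and print
# which MPI implementations have packages installed, one name per line.
# Note: packages built against an implementation (e.g. libhdf5-openmpi-dev)
# also count toward that implementation.
list_mpi_impls() {
    awk -F/ '
        $1 ~ /mpich/   { impl["mpich"] = 1 }
        $1 ~ /openmpi/ { impl["openmpi"] = 1 }
        END { for (k in impl) print k }
    ' | sort
}

# Example use:
#   apt list --installed 2>/dev/null | list_mpi_impls
```

If this prints both mpich and openmpi, two implementations are still installed side by side.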

8i161029 commented 7 years ago

Hi, I ran sudo apt-get remove libmpich-dev libmpich12 mpich and rebuilt libSplash and PIConGPU. Then I ran it again, but there was still an error. Here is the output:

lai@lai-G56JR:~/paramSets/case001$ tbg -s bash -c submit/0001gpus.cfg -t submit/bash/bash_mpirun.tpl $HOME/runs/testBatch01
tbg/submit.start: line 32: ( TBG_tasks + TBG_gpusPerNode -1 ) / TBG_gpusPerNode: division by 0 (error token is "TBG_gpusPerNode")
Running program...
Data for JOB [7202,1] offset 0

======================== JOB MAP ========================

Data for node: lai-G56JR Num slots: 4 Max slots: 0 Num procs: 1 Process OMPI jobid: [7202,1] App: 0 Process rank: 0

===========================================================
mpiInfo: error while loading shared libraries: libmpi.so.20: cannot open shared object file: No such file or directory
ERROR: Invalid option:2
cuda_memtest crash: see file /home/lai/runs/testBatch01/simOutput/cuda_memtestlai-G56JR.err

Primary job terminated normally, but 1 process returned a non-zero exit code.. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[7202,1],0] Exit code: 1


Primary job terminated normally, but 1 process returned a non-zero exit code.. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[7252,1],0] Exit code: 1

I checked the MPI installation with locate libmpi.so:

/etc/alternatives/libmpi.so
/usr/lib/libmpi.so
/usr/lib/libmpi.so.12
/usr/lib/libmpi.so.12.0.2
/usr/lib/openmpi/lib/libmpi.so
/usr/lib/openmpi/lib/libmpi.so.12.0.2
/usr/local/openmpi/lib/libmpi.so
/usr/local/openmpi/lib/libmpi.so.20
/usr/local/openmpi/lib/libmpi.so.20.10.0

find /usr/lib -name "libmpi.so*":

/usr/lib/libmpi.so.12.0.2
/usr/lib/libmpi.so
/usr/lib/openmpi/lib/libmpi.so.12.0.2
/usr/lib/openmpi/lib/libmpi.so
/usr/lib/libmpi.so.12

and apt list --installed | grep mpi:

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
compiz/xenial-updates,xenial-updates,now 1:0.9.12.2+16.04.20160823-0ubuntu1 all [installed]
compiz-core/xenial-updates,now 1:0.9.12.2+16.04.20160823-0ubuntu1 amd64 [installed]
compiz-gnome/xenial-updates,now 1:0.9.12.2+16.04.20160823-0ubuntu1 amd64 [installed]
compiz-plugins-default/xenial-updates,now 1:0.9.12.2+16.04.20160823-0ubuntu1 amd64 [installed]
compizconfig-settings-manager/xenial-updates,xenial-updates,now 1:0.9.12.2+16.04.20160823-0ubuntu1 all [installed]
libcompizconfig0/xenial-updates,now 1:0.9.12.2+16.04.20160823-0ubuntu1 amd64 [installed]
libexempi3/xenial,now 2.2.2-2 amd64 [installed]
libhdf5-openmpi-10/xenial,now 1.8.16+docs-4ubuntu1 amd64 [installed,automatic]
libhdf5-openmpi-dev/xenial,now 1.8.16+docs-4ubuntu1 amd64 [installed]
libopenmpi-dev/xenial,now 1.10.2-8ubuntu1 amd64 [installed]
libopenmpi1.10/xenial,now 1.10.2-8ubuntu1 amd64 [installed,automatic]
openmpi-bin/xenial,now 1.10.2-8ubuntu1 amd64 [installed]
openmpi-common/xenial,xenial,now 1.10.2-8ubuntu1 all [installed,automatic]
python-compizconfig/xenial-updates,now 1:0.9.12.2+16.04.20160823-0ubuntu1 amd64 [installed,automatic]

I can see libmpi.so.20 in /usr/local/openmpi/lib but not in /usr/lib. Is that why libmpi.so.20 cannot be found? How can I make sure libmpi.so.20 is found? Thank you!

ax3l commented 7 years ago

Looks OK so far, can you show your cfg file please? Looks like you entered a GPU dimension (-d) with zero.

8i161029 commented 7 years ago

I used the 0001gpus.cfg from case001/submit. Maybe I should try another cfg file? In submit there are 0001gpus.cfg, 0008gpus.cfg, 0016gpus.cfg, and 0032gpus.cfg. Which file should I use? Thank you!

ax3l commented 7 years ago

Ah sorry, it's in the .tpl file. Can you please check the file submit/bash/bash_mpirun.tpl in your parameter set and see if TBG_gpusPerNode is correctly set to 1?
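
For context, the expression in the error message, ( TBG_tasks + TBG_gpusPerNode -1 ) / TBG_gpusPerNode, is a ceiling division, and an empty or zero TBG_gpusPerNode makes the shell divide by zero. The sketch below reproduces that arithmetic with a guard. nodes_needed is a hypothetical name used for illustration and is not the code in tbg/submit.start.

```shell
# Hypothetical illustration of the ceiling division from the error message:
#   nodes = (tasks + gpusPerNode - 1) / gpusPerNode
# Assumes plain decimal integers without leading zeros.
nodes_needed() {
    tasks="$1"
    per_node="$2"
    for v in "$tasks" "$per_node"; do
        case "$v" in
            ''|*[!0-9]*)
                echo "expected a non-negative integer, got '$v'" >&2
                return 1
                ;;
        esac
    done
    if [ "$per_node" -eq 0 ]; then
        echo "gpusPerNode must not be zero" >&2
        return 1
    fi
    echo $(( (tasks + per_node - 1) / per_node ))
}

# Example use:
#   nodes_needed 1 1
```

To see how the variable is set in the template itself, grep -n TBG_gpusPerNode submit/bash/bash_mpirun.tpl prints the matching lines with their line numbers.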