JeffersonLab / chroma

The Chroma Software System for Lattice QCD
http://jeffersonlab.github.io/chroma

Multinode Support #59

Closed sreevalli92 closed 2 years ago

sreevalli92 commented 4 years ago

I want to try CHROMA on a multi-node, multi-GPU setup. Is this supported, and does it scale?

eromero-vlc commented 2 years ago

Yes, it does. You need to compile chroma with QDP-JIT, https://github.com/JeffersonLab/qdp-jit, instead of QDP++. Then launch as many processes on each node as there are GPUs on that node.
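As a rough sketch of the launch rule above: with one process per GPU, the total rank count is the number of nodes times the GPUs per node. The helper name `total_ranks`, the use of `mpirun -np`, and the input file name are assumptions for illustration only; the actual launcher (mpirun, srun, jsrun, ...) and its flags depend on the cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper: one process per GPU means
# total ranks = nodes * GPUs per node.
total_ranks() {
    local nodes=$1 gpus_per_node=$2
    echo $(( nodes * gpus_per_node ))
}

# Example: 2 nodes with 4 GPUs each.
NP=$(total_ranks 2 4)

# Dry run: print the launch line instead of executing it.
# "mpirun -np" and "input.xml" are placeholders, not taken from this thread.
echo "mpirun -np ${NP} ./chroma -i input.xml"
```

Binding each rank to its own GPU is left to the launcher or the runtime and is not covered by this sketch.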

cpviolator commented 2 years ago

Are there any definitive build instructions available for a QDP-JIT + CHROMA + QMP stack? I have a difficult time selecting the correct combination of CUDA, llvm compiler, CHROMA branch, and QDP-JIT branch that can robustly build on NVIDIA architectures.

fwinter commented 2 years ago

Since the main complications in building stem from qdp-jit, let me throw together some instructions on how to build it.

cpviolator commented 2 years ago

I just now saw that the devel branch has CMake support, that's great! I'm building LLVM 13 out of the box and will try to link QDP-JIT against it with CHROMA master. I have an Nc agnostic stack with QUDA support for the inverter.

If the LLVM 13 builds and I can link "out of the box" (or with minor tweaks to the default flags) I can rustle up some CMake code that allows the user to pull LLVM from source and build as a dependency.

fwinter commented 2 years ago

Yes, currently you want to use the devel branches of qdp-jit and chroma, and master for QMP. CMake is a requirement now; autotools doesn't swing it no more. LLVM 13 is good too. Do a "release" build with the "nvptx" target enabled; no need for clang etc. A C++20 compiler is required (GCC 10 or 11) unless you turn off the propagator optimizations, in which case a C++14 compiler can build it.
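The requirements above could be checked and applied roughly as follows. This is a sketch under assumptions: the helper names and all paths are hypothetical, and the configure line is only printed, not run. `CMAKE_BUILD_TYPE` and `LLVM_TARGETS_TO_BUILD` are standard LLVM CMake options; the thread names only the nvptx target, so whether a host target must be listed as well is not settled here.

```shell
#!/usr/bin/env bash
# Return success if a GCC version string has major version >= 10,
# the minimum mentioned above for the C++20 code path.
gcc_major_ok() {
    local version=$1
    local major=${version%%.*}
    [ "$major" -ge 10 ]
}

# Dry run: print an LLVM configure line rather than running it.
# Source, build, and install paths are placeholders.
llvm_configure_cmd() {
    echo "cmake -S llvm-project/llvm -B build_llvm" \
         "-DCMAKE_BUILD_TYPE=Release" \
         "-DLLVM_TARGETS_TO_BUILD=NVPTX" \
         "-DCMAKE_INSTALL_PREFIX=\$HOME/install_llvm"
}

if gcc_major_ok "$(gcc -dumpversion 2>/dev/null || echo 0)"; then
    echo "GCC looks new enough for the C++20 path"
else
    echo "GCC older than 10 (or not found): consider disabling the propagator optimizations"
fi
llvm_configure_cmd
```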

Building LLVM as a dependency sounds intriguing. Definitely curious.

cpviolator commented 2 years ago

I made some decent headway with LLVM and QMP, but QIO, xpath_reader, and filedb are missing. Must one download and build them manually?

Building For CUDA
 QDP++:  Configuring System
 QDP++:  Nc=4
 QDP++:  Nd=4
 QDP++:  Ns=4
 QDP++: Configuring for Parscalar build
 QDP++: Enabling CB2 (4D Checkerboard) Layout
 QDP++: Setting Base Precision to 64
 QDP++: Setting alignment size to 64
 Found LLVM 13.0.1
 Using LLVMConfig.cmake in /usr/workspace/howarth1/StealthDM/qdpjit_sandbox/install_llvm/lib/cmake/llvm
 CMake Error at CMakeLists.txt:293 (add_subdirectory):
   The source directory

     /usr/workspace/howarth1/StealthDM/qdpjit_sandbox/qdp-jit/other_libs/xpath_reader

   does not contain a CMakeLists.txt file.

 CMake Error at CMakeLists.txt:298 (add_subdirectory):
   The source directory

     /usr/workspace/howarth1/StealthDM/qdpjit_sandbox/qdp-jit/other_libs/filedb

   does not contain a CMakeLists.txt file.

 CMake Error at CMakeLists.txt:302 (add_subdirectory):
   The source directory

     /usr/workspace/howarth1/StealthDM/qdpjit_sandbox/qdp-jit/other_libs/qio

   does not contain a CMakeLists.txt file.

 Configuring incomplete, errors occurred!
 See also "/usr/workspace/howarth1/StealthDM/qdpjit_sandbox/build/CMakeFiles/CMakeOutput.log".
 See also "/usr/workspace/howarth1/StealthDM/qdpjit_sandbox/build/CMakeFiles/CMakeError.log".

fwinter commented 2 years ago

Are the directories there? Make sure to clone recursively, and when you switch branches, make sure the submodules get updated as well (git submodule update --recursive).
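A minimal sketch of that check, assuming the three bundled libraries live under other_libs/ as in the CMake errors above. The helper name `missing_submodules` is hypothetical, and the `--init` flag in the comment is an addition that also initializes submodules that were never checked out.

```shell
#!/usr/bin/env bash
# Typical commands (not run here):
#   git clone --recursive <qdp-jit repository URL>
#   git checkout devel
#   git submodule update --init --recursive

# Hypothetical helper: list the bundled libraries whose directory
# lacks a CMakeLists.txt, which is the symptom in the log above.
missing_submodules() {
    local root=$1 lib
    for lib in xpath_reader filedb qio; do
        [ -f "$root/other_libs/$lib/CMakeLists.txt" ] || echo "$lib"
    done
}

# Usage: pass the qdp-jit source directory (defaults to the current one).
missing=$(missing_submodules "${1:-.}")
if [ -n "$missing" ]; then
    echo "Submodules not populated:" $missing
else
    echo "All bundled libraries have a CMakeLists.txt"
fi
```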

cpviolator commented 2 years ago

Thanks @fwinter, I always forget about recursive cloning. May I ask why QMP is not also part of the CMake dependency tree?

With my own gcc 10 install and out-of-the-box LLVM I can cleanly install QDP-JIT devel. I'm pretty sure one can add a download-and-build option for LLVM, but I guess it's cleaner for the user to ensure that they have a sufficiently recent gcc! I encountered some errors when compiling with Nc=4 on the CHROMA side; I can work through those as needed.

cpviolator commented 2 years ago

OK, very close...

The stack builds all the way to CHROMA executable link time. The error is posted here:

howarth1@lassen708:/usr/WS2/howarth1/StealthDM/qdpjit_sandbox/build_chroma-Nc4$ make    
[  0%] Built target qdp_lapack
[ 91%] Built target chromalib
[ 92%] Linking CXX executable purgaug
/usr/workspace/howarth1/StealthDM/qdpjit_sandbox/install/lib/libjit.a(qdp_llvm.cc.o): In function `QDP::llvm_init_libdevice()':
qdp_llvm.cc:(.text+0x2084): undefined reference to `llvm::parseBitcodeFile(llvm::MemoryBufferRef, llvm::LLVMContext&, llvm::function_ref<llvm::Optional<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > (llvm::StringRef)>)'
/usr/workspace/howarth1/StealthDM/qdpjit_sandbox/install/lib/libjit.a(qdp_llvm.cc.o): In function `QDP::llvm_backend_init_rocm()':
qdp_llvm.cc:(.text+0x2590): undefined reference to `llvm::TargetRegistry::lookupTarget(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)'
/usr/workspace/howarth1/StealthDM/qdpjit_sandbox/install/lib/libjit.a(qdp_llvm.cc.o): In function `QDP::llvm_backend_init_cuda()':
qdp_llvm.cc:(.text+0x44e0): undefined reference to `llvm::TargetRegistry::lookupTarget(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, llvm::Triple&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)'
collect2: error: ld returned 1 exit status
make[2]: *** [mainprogs/main/CMakeFiles/purgaug.dir/build.make:279: mainprogs/main/purgaug] Error 1
make[1]: *** [CMakeFiles/Makefile2:377: mainprogs/main/CMakeFiles/purgaug.dir/all] Error 2
make: *** [Makefile:136: all] Error 2

There are three LLVM functions that are missing. They exist in the LLVM (13) source. I'll roll back to LLVM 12 and see if that fixes anything.

cpviolator commented 2 years ago

Same with LLVM 12. Linking errors are usually fixable with some CMake tweaking, so I'll give it some more thought. If you have a QDP-JIT + CHROMA stack script I'll happily take it, but if not I can post my findings here. I'll need to make some minor pushes to QDP-JIT for Nc>3 builds (adding llvm_ne, merging CHROMA master into devel) but nothing major.

fwinter commented 2 years ago

Maybe it's nothing, but I saw you're using LLVM 13.0.1 (your output: Found LLVM 13.0.1). The library was developed with version 13.0.0. It's probably nothing, but where did you find that version? I can't see it on llvm.org.

To better understand these linking errors, how about building with 'make VERBOSE=1'?
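One way to condense such a verbose build log is to pull out just the missing symbols. This is a sketch, not part of any of the tools involved: `undefined_symbols` is a hypothetical helper, it assumes the linker quotes symbols as in the output earlier in this thread, and the demonstration line is a shortened version of one of those errors.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a build log on stdin and print each
# distinct symbol reported as an undefined reference.
undefined_symbols() {
    sed -n 's/.*undefined reference to `\(.*\).$/\1/p' | sort -u
}

# Typical use (not run here):
#   make VERBOSE=1 2>&1 | tee build.log
#   undefined_symbols < build.log

# Small self-contained demonstration with a shortened log line.
printf '%s\n' "qdp_llvm.cc:(.text+0x2084): undefined reference to \`llvm::parseBitcodeFile(llvm::MemoryBufferRef)'" \
    | undefined_symbols
```

The full link command printed by VERBOSE=1 then shows which libraries, and from which install prefix, were actually passed to the linker.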

You need to add llvm_ne? As in "not equal"? I'm going to try and build with Nc=4. Let me see what happens...

EDIT: With Nc=4 qdpjit builds fine, chroma did not. The reunit routine seems to be Nc=3 specific and needs guarding.

cpviolator commented 2 years ago

After merging master into devel for chroma, the reunit issue is resolved. It's probably this addition that required llvm_ne:

https://github.com/JeffersonLab/chroma/commit/f3adf17def1925076289ff9099ad59264498c138#diff-f993e2f937d5d0576a26fc860e16be634f3f00052515dc82f439878afed130a5R404

line 404 on the commit.

cpviolator commented 2 years ago

This is the git log from my LLVM, pulled from GitHub, branch release/13.x:

howarth1@lassen709:/usr/workspace/howarth1/StealthDM/qdpjit_sandbox/llvm-project$ git log
commit 73daeb3d507f7c8da52a35311ec1799f161ac7a5 (HEAD -> release/13.x, origin/release/13.x)
Author: Artem Belevich <tra@google.com>
Date:   Wed Sep 29 15:02:36 2021 -0700

    [CUDA] Make sure <string.h> is included with original __THROW defined.

    Otherwise we may end up with an inconsistent redeclarations of the standard
    library functions if _FORTIFY_SOURCE is in effect.

    https://bugs.llvm.org/show_bug.cgi?id=47869

    Differential Revision: https://reviews.llvm.org/D110781

    (cherry picked from commit 29e00b29f76adb15a51c1ccd6c1fdb6fce5f4d7b)

cpviolator commented 2 years ago

Also getting errors building the QDP examples. If you grant push rights I can push the version I'm working with.

fwinter commented 2 years ago

Hey Dean. Indeed llvm_ne was missing. It was actually called in a templated function but standard Chroma never instantiated it. I can see how you ran into this. Fixed.

I see you're using typedefs like LatticeColorMatrixFNC. Isn't this exactly what LatticeColorMatrixF is? In which case I'd suggest using the latter. Then there's no need to add these.

I see you're checking out LLVM via git pull. I've done this in the past but have switched to using tagged releases. More predictable outcome. I have thrown together some build instructions. Not sure this helps with the linking issue.
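For illustration, fetching a tagged release rather than a moving release branch might look like the sketch below. LLVM release tags on GitHub follow the `llvmorg-<version>` pattern; the helper name `llvm_tag` is hypothetical, and the clone command is only printed, not run.

```shell
#!/usr/bin/env bash
# LLVM release tags on GitHub follow the pattern llvmorg-<version>.
llvm_tag() {
    echo "llvmorg-$1"
}

# Dry run: print a shallow clone of a single tagged release
# instead of tracking the moving release/13.x branch.
TAG=$(llvm_tag 13.0.0)
echo "git clone --depth 1 --branch ${TAG} https://github.com/llvm/llvm-project.git"
```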

There are other routines that are Nc=3 specific, like polylp. From eyeballing this, it seems silly to restrict it to Nc=3. It might just work for general Nc.

cpviolator commented 2 years ago

So the issue with linking was of the PEBKAC type. Lassen has only gcc 8 and I was using a custom-built gcc 12 (which is doubly odd because I pulled the gcc 10 source). By turning off contraction optimisations the stack built and linked with gcc 8 and Nc=4, except for some Nc=3 specific routines that I can either turn on or instantiate manually.
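One plausible way to catch this kind of compiler mix-up early is to compare the C++ compiler that CMake recorded for each component of the stack. The sketch below reads the `CMAKE_CXX_COMPILER` entry from each build directory's CMakeCache.txt; the helper names and directory names are hypothetical, and a matching path alone does not prove the builds are ABI-compatible.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the C++ compiler recorded in a CMake
# build directory's CMakeCache.txt.
cache_compiler() {
    sed -n 's/^CMAKE_CXX_COMPILER:[A-Z]*=//p' "$1/CMakeCache.txt"
}

# Hypothetical helper: succeed only if all given build directories
# were configured with the same C++ compiler.
same_compiler() {
    local first other dir
    first=$(cache_compiler "$1")
    for dir in "$@"; do
        other=$(cache_compiler "$dir")
        if [ "$other" != "$first" ]; then
            echo "mismatch: $1 uses '$first', $dir uses '$other'"
            return 1
        fi
    done
}

# Usage (directory names are placeholders):
#   same_compiler build_llvm build_qdpjit build_chroma
```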

For qdp-jit, I placed Nc guards around ColorCross and the baryon contractions. I can also quickly write up some CMake that will pull a git-tagged LLVM and build it with the correct target type for NVIDIA or AMD architectures. The same can also be done for QMP. Is that something you'd like in QDP-JIT?

Thank you very much for your help, saved a lot of time!

cpviolator commented 2 years ago

Oh, the LatticeColorMatrixFNC and LatticeColorMatrixDNC are just preprocessor defines that preserve Nc=3 optimisations in CHROMA. If Nc=3 is selected then the templates are instantiated with the optimal Nc=3 args, else they carry the generic QDP_NC.