EnzymeAD / Enzyme

High-performance automatic differentiation of LLVM and MLIR.
https://enzyme.mit.edu

differentiate reduction function with CUDA atomicAdd #1288

Open erizmr opened 1 year ago

erizmr commented 1 year ago

Hi, I ran into the issue shown below when differentiating a reduction function that uses atomicAdd. It seems CUDA atomicAdd is not supported by Enzyme AD. I would appreciate any suggestions. Thanks!

in Mode: ReverseModeCombined
cannot handle unknown instruction
  %22 = atomicrmw fadd float* %0, float %21 seq_cst, align 4, !dbg !45
fatal error: error in backend: unknown value

The device function to differentiate:

__device__ void compute_loss(float * loss,
                             float * out_feat,
                             const uint32_t N){
    const uint32_t i = threadIdx.x + blockIdx.x * blockDim.x;
    float local_loss = out_feat[i * 2] * out_feat[i * 2] + out_feat[i * 2 + 1] * out_feat[i * 2 + 1];
    atomicAdd(loss, local_loss / N);
}
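For reference, here is a plain C++ sketch (my own, not from the thread) of the same reduction on the CPU, together with the analytical gradient that reverse-mode AD of the atomicAdd-based kernel should produce. Since loss = Σᵢ (x[2i]² + x[2i+1]²) / N, the derivative with respect to any element is d loss / d x[j] = 2·x[j] / N, which can serve as a correctness check once the Enzyme build works:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// CPU reference for the CUDA kernel above:
// loss = sum_i (x[2i]^2 + x[2i+1]^2) / N, where x has 2*N elements.
float compute_loss_ref(const std::vector<float>& x, unsigned N) {
    float loss = 0.0f;
    for (unsigned i = 0; i < N; ++i)
        loss += (x[2*i] * x[2*i] + x[2*i + 1] * x[2*i + 1]) / N;
    return loss;
}

// Analytical reverse-mode gradient: d loss / d x[j] = 2 * x[j] / N.
std::vector<float> compute_loss_grad(const std::vector<float>& x, unsigned N) {
    std::vector<float> g(x.size());
    for (std::size_t j = 0; j < x.size(); ++j)
        g[j] = 2.0f * x[j] / N;
    return g;
}
```

Comparing Enzyme's output for the shadow of `out_feat` against `compute_loss_grad` is a quick sanity check that the atomicAdd reduction was differentiated correctly.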
wsmoses commented 1 year ago

What version/commit are you on?

We should handle that so I'm a bit surprised but if you get that on the main branch we definitely will fix it!

erizmr commented 1 year ago

Hi @wsmoses, thanks for your reply. Indeed, I was on an outdated version, and I am now trying to build Enzyme on the latest master. However, I hit this error: `error: 'ElementType' is not a member of 'llvm::Attribute::AttrKind'`. Is there a recommended LLVM version for the latest Enzyme build? Thanks.

wsmoses commented 1 year ago

Ah, sorry, I just meant the latest Enzyme, not the latest LLVM. Probably the latest stable LLVM release, so 16? We generally try to keep up with LLVM head (and it compiled as of a few days ago), but someone could push a temporary breaking change that takes us time to learn about and adapt to.

erizmr commented 1 year ago

Hi @wsmoses. Thanks for the suggestion. I have successfully built LLVM (16.0.6) and Enzyme (latest main), but hit the issue below:

~/llvm-project/build/bin/clang++ -mllvm -max-heap-to-stack-size=1000000 -I /usr/local/cuda-11.4/include -I /home/bx2k/.local/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I /home/bx2k/.local/lib/python3.8/site-packages/torch/include -I ./src/ -I . -ffast-math --cuda-path=/usr/local/cuda-11.4 --cuda-gpu-arch=sm_86  -std=c++14 -Xclang -load -Xclang ~/Enzyme/enzyme/build/Enzyme/ClangEnzyme-16.so -Rpass=enzyme -mllvm -enzyme-max-cache -DALLOCATOR -DABI -O3 -DALLOW_AD=1  -mllvm -enzyme-new-cache=1 -mllvm -enzyme-mincut-cache=1 -DSIZE=20 -mllvm -enzyme-phi-restructure=0 -mllvm -enzyme-coalese -mllvm -enzyme-loop-invariant-cache=1 -c src/raymarching.cu -o src/raymarching.o -D_GLIBCXX_USE_CXX11_ABI=0
ptxas fatal   : Unresolved extern function '_Z17__enzyme_autodiffPviPKfPfiS1_S2_iS1_S2_iS1_S2_iS1_S2_iS1_S2_iS1_S2_ijjPKi'
clang-16: error: ptxas command failed with exit code 255 (use -v to see invocation)
clang version 16.0.6 (https://github.com/llvm/llvm-project 7cbf1a2591520c2491aa35339f227775f4d3adf6)
Target: x86_64-unknown-linux-gnu

The compile command line was working fine on Enzyme commit ec75831a8cb0170090c366f8da6e3b2b87a20f6e. Would you mind giving any suggestions? Thanks.

wsmoses commented 1 year ago

For newer LLVMs, the way to load the pass is `-fpass-plugin=/path/to/ClangEnzyme-XX.so`.

This is technically documented here (https://enzyme.mit.edu/getting_started/Faq/#opt-cant-find--enzyme-option), but very hard to find.

Any PR's to our docs (github.com/EnzymeAD/www) are very appreciated!
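To make the contrast concrete, here is a minimal sketch of the two invocation styles (paths, file names, and version numbers are placeholders, not taken from this thread). The first form is the legacy pass manager syntax used in the command line above; the second is the new pass manager syntax that newer LLVMs require:

```shell
# Legacy pass manager (older LLVM releases):
clang++ -O3 -Xclang -load -Xclang /path/to/ClangEnzyme-XX.so input.cpp -o out

# New pass manager (newer LLVM releases, e.g. 16):
clang++ -O3 -fpass-plugin=/path/to/ClangEnzyme-XX.so input.cpp -o out
```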

erizmr commented 1 year ago

> For newer LLVM's the way to load the pass is -fpass-plugin=/path/to/ClangEnzyme-XX.so.
>
> This is technically documented here (https://enzyme.mit.edu/getting_started/Faq/#opt-cant-find--enzyme-option), but very hard to find.
>
> Any PR's to our docs (github.com/EnzymeAD/www) are very appreciated!

Thanks for your reply. With the newer LLVM, I hit some "Unknown command line argument" errors for these arguments: enzyme-max-cache, enzyme-mincut-cache, enzyme-aggressive-aa, enzyme-phi-restructure, and enzyme-loop-invariant-cache. Were these arguments renamed, removed, or hidden? Or is this only an issue with the newer LLVM?

~/llvm-project/build/bin/clang++ -mllvm -max-heap-to-stack-size=1000000 -I /usr/local/cuda-11.4/include -I /home/bx2k/.local/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I /home/bx2k/.local/lib/python3.8/site-packages/torch/include -I ./src/ -I . -ffast-math --cuda-path=/usr/local/cuda-11.4 --cuda-gpu-arch=sm_86  -std=c++14 -fpass-plugin=/home/bx2k/Enzyme/enzyme/build/Enzyme/ClangEnzyme-16.so -Rpass=enzyme -mllvm -enzyme-max-cache -DALLOCATOR -DABI -O3 -DALLOW_AD=1  -c src/raymarching.cu -o src/raymarching.o -D_GLIBCXX_USE_CXX11_ABI=0
clang (LLVM option parsing): Unknown command line argument '-enzyme-max-cache'.  Try: 'clang (LLVM option parsing) --help'
wsmoses commented 1 year ago

Since @tgymnich recently commented in a different GH issue on how to use command-line flags on newer LLVMs, I'll copy the link here: https://github.com/EnzymeAD/Enzyme/issues/1295#issuecomment-1609581638

Separately, since I presume you copied those flags from the SC21 artifact repo: in most cases you shouldn't need to add them. They were used for an ablation study on the impact of those optimizations; the default values should be reasonably performant, and if they aren't, we should fix them.

erizmr commented 1 year ago

Thanks for the reply!

minansys commented 3 months ago

@erizmr Have you been able to resolve this issue? I have the same problem as shown here https://github.com/EnzymeAD/Enzyme/issues/2053.