Open tom91136 opened 1 year ago
If we launch the benchmark with a non-existent kernel WGSIZE, the program produces an invalid result instead of reporting the problem and terminating early:

    miniBUDE:
      compile_commands:
        - "/opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/bin/nvcc -forward-unknown-to-host-compiler -DCUDA -DMEM=MANAGED -DUSE_PPWI="1\\,2\\,4\\,8\\,16\\,32\\,64\\,128" --options-file <OUT>/includes_CUDA.rsp -std=c++17 -forward-unknown-to-host-compiler -arch=sm_61 -use_fast_math -restrict -keep -DNDEBUG -std=c++17 -O3 -march=native -x cu -c <SRC>/main.cpp -o <OUT>/src/main.cpp.o"
      vcs:
        commit: e7339d6cd9b832f0ba59ed73d2bc406e4345d495*
        author: "Tom Lin (tom91136@gmail.com)"
        date: "2023-10-02 15:21:22 +0100"
        subject: "Prevent NVHPC from optimising away task barrier (likely a bug)"
      host_cpu: ~
      time: { epoch_s:1698373309, formatted: "Fri Oct 27 02:21:49 2023 GMT" }
      deck:
        path: "../data/bm1"
        poses: 65536
        proteins: 938
        ligands: 26
        forcefields: 34
      config:
        iterations: 8
        poses: 65536
        ppwi:
          available: [1,2,4,8,16,32,64,128]
          selected: [64]
        wgsize: [512]
      device: { index: 0, name: "NVIDIA TITAN X (Pascal) (12189MB;sm_61)" }
    # Device and kernel cc: sm_61
    # Verification failed for ppwi=64, wgsize=512; difference exceeded tolerance (0.025%)
    # Bad energies (failed/total=58671/65536, showing first 8):
    # index,actual,expected,difference_%
    # 0,0,865.523,100
    # 1,0,25.0715,100
    # 2,0,368.434,100
    # 3,0,14.6651,100
    # 4,0,574.987,100
    # 5,0,707.354,100
    # 6,0,33.947,100
    # 7,0,135.588,100
    # (ppwi=64,wgsize=512,valid=0)
    results:
      - outcome: { valid: false, max_diff_%: 100.000 }
        param: { ppwi: 64, wgsize: 512 }
        raw_iterations: [3.50847,0.00114,0.00047,0.00039,0.00041,0.00038,0.00036,0.00037,0.00034,0.00039]
        context_ms: 0.635100
        sum_ms: 0.003
        avg_ms: 0.000
        min_ms: 0.000
        max_ms: 0.000
        stddev_ms: 0.000
        giga_interactions/s: 4111361.976
        gflop/s: 124067012.898
        gfinst/s: 102784049.389
        energies:
          - 0.00
          - 0.00
          - 0.00
          - 0.00
          - 0.00
          - 0.00
          - 0.00
          - 0.00
    best: { min_ms: 0.00, max_ms: 0.00, sum_ms: 0.00, avg_ms: 0.00, ppwi: 64, wgsize: 512 }

The error message should also include a hint so that the missing WGSIZE can be added. Thanks to @jhdavis8 for discovering this.
Update: it's CUDA's wgsize (which propagates to threads per block) that's failing; PPWI is the one that's defined at compile time.
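A minimal sketch of the kind of early check being asked for, written as plain C++ so it does not depend on the CUDA toolkit. It assumes the failure is a launch whose wgsize exceeds what the compiled kernel supports, which this issue has not confirmed. `checkWgsize` and its `maxThreadsPerBlock` parameter are hypothetical names, not part of miniBUDE; in the CUDA backend the limit could plausibly come from `cudaFuncGetAttributes`, and the same message could be raised when `cudaGetLastError()` reports a failed launch.

```cpp
#include <optional>
#include <sstream>
#include <string>

// Hypothetical helper (not part of miniBUDE): returns an error message if the
// requested wgsize cannot be launched, or std::nullopt if it is acceptable.
// maxThreadsPerBlock is assumed to be the limit for this specific kernel,
// e.g. as reported by cudaFuncGetAttributes in a CUDA build.
std::optional<std::string> checkWgsize(int ppwi, int wgsize, int maxThreadsPerBlock) {
  if (wgsize >= 1 && wgsize <= maxThreadsPerBlock) {
    return std::nullopt;  // launch configuration is within the kernel's limit
  }
  std::ostringstream msg;
  msg << "Cannot launch kernel with ppwi=" << ppwi << ", wgsize=" << wgsize
      << ": this kernel supports 1.." << maxThreadsPerBlock
      << " threads per block; choose a wgsize in that range";
  return msg.str();
}
```

The caller would print the message and exit with a non-zero status before running any timed iterations, so a failed launch would no longer show up as all-zero energies with near-zero timings.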