UoB-HPC / miniBUDE

A BUDE virtual-screening benchmark, in many programming models
Apache License 2.0

V2 fails to prevent invalid wgsizes from launching #29

Open tom91136 opened 1 year ago

tom91136 commented 1 year ago

If we try to launch the benchmark with a non-existent kernel WGSIZE, the program produces an invalid result instead of reporting the problem and terminating early:

miniBUDE:  
compile_commands:
   - "/opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/bin/nvcc -forward-unknown-to-host-compiler -DCUDA -DMEM=MANAGED -DUSE_PPWI="1\\,2\\,4\\,8\\,16\\,32\\,64\\,128" --options-file <OUT>/includes_CUDA.rsp  -std=c++17 -forward-unknown-to-host-compiler -arch=sm_61 -use_fast_math -restrict -keep   -DNDEBUG -std=c++17 -O3 -march=native -x cu -c <SRC>/main.cpp -o <OUT>/src/main.cpp.o"
vcs:
  commit:  e7339d6cd9b832f0ba59ed73d2bc406e4345d495*
  author:  "Tom Lin (tom91136@gmail.com)"
  date:    "2023-10-02 15:21:22 +0100"
  subject: "Prevent NVHPC from optimising away task barrier (likely a bug)"
host_cpu:
  ~
time: { epoch_s:1698373309, formatted: "Fri Oct 27 02:21:49 2023 GMT" }
deck:
  path:         "../data/bm1"
  poses:        65536
  proteins:     938
  ligands:      26
  forcefields:  34
config:
  iterations:   8
  poses:        65536
  ppwi:
    available:  [1,2,4,8,16,32,64,128]
    selected:   [64]
  wgsize:       [512]
device: { index: 0,  name: "NVIDIA TITAN X (Pascal) (12189MB;sm_61)" }
# Device and kernel cc: sm_61
# Verification failed for ppwi=64, wgsize=512; difference exceeded tolerance (0.025%)
# Bad energies (failed/total=58671/65536, showing first 8): 
# index,actual,expected,difference_%
# 0,0,865.523,100
# 1,0,25.0715,100
# 2,0,368.434,100
# 3,0,14.6651,100
# 4,0,574.987,100
# 5,0,707.354,100
# 6,0,33.947,100
# 7,0,135.588,100
# (ppwi=64,wgsize=512,valid=0)
results:
  - outcome:             { valid: false, max_diff_%: 100.000 }
    param:               { ppwi: 64, wgsize: 512 }
    raw_iterations:      [3.50847,0.00114,0.00047,0.00039,0.00041,0.00038,0.00036,0.00037,0.00034,0.00039]
    context_ms:          0.635100
    sum_ms:              0.003
    avg_ms:              0.000
    min_ms:              0.000
    max_ms:              0.000
    stddev_ms:           0.000
    giga_interactions/s: 4111361.976
    gflop/s:             124067012.898
    gfinst/s:            102784049.389
    energies:            
      - 0.00
      - 0.00
      - 0.00
      - 0.00
      - 0.00
      - 0.00
      - 0.00
      - 0.00
best: { min_ms: 0.00, max_ms: 0.00, sum_ms: 0.00, avg_ms: 0.00, ppwi: 64, wgsize: 512 }

We also need to add a hint to the error message so that the missing WGSIZE can be added. Thanks to @jhdavis8 for discovering this.

tom91136 commented 1 year ago

Update: it is CUDA's wgsize (which propagates to threads per block) that is failing; PPWI is the one that is defined at compile time.
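
One possible shape for the early check, as a minimal sketch rather than the project's actual fix: validate the requested wgsize against the kernel's launch limit before running any iterations, and return a message that names the limit. The helper name `check_wgsize` and its parameters are hypothetical. The limit is assumed to be queried from the CUDA runtime, for example the `maxThreadsPerBlock` field that `cudaFuncGetAttributes` fills in for a given kernel; this per-kernel value can be lower than the device-wide maximum. It is passed in as a plain integer here so the sketch stays self-contained.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical helper, not part of miniBUDE.
// Returns an error message when the requested wgsize cannot be launched,
// or std::nullopt when it is acceptable.
// `max_threads_per_block` is assumed to come from the CUDA runtime, e.g. the
// maxThreadsPerBlock field reported by cudaFuncGetAttributes for the kernel.
std::optional<std::string> check_wgsize(int wgsize, int max_threads_per_block) {
  assert(max_threads_per_block > 0 && "limit must be queried before validating");
  if (wgsize <= 0) {
    return "wgsize=" + std::to_string(wgsize) + " must be positive";
  }
  if (wgsize > max_threads_per_block) {
    // Include the limit in the message so the user knows which values are usable.
    return "wgsize=" + std::to_string(wgsize) +
           " exceeds the kernel's limit of " +
           std::to_string(max_threads_per_block) +
           " threads per block; choose a smaller wgsize";
  }
  return std::nullopt;
}
```

The caller would print the message and exit with a non-zero status when a value is returned. Checking `cudaGetLastError()` immediately after the kernel launch is a complementary safeguard, since it catches launch failures that a pre-check does not anticipate.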