bh107 / bohrium

Automatic parallelization of Python/NumPy, C, and C++ codes on Linux and MacOSX
http://www.bh107.org
Apache License 2.0
219 stars 31 forks source link

User kernels are not accounted for in profiling #575

Closed dionhaefner closed 5 years ago

dionhaefner commented 5 years ago

Running the script from #572 with BH_OPENMP_PROF=true gives something like

[OpenMP] Profiling:
Fuse cache hits:                 202/207 (97.5845%)
Codegen cache hits:              0/6 (0%)
Compilation cache hits:          100/107 (93.4579%)
Array contractions:              17/37 (45.9459%)
Outer-fusion ratio:              11/47 (23.4043%)
Malloc cache hits:               1/112 (0.892857%)

Max memory usage:                6422 MB
Syncs to NumPy:                  3
Total Work:                      147938400 operations
Throughput:                      2.43301e+07ops
Work below par-threshold (1000): 0%

Wall clock:                      6.08048s
Total Execution:                 0.970361s
  Pre-fusion:                    0.000152632s
  Fusion:                        0.00046895s
  Codegen:                       0.000547688s
  Compilation:                   0.820653s
  Exec:                          0.129096s
  Copy2dev:                      0s
  Copy2host:                     0s
  Offload:                       0s
  Other:                         0.0194433s
Ext-method:                      0.0283069s

Unaccounted for (wall - total):  5.11011s

Per-kernel Profiling:
  Kernel filename                        Calls         Total time  Max time    Min time
  94091604c79445bc_291c74f0ebb735a3.c             1    5.18e-02s   5.18e-02s   5.18e-02s
  94091604c79445bc_74c00300d3e68155.c             1    2.10e-02s   2.10e-02s   2.10e-02s
  94091604c79445bc_2a1e823edf5c7601.c             1    2.09e-02s   2.09e-02s   2.09e-02s
  94091604c79445bc_437f5925862da730.c             1    1.27e-02s   1.27e-02s   1.27e-02s
  94091604c79445bc_ec89bb1025fbda.c               1    1.18e-02s   1.18e-02s   1.18e-02s
  94091604c79445bc_5eb4209930c8cdeb.c             1    1.09e-02s   1.09e-02s   1.09e-02s

Most time is spent inside the user kernel, but it only shows up as "unaccounted for".

dionhaefner commented 5 years ago

With the recent master, I get something like this:

$ BH_OPENMP_PROF=1 BH_OPENMP_VERBOSE=1 python bh_test.py
Write source "/var/folders/ld/ylk19q555951g5k6t0vhz6740000gp/T/bh_ede7/src/94091604c79445bc_291c74f0ebb735a3.c"
compile command: "/usr/local/opt/llvm/bin/clang -x c -dynamiclib -arch x86_64 -Wno-expansion-to-defined -O3 -Wno-pass-failed -march=native -Werror -fopenmp=libomp -I/Users/dion/.local/share/bohrium/include /var/folders/ld/ylk19q555951g5k6t0vhz6740000gp/T/bh_ede7/src/94091604c79445bc_291c74f0ebb735a3.c -o /var/folders/ld/ylk19q555951g5k6t0vhz6740000gp/T/bh_ede7/obj/94091604c79445bc_291c74f0ebb735a3.so"
Write source "/var/folders/ld/ylk19q555951g5k6t0vhz6740000gp/T/bh_ede7/src/94091604c79445bc_2a1e823edf5c7601.c"
compile command: "/usr/local/opt/llvm/bin/clang -x c -dynamiclib -arch x86_64 -Wno-expansion-to-defined -O3 -Wno-pass-failed -march=native -Werror -fopenmp=libomp -I/Users/dion/.local/share/bohrium/include /var/folders/ld/ylk19q555951g5k6t0vhz6740000gp/T/bh_ede7/src/94091604c79445bc_2a1e823edf5c7601.c -o /var/folders/ld/ylk19q555951g5k6t0vhz6740000gp/T/bh_ede7/obj/94091604c79445bc_2a1e823edf5c7601.so"
Write source "/var/folders/ld/ylk19q555951g5k6t0vhz6740000gp/T/bh_ede7/src/94091604c79445bc_74c00300d3e68155.c"
compile command: "/usr/local/opt/llvm/bin/clang -x c -dynamiclib -arch x86_64 -Wno-expansion-to-defined -O3 -Wno-pass-failed -march=native -Werror -fopenmp=libomp -I/Users/dion/.local/share/bohrium/include /var/folders/ld/ylk19q555951g5k6t0vhz6740000gp/T/bh_ede7/src/94091604c79445bc_74c00300d3e68155.c -o /var/folders/ld/ylk19q555951g5k6t0vhz6740000gp/T/bh_ede7/obj/94091604c79445bc_74c00300d3e68155.so"
Write source "/var/folders/ld/ylk19q555951g5k6t0vhz6740000gp/T/bh_ede7/src/94091604c79445bc_de8967f5baab9a56.c"
compile command: "/usr/local/opt/llvm/bin/clang -x c -dynamiclib -arch x86_64 -Wno-expansion-to-defined -O3 -Wno-pass-failed -march=native -Werror -fopenmp=libomp -I/Users/dion/.local/share/bohrium/include /var/folders/ld/ylk19q555951g5k6t0vhz6740000gp/T/bh_ede7/src/94091604c79445bc_de8967f5baab9a56.c -o /var/folders/ld/ylk19q555951g5k6t0vhz6740000gp/T/bh_ede7/obj/94091604c79445bc_de8967f5baab9a56.so"
Write source "/var/folders/ld/ylk19q555951g5k6t0vhz6740000gp/T/bh_ede7/src/94091604c79445bc_437f5925862da730.c"
compile command: "/usr/local/opt/llvm/bin/clang -x c -dynamiclib -arch x86_64 -Wno-expansion-to-defined -O3 -Wno-pass-failed -march=native -Werror -fopenmp=libomp -I/Users/dion/.local/share/bohrium/include /var/folders/ld/ylk19q555951g5k6t0vhz6740000gp/T/bh_ede7/src/94091604c79445bc_437f5925862da730.c -o /var/folders/ld/ylk19q555951g5k6t0vhz6740000gp/T/bh_ede7/obj/94091604c79445bc_437f5925862da730.so"
Write source "/var/folders/ld/ylk19q555951g5k6t0vhz6740000gp/T/bh_ede7/src/94091604c79445bc_5eb4209930c8cdeb.c"
compile command: "/usr/local/opt/llvm/bin/clang -x c -dynamiclib -arch x86_64 -Wno-expansion-to-defined -O3 -Wno-pass-failed -march=native -Werror -fopenmp=libomp -I/Users/dion/.local/share/bohrium/include /var/folders/ld/ylk19q555951g5k6t0vhz6740000gp/T/bh_ede7/src/94091604c79445bc_5eb4209930c8cdeb.c -o /var/folders/ld/ylk19q555951g5k6t0vhz6740000gp/T/bh_ede7/obj/94091604c79445bc_5eb4209930c8cdeb.so"
Write source "/var/folders/ld/ylk19q555951g5k6t0vhz6740000gp/T/bh_ede7/src/94091604c79445bc_ec89bb1025fbda.c"
compile command: "/usr/local/opt/llvm/bin/clang -x c -dynamiclib -arch x86_64 -Wno-expansion-to-defined -O3 -Wno-pass-failed -march=native -Werror -fopenmp=libomp -I/Users/dion/.local/share/bohrium/include /var/folders/ld/ylk19q555951g5k6t0vhz6740000gp/T/bh_ede7/src/94091604c79445bc_ec89bb1025fbda.c -o /var/folders/ld/ylk19q555951g5k6t0vhz6740000gp/T/bh_ede7/obj/94091604c79445bc_ec89bb1025fbda.so"
Press Enter to quit
[OpenMP] Profiling:
Fuse cache hits:                 202/207 (97.5845%)
Codegen cache hits:              0/6 (0%)
Compilation cache hits:          100/107 (93.4579%)
Array contractions:              17/37 (45.9459%)
Outer-fusion ratio:              11/47 (23.4043%)
Malloc cache hits:               100/112 (89.2857%)

Max memory usage:                548 MB
Syncs to NumPy:                  3
Total Work:                      147938400 operations
Throughput:                      4.22934e+07ops
Work below par-threshold (1000): 0%

Wall clock:                      3.49791s
Total Execution:                 1.43557s
  Pre-fusion:                    0.000214091s
  Fusion:                        0.000527781s
  Codegen:                       0.000578563s
  Compilation:                   1.37932s
  Exec:                          1.58708s
  Copy2dev:                      0s
  Copy2host:                     0s
  Offload:                       0s
  Other:                         -1.53215s
Ext-method:                      0.0248146s

Unaccounted for (wall - total):  2.06233s

Per-kernel Profiling:
  Kernel filename                        Calls         Total time  Max time    Min time
  94091604c79445bc_4d50d89dddbed345.c           101    1.46e+00s   1.89e-02s   1.36e-02s
  94091604c79445bc_291c74f0ebb735a3.c             1    5.27e-02s   5.27e-02s   5.27e-02s
  94091604c79445bc_2a1e823edf5c7601.c             1    1.93e-02s   1.93e-02s   1.93e-02s
  94091604c79445bc_74c00300d3e68155.c             1    1.96e-02s   1.96e-02s   1.96e-02s
  94091604c79445bc_5eb4209930c8cdeb.c             1    1.40e-02s   1.40e-02s   1.40e-02s
  94091604c79445bc_437f5925862da730.c             1    1.17e-02s   1.17e-02s   1.17e-02s
  94091604c79445bc_ec89bb1025fbda.c               1    1.14e-02s   1.14e-02s   1.14e-02s

Observations:

dionhaefner commented 5 years ago

Only outstanding issue is the incorrect "Total Execution" counter. That is because calling a user kernel bypasses EngineCPU::handleExecution which increments the total time counter. I didn't feel comfortable enough to propose a different architecture though.

madsbk commented 5 years ago

Fixed by #591