argonne-lcf / THAPI

A tracing infrastructure for heterogeneous computing applications.

Handle sync/barrier with external binary #178

Closed: TApplencourt closed this 7 months ago

TApplencourt commented 8 months ago

Based on an idea from @nscottnichols.

This PR adds a new binary (sync_daemon_fs) that is responsible for performing two operations:

The communication is done via signals, so some loops are needed.
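To illustrate the kind of loop that signal-based communication implies, here is a minimal sketch of a request/acknowledge handshake between a parent process and a forked daemon. It is not the code from this PR: the choice of SIGUSR1/SIGUSR2, the function name run_handshake, and the use of fork are all assumptions made for the example (POSIX only).

```python
import os
import signal

# Hypothetical signal assignment, not taken from the PR.
REQ = signal.SIGUSR1  # parent -> daemon: "please perform a sync"
ACK = signal.SIGUSR2  # daemon -> parent: "sync done"


def run_handshake(n_requests):
    """Send n_requests sync requests to a forked daemon and count the acks."""
    # Block both signals before forking so neither side can miss one;
    # the child inherits this mask.
    signal.pthread_sigmask(signal.SIG_BLOCK, {REQ, ACK})
    parent_pid = os.getpid()
    child_pid = os.fork()

    if child_pid == 0:
        # Daemon side: loop, waiting for a request and answering each one.
        for _ in range(n_requests):
            signal.sigwait({REQ})
            # A real daemon would perform the actual synchronization here.
            os.kill(parent_pid, ACK)
        os._exit(0)

    # Parent side: loop, sending a request and waiting for its ack.
    acks = 0
    for _ in range(n_requests):
        os.kill(child_pid, REQ)
        signal.sigwait({ACK})
        acks += 1
    os.waitpid(child_pid, 0)
    return acks
```

Because each request is only sent after the previous acknowledgement has arrived, at most one signal of each kind is pending at any time, which matters since standard POSIX signals do not queue.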

The end goal is to implement another daemon that uses MPI (for faster and more robust sync) behind the same API. Then, at compile time, we can choose which daemon to use (the filesystem-based one or the MPI one).
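For readers unfamiliar with the idea, a filesystem-based barrier can be sketched as follows: each rank drops a marker file into a shared directory and polls until all expected markers are present. This is only an illustration of the general technique, not the logic of sync_daemon_fs; the function name fs_barrier, the marker naming scheme, and the polling parameters are hypothetical.

```python
import os
import time


def fs_barrier(shared_dir, rank, world_size, poll=0.01, timeout=10.0):
    """Block until world_size ranks have created a marker file in shared_dir.

    Hypothetical sketch: marker names and polling strategy are assumptions.
    """
    # Announce arrival by creating an empty marker file for this rank.
    marker = os.path.join(shared_dir, f"rank_{rank}.ready")
    open(marker, "w").close()

    deadline = time.monotonic() + timeout
    while True:
        # Count how many ranks have arrived so far.
        arrived = sum(
            1 for name in os.listdir(shared_dir) if name.endswith(".ready")
        )
        if arrived >= world_size:
            return
        if time.monotonic() > deadline:
            raise TimeoutError(
                f"only {arrived}/{world_size} ranks reached the barrier"
            )
        time.sleep(poll)
```

The polling loop and the dependence on shared-filesystem latency are the kind of costs an MPI-based daemon would avoid by calling a collective such as MPI_Barrier instead.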

nscottnichols commented 8 months ago

Was able to compile and run.

nscottnichols@x1921c1s0b0n0:~/thapi_PR_test/build/ici/bin> mpirun -n 1 ./iprof -- sycl-ls
THAPI: Trace Location /home/nscottnichols/thapi-traces/thapi_aggreg--2024-02-14--21h29m41s
BACKEND_ZE | 1 Hostnames | 1 Processes | 1 Threads | 

                               Name |    Time | Time(%) | Calls |  Average |   Min |    Max |         
              zeDeviceGetSubDevices |  7.70us |  40.51% |    12 | 641.75ns | 137ns | 2.02us |         
                        zeDeviceGet |  7.41us |  38.99% |     2 |   3.71us | 460ns | 6.95us |         
zeDriverGetExtensionFunctionAddress |  2.04us |  10.75% |     2 |   1.02us | 302ns | 1.74us |         
                             zeInit |   873ns |   4.59% |     1 | 873.00ns | 873ns |  873ns |         
              zeDriverGetApiVersion |   592ns |   3.11% |     1 | 592.00ns | 592ns |  592ns |         
                        zeDriverGet |   388ns |   2.04% |     2 | 194.00ns | 156ns |  232ns |         
                              Total | 19.01us | 100.00% |    20 |

mpirun -n 16 ./iprof -- sycl-ls
THAPI: Trace Location /home/nscottnichols/thapi-traces/thapi_aggreg--2024-02-14--21h26m35s
BACKEND_ZE | 2 Hostnames | 16 Processes | 16 Threads | 

                               Name |     Time | Time(%) | Calls |  Average |   Min |      Max |         
              zeDeviceGetSubDevices |   1.55ms |  85.81% |   192 |   8.06us | 134ns | 221.75us |         
                        zeDeviceGet | 187.53us |  10.40% |    32 |   5.86us | 321ns |  15.36us |         
zeDriverGetExtensionFunctionAddress |  33.87us |   1.88% |    32 |   1.06us | 256ns |   3.09us |         
                             zeInit |  13.88us |   0.77% |    16 | 867.44ns | 507ns |   1.22us |         
              zeDriverGetApiVersion |  11.45us |   0.63% |    16 | 715.81ns | 462ns |   1.12us |         
                        zeDriverGet |   9.21us |   0.51% |    32 | 287.91ns | 134ns |    576ns |         
                              Total |   1.80ms | 100.00% |   320 |

Testing on a larger application now.

nscottnichols commented 7 months ago

Works for real applications as well:

mpiexec --no-transfer -n1 ~/thapi_PR_test/build/ici/bin/iprof gpu_tile_compact.sh ./build.d_inl0_hrd0/check.exe 256 256 10

THAPI: Trace Location /home/nscottnichols/thapi-traces/thapi_aggreg--2024-02-14--22h09m55s
BACKEND_ZE | 1 Hostnames | 1 Processes | 1 Threads | 

                               Name |     Time | Time(%) | Calls |  Average |      Min |      Max |         
             zeEventHostSynchronize |  38.63ms |  67.76% |   116 | 333.00us |    168ns |  31.41ms |         
      zeCommandListAppendMemoryCopy |   7.86ms |  13.79% |    55 | 142.90us |  10.02us | 698.14us |         
       zeCommandListCreateImmediate |   3.15ms |   5.53% |     3 |   1.05ms | 149.05us |   2.52ms |         
    zeCommandListAppendLaunchKernel |   2.30ms |   4.04% |    31 |  74.25us |   6.96us |   1.30ms |         
         zeCommandListAppendBarrier |   1.88ms |   3.29% |    62 |  30.29us |   5.75us |   1.39ms |         
                     zeModuleCreate |   1.08ms |   1.90% |     4 | 271.13us | 174.74us | 488.59us |         
        zeContextMakeMemoryResident | 585.61us |   1.03% |    10 |  58.56us |  18.12us | 352.34us |         
                    zeModuleDestroy | 450.95us |   0.79% |     4 | 112.74us |  49.91us | 142.12us |         
                          zeMemFree | 427.16us |   0.75% |     9 |  47.46us |  37.98us |  85.64us |         
                   zeEventHostReset | 300.17us |   0.53% |   146 |   2.06us |   1.06us |   8.82us |         
                   zeMemAllocDevice | 120.63us |   0.21% |    10 |  12.06us |   3.49us |  70.20us |         
                 zeEventPoolDestroy |  70.80us |   0.12% |     1 |  70.80us |  70.80us |  70.80us |         
                     zeKernelCreate |  32.08us |   0.06% |     4 |   8.02us |   5.40us |  10.84us |         
         zeKernelSetGlobalOffsetExp |  28.12us |   0.05% |    31 | 907.23ns |    160ns |   4.37us |         
               zeKernelSetGroupSize |  14.89us |   0.03% |    31 | 480.45ns |    137ns |    838ns |         
               zeCommandListDestroy |  14.35us |   0.03% |     3 |   4.78us |   2.38us |   6.90us |         
                  zeEventPoolCreate |  14.16us |   0.02% |     1 |  14.16us |  14.16us |  14.16us |         
                      zeEventCreate |  13.40us |   0.02% |     2 |   6.70us |    715ns |  12.69us |         
                     zeEventDestroy |   7.12us |   0.01% |     2 |   3.56us |   1.38us |   5.74us |         
                        zeDeviceGet |   5.05us |   0.01% |     2 |   2.52us |    360ns |   4.69us |         
                    zeKernelDestroy |   3.65us |   0.01% |     4 | 912.75ns |    450ns |   2.16us |         
                   zeContextDestroy |   2.72us |   0.00% |     1 |   2.72us |   2.72us |   2.72us |         
                    zeContextCreate |   2.56us |   0.00% |     1 |   2.56us |   2.56us |   2.56us |         
          zeKernelSetIndirectAccess |   2.05us |   0.00% |     4 | 511.50ns |    237ns |    833ns |         
zeDriverGetExtensionFunctionAddress |   1.84us |   0.00% |     2 | 920.00ns |    292ns |   1.55us |         
            zeModuleBuildLogDestroy |   1.79us |   0.00% |     4 | 447.75ns |    173ns |   1.26us |         
              zeDeviceGetSubDevices |   1.56us |   0.00% |     2 | 779.00ns |    189ns |   1.37us |         
                             zeInit |    791ns |   0.00% |     1 | 791.00ns |    791ns |    791ns |         
              zeDriverGetApiVersion |    692ns |   0.00% |     1 | 692.00ns |    692ns |    692ns |         
                        zeDriverGet |    557ns |   0.00% |     2 | 278.50ns |    168ns |    389ns |         
                 zeEventQueryStatus |    459ns |   0.00% |     1 | 459.00ns |    459ns |    459ns |         
                              Total |  57.00ms | 100.00% |   550 |                                          

Device profiling | 1 Hostnames | 1 Processes | 1 Threads | 1 Devices | 1 Subdevices | 

                                                                            Name |     Time | Time(%) | Calls |  Average |     Min |      Max |         
                                              zeCommandListAppendMemoryCopy(M2D) |   5.09ms |  59.31% |    24 | 212.24us |    80ns | 652.80us |         
                                              zeCommandListAppendMemoryCopy(D2M) |   2.28ms |  26.59% |    31 |  73.65us |  1.28us | 208.56us |         
                                                      zeCommandListAppendBarrier | 434.24us |   5.06% |    62 |   7.00us |   400ns |  39.36us |         
main::{lambda(sycl::_V1::handler&)#2}[...]const::{lambda(sycl::_V1::group<1>)#1} | 340.80us |   3.97% |    10 |  34.08us | 32.96us |  40.80us |         
main::{lambda(sycl::_V1::handler&)#4}[...]nst::{lambda(sycl::_V1::nd_item<1>)#1} | 204.00us |   2.38% |    10 |  20.40us | 19.68us |  22.08us |         
main::{lambda(sycl::_V1::handler&)#1}[...]const::{lambda(sycl::_V1::group<1>)#1} | 194.40us |   2.26% |    10 |  19.44us | 17.92us |  23.36us |         
main::{lambda(sycl::_V1::handler&)#3}[...]nst::{lambda(sycl::_V1::nd_item<1>)#1} |  37.76us |   0.44% |     1 |  37.76us | 37.76us |  37.76us |         
                                                                           Total |   8.59ms | 100.00% |   148 |                                         

Explicit memory traffic (BACKEND_ZE) | 1 Hostnames | 1 Processes | 1 Threads | 

                              Name |     Byte | Byte(%) | Calls | Average |     Min |    Max |         
zeCommandListAppendMemoryCopy(M2D) | 125.83MB |  51.49% |    24 |  5.24MB |      8B | 8.39MB |         
zeCommandListAppendMemoryCopy(D2M) |  94.37MB |  38.62% |    31 |  3.04MB |     16B | 8.39MB |         
       zeContextMakeMemoryResident |  24.18MB |   9.90% |    10 |  2.42MB | 65.54kB | 8.39MB |         
                             Total | 244.38MB | 100.00% |    65 |

mpiexec --no-transfer -n 24 -ppn 12 ~/thapi_PR_test/build/ici/bin/iprof gpu_tile_compact.sh ./build.d_inl0_hrd0/check.exe 256 256 10

THAPI: Trace Location /home/nscottnichols/thapi-traces/thapi_aggreg--2024-02-14--22h11m13s
BACKEND_ZE | 2 Hostnames | 24 Processes | 24 Threads | 

                               Name |     Time | Time(%) | Calls |  Average |      Min |      Max |         
             zeEventHostSynchronize | 340.06ms |  30.37% |  2784 | 122.15us |    137ns |  49.20ms |         
      zeCommandListAppendMemoryCopy | 283.57ms |  25.32% |  1320 | 214.82us |   9.82us |  24.76ms |         
    zeCommandListAppendLaunchKernel | 150.40ms |  13.43% |   744 | 202.15us |   6.87us |  24.53ms |         
       zeCommandListCreateImmediate | 129.25ms |  11.54% |    72 |   1.80ms |  57.91us |  10.42ms |         
         zeCommandListAppendBarrier |  67.42ms |   6.02% |  1488 |  45.31us |   5.41us |  25.20ms |         
                   zeMemAllocDevice |  52.21ms |   4.66% |   240 | 217.55us |   2.44us |  18.78ms |         
                     zeModuleCreate |  37.01ms |   3.30% |    96 | 385.52us | 145.30us | 800.71us |         
        zeContextMakeMemoryResident |  17.15ms |   1.53% |   240 |  71.46us |  15.34us | 655.13us |         
                    zeModuleDestroy |  13.15ms |   1.17% |    96 | 136.99us |  48.02us | 223.46us |         
                          zeMemFree |  12.89ms |   1.15% |   216 |  59.68us |  36.85us | 246.48us |         
                   zeEventHostReset |   8.23ms |   0.73% |  3504 |   2.35us |   1.00us | 222.95us |         
                 zeEventPoolDestroy |   3.74ms |   0.33% |    24 | 155.92us |  88.18us | 224.81us |         
                      zeEventCreate |   1.40ms |   0.12% |    48 |  29.11us |    528ns | 126.02us |         
                     zeKernelCreate | 770.74us |   0.07% |    96 |   8.03us |   3.49us |  15.02us |         
         zeKernelSetGlobalOffsetExp | 561.80us |   0.05% |   744 | 755.10ns |    138ns |   6.81us |         
               zeCommandListDestroy | 410.43us |   0.04% |    72 |   5.70us |   2.53us |  13.08us |         
               zeKernelSetGroupSize | 396.58us |   0.04% |   744 | 533.03ns |    133ns |  11.98us |         
                  zeEventPoolCreate | 372.49us |   0.03% |    24 |  15.52us |  12.40us |  25.53us |         
                     zeEventDestroy | 171.65us |   0.02% |    48 |   3.58us |   1.42us |   7.01us |         
                        zeDeviceGet | 136.51us |   0.01% |    48 |   2.84us |    329ns |  13.31us |         
                    zeKernelDestroy |  98.34us |   0.01% |    96 |   1.02us |    326ns |   4.77us |         
                    zeContextCreate |  89.78us |   0.01% |    24 |   3.74us |   3.07us |   4.60us |         
                   zeContextDestroy |  72.86us |   0.01% |    24 |   3.04us |   2.19us |  13.00us |         
            zeModuleBuildLogDestroy |  58.81us |   0.01% |    96 | 612.62ns |    133ns |  11.06us |         
          zeKernelSetIndirectAccess |  50.16us |   0.00% |    96 | 522.49ns |    199ns |   1.20us |         
zeDriverGetExtensionFunctionAddress |  48.04us |   0.00% |    48 |   1.00us |    227ns |   5.31us |         
              zeDeviceGetSubDevices |  44.02us |   0.00% |    48 | 916.98ns |    188ns |   2.11us |         
                             zeInit |  21.82us |   0.00% |    24 | 909.00ns |    372ns |   1.42us |         
                 zeEventQueryStatus |  18.73us |   0.00% |    24 | 780.38ns |    490ns |   1.21us |         
                        zeDriverGet |  16.48us |   0.00% |    48 | 343.44ns |    154ns |    834ns |         
              zeDriverGetApiVersion |  15.39us |   0.00% |    24 | 641.29ns |    312ns |   3.13us |         
                              Total |    1.12s | 100.00% | 13200 |                                          

Device profiling | 2 Hostnames | 24 Processes | 24 Threads | 24 Devices | 1 Subdevices | 

                                                                            Name |     Time | Time(%) | Calls |  Average |     Min |      Max |         
                                              zeCommandListAppendMemoryCopy(M2D) | 123.65ms |  59.48% |   576 | 214.67us |    80ns | 741.28us |         
                                              zeCommandListAppendMemoryCopy(D2M) |  54.82ms |  26.37% |   744 |  73.68us |  1.28us | 240.88us |         
                                                      zeCommandListAppendBarrier |  10.75ms |   5.17% |  1488 |   7.22us |   400ns |  41.76us |         
main::{lambda(sycl::_V1::handler&)#2}[...]const::{lambda(sycl::_V1::group<1>)#1} |   8.11ms |   3.90% |   240 |  33.79us | 32.00us |  40.96us |         
main::{lambda(sycl::_V1::handler&)#4}[...]nst::{lambda(sycl::_V1::nd_item<1>)#1} |   5.06ms |   2.43% |   240 |  21.09us | 19.68us |  42.40us |         
main::{lambda(sycl::_V1::handler&)#1}[...]const::{lambda(sycl::_V1::group<1>)#1} |   4.59ms |   2.21% |   240 |  19.12us | 14.88us |  55.04us |         
main::{lambda(sycl::_V1::handler&)#3}[...]nst::{lambda(sycl::_V1::nd_item<1>)#1} | 926.24us |   0.45% |    24 |  38.59us | 37.60us |  41.76us |         
                                                                           Total | 207.89ms | 100.00% |  3552 |                                         

Explicit memory traffic (BACKEND_ZE) | 2 Hostnames | 24 Processes | 24 Threads | 

                              Name |     Byte | Byte(%) | Calls | Average |     Min |    Max |         
zeCommandListAppendMemoryCopy(M2D) |   3.02GB |  51.49% |   576 |  5.24MB |      8B | 8.39MB |         
zeCommandListAppendMemoryCopy(D2M) |   2.26GB |  38.62% |   744 |  3.04MB |     16B | 8.39MB |         
       zeContextMakeMemoryResident | 580.39MB |   9.90% |   240 |  2.42MB | 65.54kB | 8.39MB |         
                             Total |   5.87GB | 100.00% |  1560 |

TApplencourt commented 7 months ago

Good to go (this will most likely conflict with the other PR, sadly; I will handle it).