TApplencourt closed 7 months ago
Was able to compile and run.
nscottnichols@x1921c1s0b0n0:~/thapi_PR_test/build/ici/bin> mpirun -n 1 ./iprof -- sycl-ls
THAPI: Trace Location /home/nscottnichols/thapi-traces/thapi_aggreg--2024-02-14--21h29m41s
BACKEND_ZE | 1 Hostnames | 1 Processes | 1 Threads |
Name | Time | Time(%) | Calls | Average | Min | Max |
zeDeviceGetSubDevices | 7.70us | 40.51% | 12 | 641.75ns | 137ns | 2.02us |
zeDeviceGet | 7.41us | 38.99% | 2 | 3.71us | 460ns | 6.95us |
zeDriverGetExtensionFunctionAddress | 2.04us | 10.75% | 2 | 1.02us | 302ns | 1.74us |
zeInit | 873ns | 4.59% | 1 | 873.00ns | 873ns | 873ns |
zeDriverGetApiVersion | 592ns | 3.11% | 1 | 592.00ns | 592ns | 592ns |
zeDriverGet | 388ns | 2.04% | 2 | 194.00ns | 156ns | 232ns |
Total | 19.01us | 100.00% | 20 |
mpirun -n 16 ./iprof -- sycl-ls
THAPI: Trace Location /home/nscottnichols/thapi-traces/thapi_aggreg--2024-02-14--21h26m35s
BACKEND_ZE | 2 Hostnames | 16 Processes | 16 Threads |
Name | Time | Time(%) | Calls | Average | Min | Max |
zeDeviceGetSubDevices | 1.55ms | 85.81% | 192 | 8.06us | 134ns | 221.75us |
zeDeviceGet | 187.53us | 10.40% | 32 | 5.86us | 321ns | 15.36us |
zeDriverGetExtensionFunctionAddress | 33.87us | 1.88% | 32 | 1.06us | 256ns | 3.09us |
zeInit | 13.88us | 0.77% | 16 | 867.44ns | 507ns | 1.22us |
zeDriverGetApiVersion | 11.45us | 0.63% | 16 | 715.81ns | 462ns | 1.12us |
zeDriverGet | 9.21us | 0.51% | 32 | 287.91ns | 134ns | 576ns |
Total | 1.80ms | 100.00% | 320 |
Testing on a larger application now.
Works for real applications as well:
mpiexec --no-transfer -n1 ~/thapi_PR_test/build/ici/bin/iprof gpu_tile_compact.sh ./build.d_inl0_hrd0/check.exe 256 256 10
THAPI: Trace Location /home/nscottnichols/thapi-traces/thapi_aggreg--2024-02-14--22h09m55s
BACKEND_ZE | 1 Hostnames | 1 Processes | 1 Threads |
Name | Time | Time(%) | Calls | Average | Min | Max |
zeEventHostSynchronize | 38.63ms | 67.76% | 116 | 333.00us | 168ns | 31.41ms |
zeCommandListAppendMemoryCopy | 7.86ms | 13.79% | 55 | 142.90us | 10.02us | 698.14us |
zeCommandListCreateImmediate | 3.15ms | 5.53% | 3 | 1.05ms | 149.05us | 2.52ms |
zeCommandListAppendLaunchKernel | 2.30ms | 4.04% | 31 | 74.25us | 6.96us | 1.30ms |
zeCommandListAppendBarrier | 1.88ms | 3.29% | 62 | 30.29us | 5.75us | 1.39ms |
zeModuleCreate | 1.08ms | 1.90% | 4 | 271.13us | 174.74us | 488.59us |
zeContextMakeMemoryResident | 585.61us | 1.03% | 10 | 58.56us | 18.12us | 352.34us |
zeModuleDestroy | 450.95us | 0.79% | 4 | 112.74us | 49.91us | 142.12us |
zeMemFree | 427.16us | 0.75% | 9 | 47.46us | 37.98us | 85.64us |
zeEventHostReset | 300.17us | 0.53% | 146 | 2.06us | 1.06us | 8.82us |
zeMemAllocDevice | 120.63us | 0.21% | 10 | 12.06us | 3.49us | 70.20us |
zeEventPoolDestroy | 70.80us | 0.12% | 1 | 70.80us | 70.80us | 70.80us |
zeKernelCreate | 32.08us | 0.06% | 4 | 8.02us | 5.40us | 10.84us |
zeKernelSetGlobalOffsetExp | 28.12us | 0.05% | 31 | 907.23ns | 160ns | 4.37us |
zeKernelSetGroupSize | 14.89us | 0.03% | 31 | 480.45ns | 137ns | 838ns |
zeCommandListDestroy | 14.35us | 0.03% | 3 | 4.78us | 2.38us | 6.90us |
zeEventPoolCreate | 14.16us | 0.02% | 1 | 14.16us | 14.16us | 14.16us |
zeEventCreate | 13.40us | 0.02% | 2 | 6.70us | 715ns | 12.69us |
zeEventDestroy | 7.12us | 0.01% | 2 | 3.56us | 1.38us | 5.74us |
zeDeviceGet | 5.05us | 0.01% | 2 | 2.52us | 360ns | 4.69us |
zeKernelDestroy | 3.65us | 0.01% | 4 | 912.75ns | 450ns | 2.16us |
zeContextDestroy | 2.72us | 0.00% | 1 | 2.72us | 2.72us | 2.72us |
zeContextCreate | 2.56us | 0.00% | 1 | 2.56us | 2.56us | 2.56us |
zeKernelSetIndirectAccess | 2.05us | 0.00% | 4 | 511.50ns | 237ns | 833ns |
zeDriverGetExtensionFunctionAddress | 1.84us | 0.00% | 2 | 920.00ns | 292ns | 1.55us |
zeModuleBuildLogDestroy | 1.79us | 0.00% | 4 | 447.75ns | 173ns | 1.26us |
zeDeviceGetSubDevices | 1.56us | 0.00% | 2 | 779.00ns | 189ns | 1.37us |
zeInit | 791ns | 0.00% | 1 | 791.00ns | 791ns | 791ns |
zeDriverGetApiVersion | 692ns | 0.00% | 1 | 692.00ns | 692ns | 692ns |
zeDriverGet | 557ns | 0.00% | 2 | 278.50ns | 168ns | 389ns |
zeEventQueryStatus | 459ns | 0.00% | 1 | 459.00ns | 459ns | 459ns |
Total | 57.00ms | 100.00% | 550 |
Device profiling | 1 Hostnames | 1 Processes | 1 Threads | 1 Devices | 1 Subdevices |
Name | Time | Time(%) | Calls | Average | Min | Max |
zeCommandListAppendMemoryCopy(M2D) | 5.09ms | 59.31% | 24 | 212.24us | 80ns | 652.80us |
zeCommandListAppendMemoryCopy(D2M) | 2.28ms | 26.59% | 31 | 73.65us | 1.28us | 208.56us |
zeCommandListAppendBarrier | 434.24us | 5.06% | 62 | 7.00us | 400ns | 39.36us |
main::{lambda(sycl::_V1::handler&)#2}[...]const::{lambda(sycl::_V1::group<1>)#1} | 340.80us | 3.97% | 10 | 34.08us | 32.96us | 40.80us |
main::{lambda(sycl::_V1::handler&)#4}[...]nst::{lambda(sycl::_V1::nd_item<1>)#1} | 204.00us | 2.38% | 10 | 20.40us | 19.68us | 22.08us |
main::{lambda(sycl::_V1::handler&)#1}[...]const::{lambda(sycl::_V1::group<1>)#1} | 194.40us | 2.26% | 10 | 19.44us | 17.92us | 23.36us |
main::{lambda(sycl::_V1::handler&)#3}[...]nst::{lambda(sycl::_V1::nd_item<1>)#1} | 37.76us | 0.44% | 1 | 37.76us | 37.76us | 37.76us |
Total | 8.59ms | 100.00% | 148 |
Explicit memory traffic (BACKEND_ZE) | 1 Hostnames | 1 Processes | 1 Threads |
Name | Byte | Byte(%) | Calls | Average | Min | Max |
zeCommandListAppendMemoryCopy(M2D) | 125.83MB | 51.49% | 24 | 5.24MB | 8B | 8.39MB |
zeCommandListAppendMemoryCopy(D2M) | 94.37MB | 38.62% | 31 | 3.04MB | 16B | 8.39MB |
zeContextMakeMemoryResident | 24.18MB | 9.90% | 10 | 2.42MB | 65.54kB | 8.39MB |
Total | 244.38MB | 100.00% | 65 |
mpiexec --no-transfer -n 24 -ppn 12 ~/thapi_PR_test/build/ici/bin/iprof gpu_tile_compact.sh ./build.d_inl0_hrd0/check.exe 256 256 10
THAPI: Trace Location /home/nscottnichols/thapi-traces/thapi_aggreg--2024-02-14--22h11m13s
BACKEND_ZE | 2 Hostnames | 24 Processes | 24 Threads |
Name | Time | Time(%) | Calls | Average | Min | Max |
zeEventHostSynchronize | 340.06ms | 30.37% | 2784 | 122.15us | 137ns | 49.20ms |
zeCommandListAppendMemoryCopy | 283.57ms | 25.32% | 1320 | 214.82us | 9.82us | 24.76ms |
zeCommandListAppendLaunchKernel | 150.40ms | 13.43% | 744 | 202.15us | 6.87us | 24.53ms |
zeCommandListCreateImmediate | 129.25ms | 11.54% | 72 | 1.80ms | 57.91us | 10.42ms |
zeCommandListAppendBarrier | 67.42ms | 6.02% | 1488 | 45.31us | 5.41us | 25.20ms |
zeMemAllocDevice | 52.21ms | 4.66% | 240 | 217.55us | 2.44us | 18.78ms |
zeModuleCreate | 37.01ms | 3.30% | 96 | 385.52us | 145.30us | 800.71us |
zeContextMakeMemoryResident | 17.15ms | 1.53% | 240 | 71.46us | 15.34us | 655.13us |
zeModuleDestroy | 13.15ms | 1.17% | 96 | 136.99us | 48.02us | 223.46us |
zeMemFree | 12.89ms | 1.15% | 216 | 59.68us | 36.85us | 246.48us |
zeEventHostReset | 8.23ms | 0.73% | 3504 | 2.35us | 1.00us | 222.95us |
zeEventPoolDestroy | 3.74ms | 0.33% | 24 | 155.92us | 88.18us | 224.81us |
zeEventCreate | 1.40ms | 0.12% | 48 | 29.11us | 528ns | 126.02us |
zeKernelCreate | 770.74us | 0.07% | 96 | 8.03us | 3.49us | 15.02us |
zeKernelSetGlobalOffsetExp | 561.80us | 0.05% | 744 | 755.10ns | 138ns | 6.81us |
zeCommandListDestroy | 410.43us | 0.04% | 72 | 5.70us | 2.53us | 13.08us |
zeKernelSetGroupSize | 396.58us | 0.04% | 744 | 533.03ns | 133ns | 11.98us |
zeEventPoolCreate | 372.49us | 0.03% | 24 | 15.52us | 12.40us | 25.53us |
zeEventDestroy | 171.65us | 0.02% | 48 | 3.58us | 1.42us | 7.01us |
zeDeviceGet | 136.51us | 0.01% | 48 | 2.84us | 329ns | 13.31us |
zeKernelDestroy | 98.34us | 0.01% | 96 | 1.02us | 326ns | 4.77us |
zeContextCreate | 89.78us | 0.01% | 24 | 3.74us | 3.07us | 4.60us |
zeContextDestroy | 72.86us | 0.01% | 24 | 3.04us | 2.19us | 13.00us |
zeModuleBuildLogDestroy | 58.81us | 0.01% | 96 | 612.62ns | 133ns | 11.06us |
zeKernelSetIndirectAccess | 50.16us | 0.00% | 96 | 522.49ns | 199ns | 1.20us |
zeDriverGetExtensionFunctionAddress | 48.04us | 0.00% | 48 | 1.00us | 227ns | 5.31us |
zeDeviceGetSubDevices | 44.02us | 0.00% | 48 | 916.98ns | 188ns | 2.11us |
zeInit | 21.82us | 0.00% | 24 | 909.00ns | 372ns | 1.42us |
zeEventQueryStatus | 18.73us | 0.00% | 24 | 780.38ns | 490ns | 1.21us |
zeDriverGet | 16.48us | 0.00% | 48 | 343.44ns | 154ns | 834ns |
zeDriverGetApiVersion | 15.39us | 0.00% | 24 | 641.29ns | 312ns | 3.13us |
Total | 1.12s | 100.00% | 13200 |
Device profiling | 2 Hostnames | 24 Processes | 24 Threads | 24 Devices | 1 Subdevices |
Name | Time | Time(%) | Calls | Average | Min | Max |
zeCommandListAppendMemoryCopy(M2D) | 123.65ms | 59.48% | 576 | 214.67us | 80ns | 741.28us |
zeCommandListAppendMemoryCopy(D2M) | 54.82ms | 26.37% | 744 | 73.68us | 1.28us | 240.88us |
zeCommandListAppendBarrier | 10.75ms | 5.17% | 1488 | 7.22us | 400ns | 41.76us |
main::{lambda(sycl::_V1::handler&)#2}[...]const::{lambda(sycl::_V1::group<1>)#1} | 8.11ms | 3.90% | 240 | 33.79us | 32.00us | 40.96us |
main::{lambda(sycl::_V1::handler&)#4}[...]nst::{lambda(sycl::_V1::nd_item<1>)#1} | 5.06ms | 2.43% | 240 | 21.09us | 19.68us | 42.40us |
main::{lambda(sycl::_V1::handler&)#1}[...]const::{lambda(sycl::_V1::group<1>)#1} | 4.59ms | 2.21% | 240 | 19.12us | 14.88us | 55.04us |
main::{lambda(sycl::_V1::handler&)#3}[...]nst::{lambda(sycl::_V1::nd_item<1>)#1} | 926.24us | 0.45% | 24 | 38.59us | 37.60us | 41.76us |
Total | 207.89ms | 100.00% | 3552 |
Explicit memory traffic (BACKEND_ZE) | 2 Hostnames | 24 Processes | 24 Threads |
Name | Byte | Byte(%) | Calls | Average | Min | Max |
zeCommandListAppendMemoryCopy(M2D) | 3.02GB | 51.49% | 576 | 5.24MB | 8B | 8.39MB |
zeCommandListAppendMemoryCopy(D2M) | 2.26GB | 38.62% | 744 | 3.04MB | 16B | 8.39MB |
zeContextMakeMemoryResident | 580.39MB | 9.90% | 240 | 2.42MB | 65.54kB | 8.39MB |
Total | 5.87GB | 100.00% | 1560 |
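A quick sanity check on the aggregation, as a rough sketch: if every rank does the same work, the aggregated call counts should be the single-process counts multiplied by the number of processes. The numbers below are copied from the "Total" rows of the tables above; the helper name `scales_linearly` is made up for this illustration and is not part of iprof.

```python
# Call counts copied from the "Total" rows of the iprof tables above:
# (label, calls with 1 process, calls in the multi-process run, process count)
RUNS = [
    ("sycl-ls, BACKEND_ZE", 20, 320, 16),
    ("check.exe, BACKEND_ZE", 550, 13200, 24),
    ("check.exe, device profiling", 148, 3552, 24),
    ("check.exe, explicit memory traffic", 65, 1560, 24),
]


def scales_linearly(single, multi, nprocs):
    """Hypothetical helper: True if the aggregated count equals
    nprocs times the single-process count."""
    return multi == single * nprocs


# Evaluate every run once and keep the verdicts for reporting.
results = {label: scales_linearly(s, m, n) for label, s, m, n in RUNS}

for label, ok in results.items():
    print(f"{label}: {'consistent' if ok else 'MISMATCH'}")
```

This only checks call counts, not timings; times are expected to vary between ranks and runs.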
Good to go (this will mostly conflict with the other PR, sadly; I will handle it).
Based on an idea from @nscottnichols.
This PR adds a new binary (sync_daemon_fs) that is responsible for performing 2 operations:
The communication is signal-based, so need to for some loop.
The end goal is to implement another daemon that uses MPI (for faster and more robust sync) through the same API. Then, at compile time, we can choose which daemon to use (the filesystem-based one or the MPI one).
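To illustrate what signal-based communication with a daemon can look like, here is a minimal sketch. It is not the code from this PR. The choice of SIGUSR1/SIGUSR2, the function names `daemon_loop` and `run_demo`, and the placeholder "operation" are all assumptions made for the example, and it only runs on Unix-like systems. In this sketch both sides block in `sigwait` until the other side signals, instead of polling.

```python
import os
import signal

REQUEST = signal.SIGUSR1  # assumed: requester -> daemon, "run the operation"
DONE = signal.SIGUSR2     # assumed: daemon -> requester, "operation finished"


def daemon_loop(parent_pid, n_requests):
    """Hypothetical daemon: sleep until a request arrives, act, acknowledge."""
    for _ in range(n_requests):
        signal.sigwait({REQUEST})  # blocks in the kernel until REQUEST is pending
        # ... the real synchronization work would happen here ...
        os.kill(parent_pid, DONE)
    os._exit(0)  # leave the child without running the parent's remaining code


def run_demo(n_requests=2):
    """Fork a daemon, send it n_requests requests, and count acknowledgements."""
    # Block both signals before forking so that they stay pending for sigwait
    # instead of triggering their default action (process termination).
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {REQUEST, DONE})
    try:
        parent_pid = os.getpid()
        child = os.fork()  # the child inherits the blocked-signal mask
        if child == 0:
            daemon_loop(parent_pid, n_requests)
        acks = 0
        for _ in range(n_requests):
            os.kill(child, REQUEST)
            signal.sigwait({DONE})  # wait for the daemon's acknowledgement
            acks += 1
        _, status = os.waitpid(child, 0)
        return acks, os.WEXITSTATUS(status)
    finally:
        # Restore the signal mask the caller had before the demo.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)


if __name__ == "__main__":
    acks, exit_code = run_demo()
    print(f"acknowledgements: {acks}, daemon exit code: {exit_code}")
```

Because each request is acknowledged before the next one is sent, at most one signal of each kind is pending at a time, so the lack of queuing for standard signals is not an issue in this sketch. A daemon built on MPI could expose the same request/acknowledge interface to the caller while doing the actual synchronization differently.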