RRZE-HPC / likwid

Performance monitoring and benchmarking suite
https://hpc.fau.de/research/tools/likwid/
GNU General Public License v3.0
1.64k stars 226 forks source link

[BUG] Perfmon marker file API functions not working #583

Closed OoJJBoO closed 8 months ago

OoJJBoO commented 9 months ago

Describe the bug

As of commit 1a50c17, the perfmon functions provided for working with marker files (perfmon_readMarkerFile, perfmon_getNumberOfRegions, perfmon_getEventsOfRegion, and so on) are effectively broken.

This is because they, as far as I understand, need likwid_markerClose to be called for the marker file to actually be written first, which, since the mentioned commit, also finalizes the performance monitoring facilities. This, in turn, causes perfmon_readMarkerFile to always fail with the error

ERROR - [...] Perfmon module not properly initialized

when trying to read the created marker file afterward because perfmon was already finalized and the needed data does not exist anymore.

To Reproduce

To reproduce, simply run the provided C-internalMarkerAPI example or try calling perfmon_readMarkerFile after likwid_markerClose has written a marker file.

Possible Fix / Workaround

Since it looks to be complicated to remove the perfmon_finalize call from likwid_markerClose, I think another possible way to address this issue would be to add a new function that only writes the marker file without erasing the performance monitoring data. That way, users would be able to write the marker file manually and subsequently read results using the existing API functions.

I already implemented this approach in my local fork and it seems to be working so far after some quick experiments. The likwid_writeMarkerFile function I added is a modified copy of the relevant code in likwid_markerClose with perfmon calls replacing the LikwidResults object created by hashTable_finalize.

Output of C-internalMarkerAPI using the current repository state (commit b611b8e):

gcc -g -fopenmp -DLIKWID_PERFMON -I/usr/local/include -L/usr/local/lib C-internalMarkerAPI.c -o C-internalMarkerAPI -llikwid -lm
LD_LIBRARY_PATH=/usr/local/lib ./C-internalMarkerAPI
calc_flops thread 2 finished measuring group FLOPS_DP.
Got 6 events, runtime 1.339779 s, and call count 1
calc_flops thread 0 finished measuring group FLOPS_DP.
Got 6 events, runtime 1.340276 s, and call count 1
calc_flops thread 1 finished measuring group FLOPS_DP.
Got 6 events, runtime 1.340675 s, and call count 1
detailed event results: 
                       INSTR_RETIRED_ANY:              4500023470.000000
                   CPU_CLK_UNHALTED_CORE:              4513777598.000000
                    CPU_CLK_UNHALTED_REF:              2409655950.000000
FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE:                       0.000000
     FP_ARITH_INST_RETIRED_SCALAR_DOUBLE:              1000000012.000000
FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE:                       0.000000

final result of flops operations: 863999999247053824.000000
entry 1024 of copy_arr: 0.000000

ERROR - [/home/jj/programs/likwid/src/perfmon.c:perfmon_readMarkerFile:3884] Perfmon module not properly initialized
ERROR - [/home/jj/programs/likwid/src/perfmon.c:perfmon_getNumberOfRegions:3571] Perfmon module not properly initialized
Marker API measured -22 regions
ERROR - [/home/jj/programs/likwid/src/perfmon.c:perfmon_getNumberOfRegions:3571] Perfmon module not properly initialized

detailed results follow. Notice that the region "calc_flops"
will not appear, as it was reset after each time it was measured.

thread :          region :      group :   type :                              result name :                   result value 
ERROR - [/home/jj/programs/likwid/src/perfmon.c:perfmon_getNumberOfRegions:3571] Perfmon module not properly initialized

Output of the same example using my likwid_writeMarkerFile function instead of likwid_markerClose:

gcc -g -fopenmp -DLIKWID_PERFMON -I/usr/local/include -L/usr/local/lib C-internalMarkerAPI.c -o C-internalMarkerAPI -llikwid -lm
LD_LIBRARY_PATH=/usr/local/lib ./C-internalMarkerAPI
calc_flops thread 2 finished measuring group FLOPS_DP.
Got 6 events, runtime 1.339037 s, and call count 1
calc_flops thread 1 finished measuring group FLOPS_DP.
Got 6 events, runtime 1.339194 s, and call count 1
calc_flops thread 0 finished measuring group FLOPS_DP.
Got 6 events, runtime 1.342012 s, and call count 1
detailed event results: 
                       INSTR_RETIRED_ANY:              4500017488.000000
                   CPU_CLK_UNHALTED_CORE:              4525421907.000000
                    CPU_CLK_UNHALTED_REF:              2413805025.000000
FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE:                       0.000000
     FP_ARITH_INST_RETIRED_SCALAR_DOUBLE:              1000000012.000000
FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE:                       0.000000

final result of flops operations: 863999999247053824.000000
entry 1024 of copy_arr: 0.000000

Marker API measured 9 regions
Region Total with 6 events and 9 metrics
Region calc_flops with 6 events and 9 metrics
Region copy with 6 events and 9 metrics
Region Total with 6 events and 10 metrics
Region calc_flops with 6 events and 10 metrics
Region copy with 6 events and 10 metrics
Region Total with 3 events and 0 metrics
Region calc_flops with 3 events and 0 metrics
Region copy with 3 events and 0 metrics

detailed results follow. Notice that the region "calc_flops"
will not appear, as it was reset after each time it was measured.

thread :          region :      group :   type :                              result name :                   result value 
     0 :           Total :   FLOPS_DP :  event :                        INSTR_RETIRED_ANY :              3768119000.000000 
     0 :           Total :   FLOPS_DP :  event :                    CPU_CLK_UNHALTED_CORE :              2136400000.000000 
     0 :           Total :   FLOPS_DP :  event :                     CPU_CLK_UNHALTED_REF :              1133674000.000000 
     0 :           Total :   FLOPS_DP :  event : FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE :                       0.000000 
     0 :           Total :   FLOPS_DP :  event :      FP_ARITH_INST_RETIRED_SCALAR_DOUBLE :               200000100.000000 
     0 :           Total :   FLOPS_DP :  event : FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE :                       0.000000 
     0 :           Total :   FLOPS_DP : metric :                      Runtime (RDTSC) [s] :                       0.630845 
     0 :           Total :   FLOPS_DP : metric :                     Runtime unhalted [s] :                       1.186887 
     0 :           Total :   FLOPS_DP : metric :                              Clock [MHz] :                    3392.090854 
     0 :           Total :   FLOPS_DP : metric :                                      CPI :                       0.566967 
     0 :           Total :   FLOPS_DP : metric :                             DP [MFLOP/s] :                     317.035246 
     0 :           Total :   FLOPS_DP : metric :                         AVX DP [MFLOP/s] :                       0.000000 
     0 :           Total :   FLOPS_DP : metric :                         Packed [MUOPS/s] :                       0.000000 
     0 :           Total :   FLOPS_DP : metric :                         Scalar [MUOPS/s] :                     317.035246 
     0 :           Total :   FLOPS_DP : metric :                  Vectorization ratio [%] :                       0.000000 
     0 :      calc_flops :   FLOPS_DP :  event :                        INSTR_RETIRED_ANY :               900003800.000000 
     0 :      calc_flops :   FLOPS_DP :  event :                    CPU_CLK_UNHALTED_CORE :               906811500.000000 
     0 :      calc_flops :   FLOPS_DP :  event :                     CPU_CLK_UNHALTED_REF :               481197300.000000 
     0 :      calc_flops :   FLOPS_DP :  event : FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE :                       0.000000 
     0 :      calc_flops :   FLOPS_DP :  event :      FP_ARITH_INST_RETIRED_SCALAR_DOUBLE :               200000000.000000 
     0 :      calc_flops :   FLOPS_DP :  event : FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE :                       0.000000 
     0 :      calc_flops :   FLOPS_DP : metric :                      Runtime (RDTSC) [s] :                       0.267758 
     0 :      calc_flops :   FLOPS_DP : metric :                     Runtime unhalted [s] :                       0.503783 
     0 :      calc_flops :   FLOPS_DP : metric :                              Clock [MHz] :                    3392.086593 
     0 :      calc_flops :   FLOPS_DP : metric :                                      CPI :                       1.007564 
     0 :      calc_flops :   FLOPS_DP : metric :                             DP [MFLOP/s] :                     746.943135 
     0 :      calc_flops :   FLOPS_DP : metric :                         AVX DP [MFLOP/s] :                       0.000000 
     0 :      calc_flops :   FLOPS_DP : metric :                         Packed [MUOPS/s] :                       0.000000 
     0 :      calc_flops :   FLOPS_DP : metric :                         Scalar [MUOPS/s] :                     746.943135 
     0 :      calc_flops :   FLOPS_DP : metric :                  Vectorization ratio [%] :                       0.000000 
     0 :            copy :   FLOPS_DP :  event :                        INSTR_RETIRED_ANY :              2868104000.000000 
     0 :            copy :   FLOPS_DP :  event :                    CPU_CLK_UNHALTED_CORE :              1229550000.000000 
     0 :            copy :   FLOPS_DP :  event :                     CPU_CLK_UNHALTED_REF :               652455200.000000 
     0 :            copy :   FLOPS_DP :  event : FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE :                       0.000000 
     0 :            copy :   FLOPS_DP :  event :      FP_ARITH_INST_RETIRED_SCALAR_DOUBLE :                      10.000000 
     0 :            copy :   FLOPS_DP :  event : FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE :                       0.000000 
     0 :            copy :   FLOPS_DP : metric :                      Runtime (RDTSC) [s] :                       0.362882 
     0 :            copy :   FLOPS_DP : metric :                     Runtime unhalted [s] :                       0.683082 
     0 :            copy :   FLOPS_DP : metric :                              Clock [MHz] :                    3392.099559 
     0 :            copy :   FLOPS_DP : metric :                                      CPI :                       0.428698 
     0 :            copy :   FLOPS_DP : metric :                             DP [MFLOP/s] :                       0.000028 
     0 :            copy :   FLOPS_DP : metric :                         AVX DP [MFLOP/s] :                       0.000000 
     0 :            copy :   FLOPS_DP : metric :                         Packed [MUOPS/s] :                       0.000000 
     0 :            copy :   FLOPS_DP : metric :                         Scalar [MUOPS/s] :                       0.000028 
     0 :            copy :   FLOPS_DP : metric :                  Vectorization ratio [%] :                       0.000000 
     0 :           Total :         L2 :  event :                        INSTR_RETIRED_ANY :              3768120000.000000 
     0 :           Total :         L2 :  event :                    CPU_CLK_UNHALTED_CORE :              2141044000.000000 
     0 :           Total :         L2 :  event :                     CPU_CLK_UNHALTED_REF :              1136136000.000000 
     0 :           Total :         L2 :  event :                          L1D_REPLACEMENT :                26614750.000000 
     0 :           Total :         L2 :  event :                              L1D_M_EVICT :                   59827.000000 
     0 :           Total :         L2 :  event :                    ICACHE_64B_IFTAG_MISS :                    3593.000000 
     0 :           Total :         L2 : metric :                      Runtime (RDTSC) [s] :                       0.631935 
     0 :           Total :         L2 : metric :                     Runtime unhalted [s] :                       1.189467 
     0 :           Total :         L2 : metric :                              Clock [MHz] :                    3392.097793 
     0 :           Total :         L2 : metric :                                      CPI :                       0.568200 
     0 :           Total :         L2 : metric :            L2D load bandwidth [MBytes/s] :                    2695.441351 
     0 :           Total :         L2 : metric :            L2D load data volume [GBytes] :                       1.703344 
     0 :           Total :         L2 : metric :           L2D evict bandwidth [MBytes/s] :                       6.059053 
     0 :           Total :         L2 : metric :           L2D evict data volume [GBytes] :                       0.003829 
     0 :           Total :         L2 : metric :                  L2 bandwidth [MBytes/s] :                    2701.864290 
     0 :           Total :         L2 : metric :                  L2 data volume [GBytes] :                       1.707403 
     0 :      calc_flops :         L2 :  event :                        INSTR_RETIRED_ANY :               900003800.000000 
     0 :      calc_flops :         L2 :  event :                    CPU_CLK_UNHALTED_CORE :               904647200.000000 
     0 :      calc_flops :         L2 :  event :                     CPU_CLK_UNHALTED_REF :               480046600.000000 
     0 :      calc_flops :         L2 :  event :                          L1D_REPLACEMENT :                     215.000000 
     0 :      calc_flops :         L2 :  event :                              L1D_M_EVICT :                      57.000000 
     0 :      calc_flops :         L2 :  event :                    ICACHE_64B_IFTAG_MISS :                    1190.000000 
     0 :      calc_flops :         L2 : metric :                      Runtime (RDTSC) [s] :                       0.266924 
     0 :      calc_flops :         L2 : metric :                     Runtime unhalted [s] :                       0.502581 
     0 :      calc_flops :         L2 : metric :                              Clock [MHz] :                    3392.102276 
     0 :      calc_flops :         L2 : metric :                                      CPI :                       1.005159 
     0 :      calc_flops :         L2 : metric :            L2D load bandwidth [MBytes/s] :                       0.051550 
     0 :      calc_flops :         L2 : metric :            L2D load data volume [GBytes] :                       0.000014 
     0 :      calc_flops :         L2 : metric :           L2D evict bandwidth [MBytes/s] :                       0.013667 
     0 :      calc_flops :         L2 : metric :           L2D evict data volume [GBytes] :                       0.000004 
     0 :      calc_flops :         L2 : metric :                  L2 bandwidth [MBytes/s] :                       0.350541 
     0 :      calc_flops :         L2 : metric :                  L2 data volume [GBytes] :                       0.000094 
     0 :            copy :         L2 :  event :                        INSTR_RETIRED_ANY :              2868104000.000000 
     0 :            copy :         L2 :  event :                    CPU_CLK_UNHALTED_CORE :              1236352000.000000 
     0 :            copy :         L2 :  event :                     CPU_CLK_UNHALTED_REF :               656065700.000000 
     0 :            copy :         L2 :  event :                          L1D_REPLACEMENT :                26614110.000000 
     0 :            copy :         L2 :  event :                              L1D_M_EVICT :                   59611.000000 
     0 :            copy :         L2 :  event :                    ICACHE_64B_IFTAG_MISS :                    1081.000000 
     0 :            copy :         L2 : metric :                      Runtime (RDTSC) [s] :                       0.364799 
     0 :            copy :         L2 : metric :                     Runtime unhalted [s] :                       0.686861 
     0 :            copy :         L2 : metric :                              Clock [MHz] :                    3392.094135 
     0 :            copy :         L2 : metric :                                      CPI :                       0.431069 
     0 :            copy :         L2 : metric :            L2D load bandwidth [MBytes/s] :                    4669.153625 
     0 :            copy :         L2 : metric :            L2D load data volume [GBytes] :                       1.703303 
     0 :            copy :         L2 : metric :           L2D evict bandwidth [MBytes/s] :                      10.458096 
     0 :            copy :         L2 : metric :           L2D evict data volume [GBytes] :                       0.003815 
     0 :            copy :         L2 : metric :                  L2 bandwidth [MBytes/s] :                    4679.801370 
     0 :            copy :         L2 : metric :                  L2 data volume [GBytes] :                       1.707187 
     0 :           Total :     Custom :  event :                        INSTR_RETIRED_ANY :              3768118000.000000 
     0 :           Total :     Custom :  event :                    CPU_CLK_UNHALTED_CORE :              2132273000.000000 
     0 :           Total :     Custom :  event :                     CPU_CLK_UNHALTED_REF :              1131480000.000000 
     0 :      calc_flops :     Custom :  event :                        INSTR_RETIRED_ANY :               900003100.000000 
     0 :      calc_flops :     Custom :  event :                    CPU_CLK_UNHALTED_CORE :               903432100.000000 
     0 :      calc_flops :     Custom :  event :                     CPU_CLK_UNHALTED_REF :               479401600.000000 
     0 :            copy :     Custom :  event :                        INSTR_RETIRED_ANY :              2868103000.000000 
     0 :            copy :     Custom :  event :                    CPU_CLK_UNHALTED_CORE :              1228800000.000000 
     0 :            copy :     Custom :  event :                     CPU_CLK_UNHALTED_REF :               652056800.000000 

The results seem to be equivalent to the ones I got when simply removing the perfmon finalization from likwid_markerClose, so my implementation probably does what it's supposed to :)

Do you think this could be an acceptable solution @TomTheBear?

Edit: To make comparing a bit easier, I created a pull reques with the described additions: #584

TomTheBear commented 9 months ago

Hi, many thanks for finding the issue. It is probably there for a long time already because the perfmon module is finalized in likwid_markerClose since the beginning.

Your approach seems reasonable with the only downside that the file will be written twice, once in the new likwid_writeMarkerFile and again in the final LIKWID_MARKER_CLOSE.

The biggest issue is that it requires changes in likwid.h. My approach is to change the main likwid.h only in major releases and I just released 5.3. Maybe I can make an exception because it just adds a function and does not change the behavior of the already published functions.

While reading over C-internalMarkerAPI I saw some typos. Since you already work on that file, would it be possible for you to fix the typos?

OoJJBoO commented 9 months ago

Sure, I'll look into it.

OoJJBoO commented 9 months ago

I just updated the comments in C-internalMarkerAPI.c, feel free to let me know if I missed something @TomTheBear. I also adjusted the description regarding reading marker files since parts of it looked to be outdated.

TomTheBear commented 8 months ago

I merged your PR #584 and started with similar functions for the GPU-related Marker APIs #586 . I think we can close the issue.

OoJJBoO commented 8 months ago

Yes, thank you!