Consolidation of timing/statistics/profiling

philipportner commented 1 month ago

The current state of DAPHNE compilation and execution "feedback" flags is a bit convoluted. I feel like we should decide on what actually is useful output, and which commands we want to keep / merge / remove / improve. Obviously, some may sound similar but provide completely different output, e.g., --statistics and --enable-profiling. I'm not arguing to remove all but one but to make it clearer and start an effort to improve usability and the functionality itself. The following exist:

--timing Adds high-level stats about startup, compilation, and execution time.

Use case: Provides the user with an high-level overview of time spent.

Example output:

{"startup_seconds": 0.0207349, "parsing_seconds": 0.0469451, "compilation_seconds": 0.199888, "execution_seconds": 4.61694, "total_seconds": 4.88451}

--statistics Enables kernel dispatch and script operator level runtime profiling.

Use case: The user wants to know how long each operator in the script took and what were the heavy hitters that one may want to optimize first.

Example output:

[info]: DAPHNE operator execution runtime statistics.
[info]: #   Operator Name                                                                                  Time(s)  Count        Avg(s)  File:Line:Column
[info]: 0   _transpose__DenseMatrix_double__DenseMatrix_double                                             0.60     972          0.00    /daphne/scripts/algorithms/decisionTree_.daph:157:13
[info]: 1   _matMul__DenseMatrix_double__DenseMatrix_double__DenseMatrix_double__bool__bool                0.56     972          0.00    /daphne/scripts/algorithms/decisionTree_.daph:157:18
[info]: 2   _ewMul__DenseMatrix_double__DenseMatrix_double__DenseMatrix_double                             0.42     972          0.00    /daphne/scripts/algorithms/decisionTree_.daph:157:23
[info]: 3   _ewMul__DenseMatrix_double__DenseMatrix_double__DenseMatrix_double                             0.35     972          0.00    /daphne/scripts/algorithms/decisionTree_.daph:158:24
[info]: 4   _ewNeq__DenseMatrix_double__DenseMatrix_double__double                                         0.27     972          0.00    /daphne/scripts/algorithms/decisionTree_.daph:157:28
[info]: 5   _sumRow__DenseMatrix_double__DenseMatrix_double                                                0.21     972          0.00    /daphne/scripts/algorithms/decisionTree_.daph:167:42
[info]: 6   _sumRow__DenseMatrix_double__DenseMatrix_double                                                0.21     972          0.00    /daphne/scripts/algorithms/decisionTree_.daph:167:13
[info]: 7   _sumRow__DenseMatrix_double__DenseMatrix_double                                                0.21     972          0.00    /daphne/scripts/algorithms/decisionTree_.daph:95:17
[info]: 8   _sumRow__DenseMatrix_double__DenseMatrix_double                                                0.21     972          0.00    /daphne/scripts/algorithms/decisionTree_.daph:163:11
[info]: 9   _sumRow__DenseMatrix_double__DenseMatrix_double                                                0.20     972          0.00    /daphne/scripts/algorithms/decisionTree_.daph:95:17
[info]: 10  _sumRow__DenseMatrix_double__DenseMatrix_double                                                0.20     972          0.00    /daphne/scripts/algorithms/decisionTree_.daph:162:11

--enable-profiling

PAPI based profiling, currently only with regions starting and ending at the func::FuncOp scope.

Use case: Possible to get lots of interesting metrics, usefulness is somewhat limited because each PAPI_REGION is only inserted at func::FuncOp scope which makes it hard to profile specific operations. A start/stopProfiling() built-in is in the works and helps with that but is not directly linked with the CLI flag.

Example output:

{
  "papi_version":"7.0.1.0",
  "cpu_info":"AMD EPYC 7443P 24-Core Processor",
  "max_cpu_rate_mhz":"2850",
  "min_cpu_rate_mhz":"1500",
  "event_definitions":{
    "PAPI_TOT_CYC":{
      "component":"perf_event",
      "type":"delta"
    }
  },
  "threads":{
    "0":{
      "regions":{
        "0":{
          "name":"fixme",
          "parent_region_id":"-1",
          "cycles":"11864031101",
          "real_time_nsec":"4170819252",
          "PAPI_TOT_CYC":"10766539549"
        },
        "1":{
          "name":"fixme",
          "parent_region_id":"0",
          "cycles":"11739598504",
          "real_time_nsec":"4127314998",
          "PAPI_TOT_CYC":"10659175953"
        },
        "2":{
          "name":"fixme",
          "parent_region_id":"1",
          "cycles":"243436683",
          "real_time_nsec":"85578550",
          "PAPI_TOT_CYC":"130145272"
        },
        "3":{
          "name":"fixme",
          "parent_region_id":"2",
          "cycles":"259949",
          "real_time_nsec":"90669",
          "PAPI_TOT_CYC":"162691"
        },
        "4":{
          "name":"fixme",
          "parent_region_id":"2",
          "cycles":"231477",
          "real_time_nsec":"81582",
          "PAPI_TOT_CYC":"145396"
        },
        "5":{
          "name":"fixme",
          "parent_region_id":"2",
          "cycles":"1082287",
          "real_time_nsec":"380530",
          "PAPI_TOT_CYC":"682842"
        },
        ..28k more lines..
}

mlir::PassManager::enableTiming() Coming with #739, default timing functionality from MLIR for the compilation pipeline (PassManager).

Use case: Provides a report of time spent in the compilation pipeline. Helpful when introducing complex optimization passes to monitor time/overhead spent during the passes.

Example output:

===-------------------------------------------------------------------------===
                         ... Execution time report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 0.0585 seconds

  ----User Time----  ----Wall Time----  ----Name----
    0.0061 (  5.0%)    0.0015 (  2.5%)  'func.func' Pipeline
    0.0060 (  4.9%)    0.0014 (  2.5%)    InferencePass
    0.0028 (  2.3%)    0.0028 (  4.9%)  Canonicalizer
    0.0005 (  0.4%)    0.0005 (  0.9%)  {anonymous}::PhyOperatorSelectionPass
    0.0005 (  0.4%)    0.0005 (  0.9%)  CSE
    0.0000 (  0.0%)    0.0000 (  0.0%)    (A) DominanceInfo
    0.0083 (  6.8%)    0.0009 (  1.5%)  'func.func' Pipeline
    0.0033 (  2.7%)    0.0004 (  0.7%)    AdaptTypesToKernelsPass
    0.0008 (  0.7%)    0.0002 (  0.3%)    ProfilingPass
    0.0015 (  1.3%)    0.0003 (  0.5%)    InsertDaphneContextPass
    0.0037 (  3.0%)    0.0037 (  6.3%)  Canonicalizer
    0.0006 (  0.5%)    0.0006 (  1.0%)  CSE
    0.0000 (  0.0%)    0.0000 (  0.0%)    (A) DominanceInfo
    0.0462 ( 38.0%)    0.0111 ( 19.0%)  'func.func' Pipeline
    0.0019 (  1.6%)    0.0003 (  0.6%)    ManageObjRefsPass
    0.0428 ( 35.2%)    0.0108 ( 18.4%)    {anonymous}::RewriteToCallKernelOpPass
    0.0015 (  1.2%)    0.0015 (  2.5%)  SCFToControlFlow
    0.0026 (  2.1%)    0.0003 (  0.6%)  'func.func' Pipeline
    0.0019 (  1.6%)    0.0003 (  0.6%)    LLVMRequestCWrappers
    0.0277 ( 22.8%)    0.0277 ( 47.4%)  {anonymous}::DaphneLowerToLLVMPass
    0.0046 (  3.7%)    0.0046 (  7.8%)  ReconcileUnrealizedCasts
    0.0003 (  0.3%)    0.0003 (  0.6%)  Rest
    0.1217 (100.0%)    0.0585 (100.0%)  Total

philipportner commented 1 month ago

All of the above was measure by running bin/daphne --timing --statistics --enable-profiling test/api/cli/algorithms/decisionTreeRealData2.daphne data=\"test/data/wine/winequality-red-white.csv\" dt=1 maxV=1.000000.

The mlir::PassManager output was manually added:

diff --git a/src/compiler/execution/DaphneIrExecutor.cpp b/src/compiler/execution/DaphneIrExecutor.cpp
index b994a3fc..49c683e6 100644
--- a/src/compiler/execution/DaphneIrExecutor.cpp
+++ b/src/compiler/execution/DaphneIrExecutor.cpp
@@ -221,6 +221,7 @@ bool DaphneIrExecutor::runPasses(mlir::ModuleOp module) {
+    pm.enableTiming();

philipportner commented 1 month ago

Some related issues: #206, #423 , #424

daphne-eu / daphne

Consolidation of timing/statistics/profiling #740