[Feature Request] Support data format defined by LLNL/hatchet

jrmadsen commented 3 years ago

So, to my knowledge, SPOT only supports data in the Caliper .cali format. I am setting up the SPOT container at NERSC where we will definitely be using SPOT to track Kokkos's performance but I am also working on setting up some automated performance analysis for our users. This won't be centered around the .cali data format which means that unless SPOT supports a more standardized data format, I won't be able to leverage the excellent work y'all have done creating SPOT. I feel like using Hatchet to support multiple output formats is the obvious choice here as the ability to handle output from multiple sources and convert them into a standardized data structure does or will exist for Caliper, timemory, TAU, Score-P, HPC-toolkit, raw json, etc.

mplegendre commented 3 years ago

Correct, SPOT currently only reads the .cali format (and a SPOTv1 JSON format that I'd prefer to just deprecate).

We've had some discussions with a couple tools on bringing their formats into SPOT. Fortunately we've got a single python location where we do the data access: https://github.com/LLNL/spotbe/blob/master/spot.py, and this exports a SPOT-specific JSON format. Most any other file format should be readable and convertible to our JSON. So we'll be able to get away with just making additions in this file.

Hatchet's not a perfect match here. Some of the data SPOT requires (the call trees and metrics) fits really well into hatchet, and we support exporting that data into hatchet. But the metadata and the time series data do not fit into hatchet, and it wouldn't be able to be a conversion mechanism without some expansion of Hatchet's scope.

So we're currently looking more at making a per-tool reader. Is there a specific tool you'd be interested in?

jrmadsen commented 3 years ago

I am not familiar with what you mean by a "time-series" but after poking around I gather you mean either comparisons of values/times for different... loop iterations? Or is it different runs? I think I could see the issue in hatchet w.r.t. loop iterations.

Ignoring the time-series issue temporarily, what if the metadata came from a separate json, e.g. metadata.json and that had a list of the files to process in hatchet? I'm just curious here bc maintaining multiple per-tool readers is something I am not eager to do bc I've already written a hatchet one for the specific tool I am interested in getting SPOT support for, timemory.

mplegendre commented 3 years ago

Time series data is performance data that isn't associated with a code location, but is associated with a point in time. Caliper's big example is memory bandwidth data. Bandwidth is driven by different cores/sockets (which may be running different apps), the prefetcher, and devices. It can't be associated with code very well, but it can be associated with time.

I'll talk to @slabasan about this at our next SPOT meeting. Even if we don't want to expand hatchet scope, maybe we could use some of the components from hatchet as a reader.

Beyond, the file format issues, do you have thoughts on how you'd collect run metadata for integration with timemory? Would you use adiak (https://github.com/llnl/adiak)? Do you have something similar in the timemory API?

jrmadsen commented 3 years ago

Time series data is performance data that isn't associated with a code location, but is associated with a point in time. Caliper's big example is memory bandwidth data. Bandwidth is driven by different cores/sockets (which may be running different apps), the prefetcher, and devices. It can't be associated with code very well, but it can be associated with time.

Ah thanks for this. I haven't focused much on this sort of capability yet but the support is there as i use it when I build the timem command-line tool (which is basically UNIX time but add rusage metrics + some /proc/<PID> values + PAPI hardware counters). I would be pretty easy to just dump to the same JSON layout as caliper for these in the future though -- each component pretty much has full control of how it writes the entries for its JSON file. Metrics/components associated with labels/hierarchies all have the same general structure with minor variations in whether the data is represented as a scalar or array (or in some cases like the roofline components, additional named metrics like "flop_rate") but that wouldn't be ideal for this sort of data anyhow so at least I now have a layout to target.

do you have thoughts on how you'd collect run metadata for integration with timemory?

That metadata.json thing is how I already handle things (since the flexibility in writing component-specific JSON entries makes having one coherent JSON rather difficult). I just added in the ability to supplement the metadata in a branch that will get merged soon. Right now i just called it "user" but that is likely to change before the merge.

I can put in a type-trait for components that will trigger the metadata JSON to have a field which indicates "these are the JSON files SPOT should be able to read", e.g. the "SPOT" section at the end (rest shown for context):

 {
  "timemory": {
   "metadata": {
    "user": {
     "CPU_FEATURES": "FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM6 4 TSCTMR AVX1.0 RDRAND F16C RDWRFSGS TSC_THREAD_OFFSET SGX BMI1 HLE AVX2 SMEP BMI2 ERMS INVPCID RTM FPU_CSDS MPX RDSEED ADX SMAP CLFSOPT IPT SGXLC MDCLEAR TSXFA IBRS STIBP L1DF SSBD",
     "CPU_FREQUENCY": "2900000000",
     "CPU_MODEL": "Intel(R) Core(TM) i9-8950HK CPU @ 2.90GHz",
     "CPU_VENDOR": "GenuineIntel",
     "HW_CONCURRENCY": "12",
     "HW_L1_CACHE_SIZE": "32768",
     "HW_L2_CACHE_SIZE": "262144",
     "HW_L3_CACHE_SIZE": "12582912",
     "HW_PHYSICAL_CPU": "6",
     "TIMEMORY_API": "tim::project::timemory",
     "TIMEMORY_GIT_DESCRIBE": "v3.1.0-128-gb397b182",
     "TIMEMORY_GIT_REVISION": "b397b1824db42999aacd4b79cad609135c2aa1e2",
     "TIMEMORY_VERSION": "3.2.0.dev4",
     "launch_time": "2021-01-21_03.10_PM"
    },
    "output": {
     "json": [
      {
       "key": "wall",
       "value": [
        "timemory-kokkosp-output/2021-01-21_03.10_PM/wall.flamegraph.json",
        "timemory-kokkosp-output/2021-01-21_03.10_PM/wall.json",
        "timemory-kokkosp-output/2021-01-21_03.10_PM/wall.tree.json"
       ]
      },
      {
       "key": "peak_rss",
       "value": [
        "timemory-kokkosp-output/2021-01-21_03.10_PM/peak_rss.json",
        "timemory-kokkosp-output/2021-01-21_03.10_PM/peak_rss.tree.json"
       ]
      },
     ],
     "SPOT": [
      {
       "key": "wall",
       "value": [
        "timemory-kokkosp-output/2021-01-21_03.10_PM/wall.json",
       ]
      }

Would you use adiak (https://github.com/llnl/adiak)?

I've looked through the ADIAK docs at least twice now and planned to add support for it but, until now. it kept getting bumped by higher priority things in timemory or kokkos.

jrmadsen commented 3 years ago

@mplegendre Is there a list of what metadata is required somewhere?

mplegendre commented 3 years ago

Is there a list of what metadata is required somewhere?

Absolute minimum metadata for functional SPOT is having the 'launchday' metadata attribute. In practice, other useful metadata are the MPI-related items, figures-of-merit, launch and runtimes, the problem size, and enabled packages. It can depend on your goals, but extra metadata isn't particularly harmful and can be hidden if not needed.

And following up on the original question--we are going to convert SPOT to use Hatchet's file readers. The work hasn't started yet, but is planned.

LLNL / spotfe

[Feature Request] Support data format defined by LLNL/hatchet #1