cylc / cylc-admin

Project planning for the Cylc Workflow Engine.
https://cylc.github.io/cylc-admin/
GNU General Public License v3.0

profiling: run cylc 7 profile tests against cylc 8 #141

oliver-sanders opened this issue 2 years ago

oliver-sanders commented 2 years ago

Cylc 7 had an on-board automated profile battery which provided us with insights into Cylc's scaling characteristics.

This profile battery will need a total re-write for Cylc 8 (there is an issue for that here: #38). However, this is a relatively large job and one for which we don't have time at the moment, so I suggest manually running some of the old scaling tests to give us an idea of how scaling has changed and how Cylc 8 compares against Cylc 7.

Quick overview of the main scaling dimensions of Cylc:

These dimensions are targeted by suites which were located in etc/dev-suites in the Cylc 7 source code.

For MO people some (rather dated) results from the old automated profile-battery can be found here: https://www-nwp/~osanders/profiling-results/2016/

To test scaling you will need to run the same experiment with different template variables.

Notes:

  • I suggest ramping up the scaling factors slowly; you can easily take down your machine with profiling experiments.
  • The chicken switch is killall python!
  • If you keep adding parallel tasks (without an internal Cylc queue), you will eventually hit your machine's fork limit, which will cause task failures (and a very unhappy box).

To profile them I suggest using /usr/bin/time, as we have a good level of understanding of this command and the results will be comparable to prior tests.

Notes:

  • time and /usr/bin/time may differ in your shell.
  • Always use verbose mode.
  • For Darwin install GNU time.
$ /usr/bin/time -o <output-file> -v cylc play <workflow> -s <option> ... --no-detach

In the output we are interested in:

  • Elapsed (wall clock) time
  • User time + System time (CPU time)
  • Maximum resident set size
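
For reference, a minimal sketch of pulling those fields out of a `/usr/bin/time -v` output file (the field names match the ones the report script at the end of this thread relies on; the output file name is whatever was passed to `-o`):

```python
# Minimal sketch: extract the metrics of interest from a `/usr/bin/time -v`
# output file written with `-o <output-file>`.
from pathlib import Path

FIELDS = (
    'Elapsed (wall clock) time (h:mm:ss or m:ss)',
    'User time (seconds)',
    'System time (seconds)',
    'Maximum resident set size (kbytes)',
)


def read_time_output(path):
    """Return the selected fields from a GNU time -v output file."""
    results = {}
    for line in Path(path).read_text().splitlines():
        key, _, value = line.strip().rpartition(': ')
        if key in FIELDS:
            results[key] = value
    return results


print(read_time_output('results'))  # example output file name
```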

For fair tests you need to run all experiments on the same hardware and to ensure the machine is idle. Running on a cluster may be a good idea but you may need to claim the whole node to meet these conditions (with the obvious resource wastage implications).

We don't have any up-to-date results for Cylc 7, so you will need to profile that too; however, you can use the automated profile battery there if you like. I think you run it like so:

$ cylc profile-battery -e complex -v 7.x.x
hjoliver commented 2 years ago

Yeah this should be a priority once rc1 is out. (Feature freeze, more or less, then bug-fix, profile, optimize...)

Cylc 8 ought to be a lot more efficient, but we haven't really proved that yet.

MetRonnie commented 2 years ago

Here is my first go at profiling:

Small workflow

Datetime cycling example from tutorials; tasks with no scripts/sleeps.

flow.cylc

```ini
[scheduler]
    allow implicit tasks = True
    UTC mode = True
[scheduling]
    initial cycle point = 2000-01-01T00Z
    final cycle point = 2000-01-04T00Z
    [[graph]]
        PT3H = """
            get_observations_belmullet & get_observations_camborne &
            get_observations_heathrow & get_observations_shetland =>
            consolidate_observations
        """
        +PT6H/PT6H = """
            consolidate_observations[-PT6H] => forecast
            consolidate_observations[-PT3H] => forecast
            consolidate_observations & get_rainfall => forecast
            forecast => post_process_exeter
        """
        +PT12H/PT6H = """
            forecast[-PT6H] => forecast
        """
```
|                            | 7.8.10  | 8.0rc1.dev0[^1] |
| -------------------------- | ------- | --------------- |
| Wall clock time            | 0:01:46 | 0:04:03         |
| User + Sys time (s)        | 21.0    | 31.4            |
| % of CPU this job got      | 19      | 12              |
| Max resident set size (kB) | 27,464  | 48,704          |

[^1]: Commit: 72781934, Python 3.9.7

Complex workflow

A while ago I had taken the "Complex" suite from Oliver's ~/cylc-run and updated it to Cylc 8 syntax: https://gist.github.com/MetRonnie/98d42d0b090b1fee7f24a000b90be0af

Tasks with 1s sleeps (note for some reason I used env-script instead of script in [runtime][root])

|                            | 7.8.10  | 8.0rc1.dev0[^1] |
| -------------------------- | ------- | --------------- |
| Wall clock time            | 0:25:18 | 1:11:06         |
| User + Sys time (s)        | 469.3   | 771.9           |
| % of CPU this job got      | 30      | 18              |
| Max resident set size (kB) | 71,664  | 101,748         |

P.S. An indication of how verbose the logging has become (at log level INFO): I noticed there were 9 log files for Cylc 8 vs 3 for Cylc 7, and the first 4 of the Cylc 8 log files had already been deleted, as the default is to keep only 5 at any time.

oliver-sanders commented 2 years ago

The Cylc 7 results are as expected, the Cylc 8 results less so; we will need to investigate. To help reproduce these results, could you share some details of how they were run?

MetRonnie commented 2 years ago

Ran on my VLD box while not using it (with only a VSCode window with a few tabs, a file manager window and 3 terminal sessions open). Cylc 7 was (/usr/bin/python) 2.7.5, Cylc 8 was (~/.conda/envs/cylc8/bin/python) 3.9.7.

Also, I did the Cylc 8 complex test while keeping the VDI connection open during the middle of the day (doing stuff in Windows in the meantime), whereas for the Cylc 7 complex test I started it last thing in the evening and closed the VDI connection to let it run overnight. I don't know whether the load on the VLD servers at different times of day, or whether the virtual desktop connection was open or not, would affect the results.

MetRonnie commented 2 years ago

Profiled the small datetime-cycling workflow using memory-profiler Figure_1

Figure_2
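
The thread doesn't record exactly how these traces were produced; a minimal sketch of one way to do it with memory-profiler (the workflow name and sampling interval here are illustrative):

```python
# Sketch: sample the resident memory of a `cylc play ... --no-detach`
# process once per second with memory_profiler, then plot the trace.
import subprocess

import matplotlib.pyplot as plt
from memory_profiler import memory_usage

proc = subprocess.Popen(
    ['cylc', 'play', 'datetime-cycling', '--no-detach'])
samples = memory_usage(proc, interval=1.0)  # RSS samples in MiB

plt.plot(range(len(samples)), samples)
plt.xlabel('time (s)')
plt.ylabel('memory (MiB)')
plt.show()
```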

MetRonnie commented 2 years ago

Validation of the complex workflow: Figure_3 Figure_4

Cf. the ~100 MB max RSS while the workflow is running.

oliver-sanders commented 2 years ago

I've profiled cylc validate complex two ways to give an inner and an outer measurement.

The program as a whole weighs ~3.5 times as much as the config alone, a difference of ~38 MB.

Measuring the overhead of a "Hello World" Python script gives 6909952 bytes; subtracting that from the ~38 MB leaves us with ~30 MB of unaccounted-for memory:

51855360 - (14322992 + 6909952) = 30622416 b

This is approximately 60% of the total usage.

From introspection it looks like the bulk of that is str objects.

Python does fill up memory with a lot of code objects representing the loaded code itself; I have seen these get quite large (on Python 2.6). It will take a bit more investigation to determine what this 30 MB is being used for.
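
One way to get that kind of breakdown (a sketch only, not how the numbers above were obtained) is pympler's summary module, called from the same point in validate.py as the asized() call in the diff below:

```python
# Sketch: summarise live objects by type (str, dict, code, ...) to see
# where the unaccounted-for memory sits.
from pympler import muppy, summary

all_objects = muppy.get_objects()
summary.print_(summary.summarize(all_objects))
```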

Methodology

Obtaining the inner measurement:

diff --git a/cylc/flow/scripts/validate.py b/cylc/flow/scripts/validate.py
index 53a9b824f..d1c134323 100755
--- a/cylc/flow/scripts/validate.py
+++ b/cylc/flow/scripts/validate.py
@@ -168,3 +168,5 @@ def main(parser: COP, options: 'Values', reg: str) -> None:

     print(cparse('<green>Valid for cylc-%s</green>' % CYLC_VERSION))
     profiler.stop()
+    from pympler.asizeof import asized
+    print(asized(cfg, detail=2).format(depth=2))
$ cylc validate complex
...
<cylc.flow.config.WorkflowConfig object at 0x7f80e9030ad0> size=14322992 flat=64
    __dict__ size=14322928 flat=416
        [V] taskdefs: {'ensemble_install_startdata_cold': <c....def.TaskDef object at 0x7f80eb1dce60>} size=9974040 flat=36976
        [V] edges: {<cylc.flow.cycling.iso8601.ISO8601Seq....mble_forecast_012_tN', False, False)}} size=2080024 flat=248
        [V] pcfg: <cylc.flow.cfgspec.workflow.RawWorkflowConfig object at 0x7f80e7c8f6d0> size=1615208 flat=64
        [V] runtime: {'parents': {'root': [], 'HPC': ['root....fication_to_highway_happyland_yay2']}} size=625096 flat=248
        [V] ns_defn_order: ['root', 'HPC', 'HPC_BACKUP', 'CORETYP....tification_to_highway_happyland_yay2'] size=10040 flat=10040
        [V] leaves: ['archive_logs', 'brief_archive', 'bri....', 'recover1', 'recover2', 'recover3'] size=9200 flat=9200
        [V] clock_offsets: {'long_start_00': <cylc.flow.cycling.i....601Interval object at 0x7f80ea82f5d0>} size=1736 flat=656
        [V] options: <Values at 0x7f80e565c690: {'check_cir....emplatevars_file': None, 'icp': None}> size=1688 flat=64

Obtaining the outer measurement:

$ # (after reversing the above diff)
$ /usr/bin/time -lp cylc validate complex  # note: BSD time args, use -v for GNU time
...
real         1.96
user         1.85
sys          0.10
  51855360  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
     24744  page reclaims
         0  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
         8  voluntary context switches
       170  involuntary context switches

Obtaining the Python baseline:

$ /usr/bin/time -lp python -c 'print("hello world")'
hello world
real         0.03
user         0.02
sys          0.00
   6909952  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
      1706  page reclaims
         0  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
         0  voluntary context switches
         2  involuntary context switches
oliver-sanders commented 2 years ago

A simple test comparing Python with and without importing cylc.flow.scheduler.Scheduler shows that the Cylc code (including dependencies) occupies 33.7 MB in memory at runtime, a similar number to the one in the post above.

40,615,936 b - 6,909,952 b = 33,705,984 b
$ cat test.py
from cylc.flow.scheduler import Scheduler
$ /usr/bin/time -lp python test.py
real         0.47
user         0.34
sys          0.10
  40615936  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
     10017  page reclaims
         0  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
        49  voluntary context switches
        73  involuntary context switches
$ /usr/bin/time -lp python -c 'print("hello world")'
hello world
real         0.03
user         0.02
sys          0.00
   6909952  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
      1706  page reclaims
         0  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
         0  voluntary context switches
         1  involuntary context switches
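
A finer-grained cross-check (not one used above) is tracemalloc, which attributes Python-level allocations to the files that made them; note it only sees memory allocated through Python, so the totals will be smaller than the RSS figures:

```python
# Sketch: break down the memory allocated while importing the Scheduler.
import tracemalloc

tracemalloc.start()
from cylc.flow.scheduler import Scheduler  # noqa: E402,F401

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics('filename')[:10]:
    print(stat)  # largest allocators first
```
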
MetRonnie commented 2 years ago

Using /usr/bin/time on cylc validate datetime-cycling:

If I strip cylc/flow/scripts/validate.py down to just

def main(*a, **k) -> None:
    """cylc validate CLI."""
    return

Max RSS: 19,248 kB

Adding the imports back in

Code snippet

```python
from ansimarkup import parse as cparse
from optparse import Values
import sys
import textwrap

from cylc.flow import LOG, __version__ as CYLC_VERSION
from cylc.flow.config import WorkflowConfig
from cylc.flow.exceptions import (
    WorkflowConfigError,
    TaskProxySequenceBoundsError,
    TriggerExpressionError
)
import cylc.flow.flags
from cylc.flow.loggingutil import disable_timestamps
from cylc.flow.option_parsers import (
    CylcOptionParser as COP,
    Options
)
from cylc.flow.profiler import Profiler
from cylc.flow.task_proxy import TaskProxy
from cylc.flow.templatevars import get_template_vars
from cylc.flow.terminal import cli_function
from cylc.flow.workflow_files import parse_reg


def main(*a, **k) -> None:
    """cylc validate CLI."""
    return
```

Max RSS: 31,520 kB

And with the unmodified file I get max RSS: 35,632 kB

In graphical form:

image

For comparison in Cylc 7 the full cylc validate has max RSS: 16,428 kB

MetRonnie commented 2 years ago

SnakeViz for cylc play datetime-cycling --profile -n

image

If <method 'poll' of 'select.epoll' objects> is sleeps, are we sleeping too much? Considering that Cylc 7 was able to run it in half the wall clock time.

Zooming in: image

(Note: everything under update_profiler_logs on the left can be ignored as that doesn't run when not using --profile)


Snakeviz for cylc play complex --profile -n:

image

Zooming in to the left (red squiggly bit is profiling itself so can be ignored):

image

Zooming in to the centre

image

MetRonnie commented 2 years ago

Possible memory leak? (This is not the full workflow run - I stopped it before completion) Figure_5
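
One standard way to chase this (not used in the thread) is to diff two tracemalloc snapshots taken some time apart inside the scheduler process, e.g. from a main loop plugin; a rough sketch:

```python
# Sketch: find growing allocations by comparing tracemalloc snapshots.
# This would need to run inside the scheduler process; the sleep is just
# a stand-in for letting the workflow run for a while.
import time
import tracemalloc

tracemalloc.start()
first = tracemalloc.take_snapshot()

time.sleep(600)  # placeholder: let the scheduler run on

second = tracemalloc.take_snapshot()
for stat in second.compare_to(first, 'lineno')[:10]:
    print(stat)  # largest differences first
```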

oliver-sanders commented 2 years ago

If <method 'poll' of 'select.epoll' objects> is sleeps, are we sleeping too much? Considering that Cylc 7 was able to run it in half the wall clock time.

That's mostly await asyncio.sleep, which is to be expected. If the workflow takes longer to complete (which it did) there will be more iterations of the main loop, so more sleep between iterations, as you said.

We can reduce the main loop sleep for more responsiveness; however, that raises CPU usage.

oliver-sanders commented 2 years ago

Here's the complex workflow running with --main-loop 'log memory'

cylc flow main_loop log_memory

The largest object, at 12 MB, is the config, followed by the task pool at 4 MB. Nothing here explains the >100 MB reported, which suggests the remaining memory is perhaps not directly associated with the Scheduler object?

The orange line sloping upwards, which looks like a memory leak, is actually the data that generated this graph accumulating.

MetRonnie commented 2 years ago

cylc play complex

(Scheduler start is everything up to the start of the main loop)

Looking into these reveals:

scheduler_cli imports

Scheduler start

hjoliver commented 2 years ago

P.S. An indication of how verbose the logging has become (at log level INFO)

That was deliberate (on my part anyway) to help with Cylc 8 development. Expected to pare it back a lot once things stabilized a bit.

hjoliver commented 2 years ago

SnakeViz for cylc play datetime-cycling --profile -n

@MetRonnie - did you do this for Cylc 7 too?

MetRonnie commented 2 years ago

No, Python 2 vs 3 might make it difficult but I'll try

MetRonnie commented 2 years ago

The small datetime-cycling workflow run over dozens of cycles. Memory usage creeping up monotonically Figure_6

MetRonnie commented 2 years ago

The Cylc 7 cylc run --profile didn't seem to produce a .prof file to use with SnakeViz. However, the top functions by cumulative time (for the complex suite) are:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.230    0.230 1513.650 1513.650 ~/github/cylc-7/lib/cylc/scheduler.py:1603(run)
     1136    4.405    0.004  721.388    0.635 ~/github/cylc-7/lib/cylc/suite_db_mgr.py:187(process_queued_ops)
     1135    0.009    0.000  721.269    0.635 ~/github/cylc-7/lib/cylc/scheduler.py:1275(process_suite_db_queue)
     2272    2.048    0.001  674.076    0.297 ~/github/cylc-7/lib/cylc/rundb.py:401(execute_queued_items)
    18876    0.160    0.000  453.895    0.024 ~/github/cylc-7/lib/cylc/rundb.py:454(_execute_stmt)
    18876  433.598    0.023  433.598    0.023 {method 'executemany' of 'sqlite3.Connection' objects}
      864    0.387    0.000  229.521    0.266 ~/github/cylc-7/lib/cylc/scheduler.py:1245(process_task_pool)
     1942  212.488    0.109  212.488    0.109 {method 'commit' of 'sqlite3.Connection' objects}
     1135  199.721    0.176  199.721    0.176 {time.sleep}
      864    0.171    0.000  168.005    0.194 ~/github/cylc-7/lib/cylc/task_job_mgr.py:180(submit_task_jobs)
      864    0.269    0.000  159.935    0.185 ~/github/cylc-7/lib/cylc/task_job_mgr.py:159(prep_submit_task_jobs)
     3480    0.523    0.000  159.659    0.046 ~/github/cylc-7/lib/cylc/task_job_mgr.py:733(_prep_submit_task_job)
    10841    0.091    0.000  144.661    0.013 /usr/lib64/python2.7/subprocess.py:475(_eintr_retry_call)
     3480    8.404    0.002  127.785    0.037 ~/github/cylc-7/lib/cylc/job_file.py:44(write)
     1136    0.738    0.001  120.515    0.106 ~/github/cylc-7/lib/cylc/scheduler.py:1669(update_state_summary)
     5039    0.334    0.000   86.496    0.017 /usr/lib64/python2.7/subprocess.py:650(__init__)
     5039    1.094    0.000   85.533    0.017 /usr/lib64/python2.7/subprocess.py:1188(_execute_child)
     1136    0.061    0.000   83.884    0.074 ~/github/cylc-7/lib/cylc/subprocpool.py:172(process)
     1134    0.048    0.000   76.739    0.068 ~/github/cylc-7/lib/cylc/scheduler.py:1590(update_profiler_logs)
      398    0.009    0.000   75.389    0.189 ~/github/cylc-7/lib/cylc/subprocpool.py:158(_proc_exit)
      398    0.012    0.000   75.338    0.189 ~/github/cylc-7/lib/cylc/subprocpool.py:358(_run_command_exit)
      398    0.010    0.000   75.304    0.189 ~/github/cylc-7/lib/cylc/task_job_mgr.py:690(_submit_task_jobs_callback)
      398    0.196    0.000   75.294    0.189 ~/github/cylc-7/lib/cylc/task_job_mgr.py:476(_manip_task_jobs_callback)
     1134    0.127    0.000   74.937    0.066 ~/github/cylc-7/lib/cylc/scheduler.py:2031(_update_cpu_usage)
    20930    0.114    0.000   73.991    0.004 /usr/lib64/python2.7/logging/__init__.py:1249(_log)
      956   13.481    0.014   73.941    0.077 ~/github/cylc-7/lib/cylc/state_summary_mgr.py:50(update)
    20930    0.069    0.000   72.642    0.003 /usr/lib64/python2.7/logging/__init__.py:1270(handle)
    20930    0.171    0.000   72.549    0.003 /usr/lib64/python2.7/logging/__init__.py:1302(callHandlers)
    41846    0.142    0.000   72.377    0.002 /usr/lib64/python2.7/logging/__init__.py:736(handle)
    10440    0.271    0.000   72.236    0.007 ~/github/cylc-7/lib/cylc/task_events_mgr.py:286(process_message)
     5534   71.470    0.013   71.470    0.013 {posix.read}

So a lot of CPU time is taken up by DB ops in Cylc 7 too.
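
For the Cylc 8 side, where --profile does write a cProfile file (the one loaded into SnakeViz above), an equivalent table can be printed with the standard pstats module; the file name here is a placeholder:

```python
# Sketch: print the top functions by cumulative time from a cProfile file.
import pstats

stats = pstats.Stats('profile.prof')  # placeholder path to the profile file
stats.sort_stats('cumulative').print_stats(30)
```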

dwsutherland commented 2 years ago

Not sure I understand the time difference: 0:25:18 vs 1:11:06

Seems kinda impossible... It's almost as if the job communication isn't working and it's switched to polling mode...

Is the memory overhead the loaded libraries?

MetRonnie commented 2 years ago

How can you tell if the job communication is happening via polling?

I'm getting

    # CYLC TASK ENVIRONMENT:
    export CYLC_TASK_COMMS_METHOD=zmq

in job scripts.

Example job.status:

CYLC_JOB_RUNNER_NAME=background
CYLC_JOB_ID=17391
CYLC_JOB_RUNNER_SUBMIT_TIME=2022-02-01T18:54:42Z
CYLC_JOB_PID=17391
CYLC_JOB_INIT_TIME=2022-02-01T18:54:49Z
CYLC_JOB_EXIT=SUCCEEDED
CYLC_JOB_EXIT_TIME=2022-02-01T18:54:54Z

which is submit --[7s]--> init --[5s]--> exit for a task with script = sleep 1
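
For anyone repeating this comparison, a small sketch of extracting those gaps from a Cylc 8 job.status file like the one above (the path and task name are just examples):

```python
# Sketch: compute the submit->init and init->exit gaps from a job.status
# file, using the field names shown in the example above.
from datetime import datetime
from pathlib import Path


def job_gaps(path):
    status = dict(
        line.strip().split('=', 1)
        for line in Path(path).read_text().splitlines()
        if '=' in line
    )
    fmt = '%Y-%m-%dT%H:%M:%SZ'
    submit = datetime.strptime(status['CYLC_JOB_RUNNER_SUBMIT_TIME'], fmt)
    init = datetime.strptime(status['CYLC_JOB_INIT_TIME'], fmt)
    exit_ = datetime.strptime(status['CYLC_JOB_EXIT_TIME'], fmt)
    return init - submit, exit_ - init


# example path under ~/cylc-run
print(job_gaps(Path.home() / 'cylc-run/complex/log/job/1/foo/NN/job.status'))
```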

dwsutherland commented 2 years ago

Let's say there are 5000 sequential tasks (maybe more in parallel); if each job/task message takes 200 ms longer, with 3 messages per job, that's 3000 s or 50 min extra, which would account for the difference (I don't know the exact numbers).

I'm, of course, assuming that what we found previously (the time spent loading network libraries, the auth, etc.) is the issue.

This is still a concern, as someone could use Cylc for running many small jobs in the same way... We need to find out exactly where the time is going.

How can you tell if the job communication is happening via polling?

debug mode will tell you how the messages are being received (zmq or otherwise), in the workflow log..

which is submit --[7s]--> init --[5s]--> exit for a task with script = sleep 1

What does Cylc 7 look like? It looks like that's it... If so, then at least it's not the Scheduler internals (I suppose).

oliver-sanders commented 2 years ago

How can you tell if the job communication is happening via polling?

I think we now omit the contact file installation for polling platforms.

You can also run the workflow in --debug mode and inspect the job.err log which would record the ZMQ send operations for TCP comms.

oliver-sanders commented 2 years ago

then if each job/task message takes 200ms longer with 3

We run cylc message in a non-blocking way to avoid this issue, e.g.:

https://github.com/cylc/cylc-flow/blob/9d481bae9c58b994fbb11351812e1f6861e9e5c4/cylc/flow/etc/job.sh#L130

It does wait for the started message:

https://github.com/cylc/cylc-flow/blob/ed19b4b5e574f1c7170e77722a939f6d2dfe8d5c/cylc/flow/etc/job.sh#L163

But only at the end of the job.

MetRonnie commented 2 years ago

Comparing the same tasks (all with 1 s sleep scripts, but probably with differing numbers of other tasks running at the same time):

brief_observations_background_atmos

|                | Cylc 7 | Cylc 8 |
| -------------- | ------ | ------ |
| submit -> init | 1s     | 4s     |
| init -> exit   | 1s     | 3s     |

ensemble_res_create

|                | Cylc 7 | Cylc 8 |
| -------------- | ------ | ------ |
| submit -> init | 2s     | 7s     |
| init -> exit   | 1s     | 4s     |

long_assimilation_create_zap_larger

|                | Cylc 7 | Cylc 8 |
| -------------- | ------ | ------ |
| submit -> init | 9s     | 29s    |
| init -> exit   | 4s     | 23s    |

dwsutherland commented 2 years ago

We should create some sort of job profiler, because that's obviously where the issue is. We can find out exactly what's taking so long by fiddling with cylc/flow/network/client.py and maybe the job script.

The auth will be doing file reads, and we could print out timings for library loading, client/connection instantiation, and message send/receive.
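
A minimal sketch of that kind of instrumentation, assuming it runs on a job host with cylc-flow installed; only the library-loading step is timed concretely here, since the exact client/connection and send/receive calls depend on the job's contact file and authentication setup:

```python
# Rough sketch of a client-side timing harness for job messaging.
import time

marks = [('start', time.perf_counter())]

# 1. Library loading: importing the network client pulls in zmq etc.
import cylc.flow.network.client  # noqa: E402,F401
marks.append(('import cylc.flow.network.client', time.perf_counter()))

# 2. Client/connection instantiation and 3. message send/receive would be
#    timed the same way, appending a mark after each step.

for (_, prev), (label, now) in zip(marks, marks[1:]):
    print(f'{label}: {now - prev:.3f}s')
```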

dwsutherland commented 2 years ago

I think the only reason going via the UIS could be faster is that it already has the libraries loaded and a client/connection instantiated.

If we find this is the issue, one option may be a long-lived client for the life of the job (which may mean having the job wrapper in Python)... But I don't know, it could introduce other delays I suppose...

Surely there's a way to slim down the client end with zmq (again, assuming this is the issue).

oliver-sanders commented 2 years ago

(coming back to this issue after we have a release out)

oliver-sanders commented 1 year ago

Here's a basic auto-profiling script I wrote for the increment_graph_window investigation:

Basic profiling script: `./profile`

```bash
#!/usr/bin/env bash
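# Sweep the 'diamond' test workflow over a range of TASKS values (CYCLES=1)
# for each branch listed in run(): check out the branch in $HOME/cylc,
# install the workflow under a branch-specific run name and record
# `/usr/bin/time -v` output to a `results` file in the run directory.
# Runs already present in `cylc scan --states=all` are skipped, so an
# interrupted sweep can be resumed.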

set -eu

WORKFLOWS="$(cylc scan --states=all --format=name)"

checkout () {
    local branch="$1"
    (
        cd "$HOME/cylc";
        git checkout "$branch";
    )
}

profile () {
    local branch="$1"
    local branch_path="$(sed 's|/|_|g' <<< "$branch")"
    local tasks="$2"
    local cycles="$3"
    local run_name="${branch_path}__tasks-${tasks}__cycles-${cycles}"
    local workflow_id="diamond/$run_name"

    if [[ "$WORKFLOWS" == *$workflow_id* ]]; then
        echo -e "\n\nUsing cached: branch=$branch tasks=$tasks cycles=$cycles\n\n"
        return
    else
        echo -e "\n\nRunning: branch=$branch tasks=$tasks cycles=$cycles\n\n"
    fi

    cylc clean "${workflow_id}" -y 2>/dev/null || true
    cylc install diamond --run-name="$run_name"

    # profile with cProfile (inner measurement)
    # cylc play "diamond/$run_name" \
    #     --profile \
    #     --no-detach \
    #     -s "TASKS=$tasks" \
    #     -s "CYCLES=$cycles"

    # profile with /usr/bin/time (outer measurement)
    /usr/bin/time \
        -v \
        -o results \
        cylc play \
        "diamond/$run_name" \
        --no-detach \
        -s "TASKS=$tasks" \
        -s "CYCLES=$cycles"
    mv results "$HOME/cylc-run/${workflow_id}/results"
}

run () {
    local branches=(
        'upstream/master'

        'ronnie-local/SPEED-POWER'
        'ronnie/tokens'

        'david/store-tproxy-efficiency'
        'store-tproxy-efficiency-2023-01-24'
        # 'store-tproxy-efficiency-w-platform-call'
        # 'avoid_creating_task_proxies'

        # 'tim/dont_get_platform_too_soon'
        # 'tim/cache_platform_copying'
    )

    for branch in "${branches[@]}"; do
        checkout "$branch"
        profile "$branch" 1 1
        profile "$branch" 5 1
        profile "$branch" 10 1
        profile "$branch" 15 1
        profile "$branch" 20 1
        profile "$branch" 25 1
        profile "$branch" 30 1
        profile "$branch" 35 1
        profile "$branch" 40 1
        profile "$branch" 45 1
        profile "$branch" 50 1
    done
}

run
```
`./report.py`

```python
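"""Plot the /usr/bin/time results gathered by ./profile.

Reads the `results` file from each profiled run, derives a combined CPU
time from the user and system times, and plots wall clock time, CPU time
and max RSS against the TASKS template variable for each branch.
"""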
from contextlib import suppress
from datetime import timedelta
from pathlib import Path
import sys

import matplotlib.pyplot as plt

def load(filename):
    results = {}
    with open(filename, 'r') as f:
        for line in f:
            key, value = (i.strip() for i in line.rsplit(': ', 1))
            results[key] = value
    return derive(results)

def time_delta(value):
    try:
        minutes, seconds = value.rsplit(':', 1)
        return timedelta(
            minutes=float(minutes),
            seconds=float(seconds),
        ).total_seconds()
    except:
        raise ValueError(value)

CAST_METHODS = (
    float,
    int,
    time_delta,
)

def cast(value):
    for method in CAST_METHODS:
        with suppress(ValueError):
            return method(value)
    return value

timedelta(hours=1, minutes=49, seconds=51).total_seconds

def derive(results):
    _results = {
        key: cast(value)
        for key, value in results.items()
    }
    _results['CPU time (seconds)'] = _results['User time (seconds)'] + _results['System time (seconds)']
    return _results

def report(results):
    print(results)

def plot(results, key, metric):
    data = []
    for branch, _results in results.items():
        x = []
        y = []
        for args, __result in _results:
            # data.append(args[key], result[metric]
            x.append(args[key])
            y.append(__result[metric])
        plt.plot(x, y, label=branch)
    plt.legend()
    plt.xlabel(key)
    plt.ylabel(metric)
    plt.show()

def run():
    workflow = 'diamond'
    branches = [
        'upstream/master',
        'ronnie/tokens',
        'david/store-tproxy-efficiency',
        'ronnie-local/SPEED-POWER'

        # 'david/store-tproxy-efficiency',
        # 'store-tproxy-efficiency-w-platform-call',
        # 'avoid_creating_task_proxies',
        # 'tim/dont_get_platform_too_soon',
        # 'tim/cache_platform_copying',
    ]
    args = [
        {'TASKS': 1, 'CYCLES': 1},
        {'TASKS': 5, 'CYCLES': 1},
        {'TASKS': 10, 'CYCLES': 1},
        {'TASKS': 15, 'CYCLES': 1},
        {'TASKS': 20, 'CYCLES': 1},
        {'TASKS': 25, 'CYCLES': 1},
        {'TASKS': 30, 'CYCLES': 1},
        {'TASKS': 35, 'CYCLES': 1},
        {'TASKS': 40, 'CYCLES': 1},
        {'TASKS': 45, 'CYCLES': 1},
        {'TASKS': 50, 'CYCLES': 1},
    ]
    results = {}
    for branch in branches:
        _results = []
        for argset in args:
            workflow_id = get_id(workflow, branch, **argset)
            try:
                __results = load(Path('~osanders/cylc-run/', workflow_id, 'results').expanduser())
            except Exception as exc:
                print(f'WARNING: skipping {workflow_id}: {exc}')
                continue
            _results.append((argset, __results))
        results[branch] = _results
    plot(results, 'TASKS', 'Elapsed (wall clock) time (h:mm:ss or m:ss)')
    plot(results, 'TASKS', 'CPU time (seconds)')
    plot(results, 'TASKS', 'Maximum resident set size (kbytes)')

def get_id(workflow, branch, **args):
    ret = f'{workflow}/{branch.replace("/", "_")}'
    for key, value in args.items():
        ret += f'__{key.lower()}-{value}'
    return ret

if __name__ == '__main__':
    # run(sys.argv[1])
    run()
```