oliver-sanders opened this issue 2 years ago
Yeah this should be a priority once rc1 is out. (Feature freeze, more or less, then bug-fix, profile, optimize...)
Cylc 8 ought to be a lot more efficient, but we haven't really proved that yet.
Here is my first go at profiling:
Datetime cycling example from the tutorials; tasks with no scripts/sleeps (`flow.cylc`):
| | 7.8.10 | 8.0rc1.dev0[^1] |
| --- | --- | --- |
| Wall clock time | 0:01:46 | 0:04:03 |
| User + Sys time (s) | 21.0 | 31.4 |
| % of CPU this job got | 19 | 12 |
| Max resident set size (kB) | 27,464 | 48,704 |
[^1]: Commit: 72781934, Python 3.9.7
A while ago I had taken the "Complex" suite from Oliver's ~/cylc-run
and updated it to Cylc 8 syntax: https://gist.github.com/MetRonnie/98d42d0b090b1fee7f24a000b90be0af
Tasks with 1 s sleeps (note: for some reason I used `env-script` instead of `script` in `[runtime][root]`):
| | 7.8.10 | 8.0rc1.dev0[^1] |
| --- | --- | --- |
| Wall clock time | 0:25:18 | 1:11:06 |
| User + Sys time (s) | 469.3 | 771.9 |
| % of CPU this job got | 30 | 18 |
| Max resident set size (kB) | 71,664 | 101,748 |
P.S. An indication of how verbose the logging has become (at log level INFO): I noticed there were 9 log files for Cylc 8 vs 3 for Cylc 7, and the first 4 of the Cylc 8 log files had already been deleted, as the default is to keep only 5 at a time.
The Cylc 7 results are as expected; the Cylc 8 results less so and will need investigating. To help reproduce these results:

- `complex` suite, hardcoded to `20150220T00`
- Ran on my VLD box while not using it (with only a VSCode window with a few tabs, a file manager window and 3 terminal sessions open)
- Cylc 7 was Python 2.7.5 (`/usr/bin/python`); Cylc 8 was Python 3.9.7 (`~/.conda/envs/cylc8/bin/python`)
Also, I did the Cylc 8 complex test while keeping the VDI connection open during the middle of the day (doing stuff in Windows in the meantime), whereas for the Cylc 7 complex test I started it last thing in the evening and closed the VDI connection to let it run overnight. I don't know whether load on the VLD servers varying with time of day, or whether the virtual desktop connection being open or not, would affect the results.
Profiled the small datetime-cycling workflow using `memory-profiler`:
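For reference, a minimal sketch of how `memory-profiler` can sample a scheduler process from the outside (the workflow name and sampling interval here are illustrative):

```python
import subprocess

from memory_profiler import memory_usage

# Sample the RSS of the scheduler once per second until it exits.
proc = subprocess.Popen(['cylc', 'play', 'datetime-cycling', '--no-detach'])
samples = memory_usage(proc, interval=1.0)  # list of samples in MiB
print(f'max RSS: {max(samples):.1f} MiB')
```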
Validation of the complex workflow:

Cf. ~100 MB max RSS during running of the workflow.
I've profiled `cylc validate complex` two ways to give an inner and an outer measurement.
The program as a whole weighs ~3.5 times as much as the config alone; the difference adds up to ~38 MB. Measuring the overhead of a "Hello World" Python script I get 6,909,952 B; subtracting that from the ~38 MB leaves us with ~30 MB of unaccounted memory:

51855360 - (14322992 + 6909952) = 30622416 B

This is approximately 60% of the total usage.
From introspection it looks like the bulk of that is `str` objects. Python does fill up memory with a lot of `code` objects representing the system code itself; I have seen these get quite large (back in Python 2.6). It will take a bit more investigation to determine what this ~30 MB is being used for.
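One way to do that introspection, as a sketch, is to summarise live objects by type with pympler (the same library used for the inner measurement below):

```python
from pympler import muppy, summary

# Snapshot all live objects and print a per-type memory summary;
# str came out on top of this listing here.
all_objects = muppy.get_objects()
summary.print_(summary.summarize(all_objects))
```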
Obtaining the inner measurement:
diff --git a/cylc/flow/scripts/validate.py b/cylc/flow/scripts/validate.py
index 53a9b824f..d1c134323 100755
--- a/cylc/flow/scripts/validate.py
+++ b/cylc/flow/scripts/validate.py
@@ -168,3 +168,5 @@ def main(parser: COP, options: 'Values', reg: str) -> None:
print(cparse('<green>Valid for cylc-%s</green>' % CYLC_VERSION))
profiler.stop()
+ from pympler.asizeof import asized
+ print(asized(cfg, detail=2).format(depth=2))
$ cylc validate complex
...
<cylc.flow.config.WorkflowConfig object at 0x7f80e9030ad0> size=14322992 flat=64
__dict__ size=14322928 flat=416
[V] taskdefs: {'ensemble_install_startdata_cold': <c....def.TaskDef object at 0x7f80eb1dce60>} size=9974040 flat=36976
[V] edges: {<cylc.flow.cycling.iso8601.ISO8601Seq....mble_forecast_012_tN', False, False)}} size=2080024 flat=248
[V] pcfg: <cylc.flow.cfgspec.workflow.RawWorkflowConfig object at 0x7f80e7c8f6d0> size=1615208 flat=64
[V] runtime: {'parents': {'root': [], 'HPC': ['root....fication_to_highway_happyland_yay2']}} size=625096 flat=248
[V] ns_defn_order: ['root', 'HPC', 'HPC_BACKUP', 'CORETYP....tification_to_highway_happyland_yay2'] size=10040 flat=10040
[V] leaves: ['archive_logs', 'brief_archive', 'bri....', 'recover1', 'recover2', 'recover3'] size=9200 flat=9200
[V] clock_offsets: {'long_start_00': <cylc.flow.cycling.i....601Interval object at 0x7f80ea82f5d0>} size=1736 flat=656
[V] options: <Values at 0x7f80e565c690: {'check_cir....emplatevars_file': None, 'icp': None}> size=1688 flat=64
Obtaining the outer measurement:
$ # (after reversing the above diff)
$ /usr/bin/time -lp cylc validate complex # note: BSD time args, use -v for GNU time
...
real 1.96
user 1.85
sys 0.10
51855360 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
24744 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
8 voluntary context switches
170 involuntary context switches
Obtaining the Python baseline:
$ /usr/bin/time -lp python -c 'print("hello world")'
hello world
real 0.03
user 0.02
sys 0.00
6909952 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
1706 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
0 voluntary context switches
2 involuntary context switches
A simple test comparing Python with and without importing `cylc.flow.scheduler.Scheduler` shows the Cylc code (including dependencies) occupies 33.7 MB in memory at runtime, a similar number to the one in the post above:

40,615,936 B - 6,909,952 B = 33,705,984 B
$ cat test.py
from cylc.flow.scheduler import Scheduler
$ /usr/bin/time -lp python test.py
real 0.47
user 0.34
sys 0.10
40615936 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
10017 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
49 voluntary context switches
73 involuntary context switches
$ /usr/bin/time -lp python -c 'print("hello world")'
hello world
real 0.03
user 0.02
sys 0.00
6909952 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
1706 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
0 voluntary context switches
1 involuntary context switches
Using `/usr/bin/time` on `cylc validate datetime-cycling`:
If I strip `cylc/flow/scripts/validate.py` down to just:
def main(*a, **k) -> None:
"""cylc validate CLI."""
return
- stripped `main`: Max RSS 19,248 kB
- with the imports added back in: Max RSS 31,520 kB
- with the unmodified file: Max RSS 35,632 kB
In graphical form:
For comparison, in Cylc 7 the full `cylc validate` has Max RSS 16,428 kB.
`cylc play datetime-cycling --profile -n`:
If `<method 'poll' of 'select.epoll' objects>` is sleeps, are we sleeping too much, considering that Cylc 7 was able to run it in half the wall clock time?
Zooming in:
(Note: everything under `update_profiler_logs` on the left can be ignored, as that doesn't run when not using `--profile`.)
`cylc play complex --profile -n`:

Zooming in to the left (the red squiggly bit is the profiling itself, so it can be ignored):

Zooming in to the centre:

Possible memory leak? (This is not the full workflow run; I stopped it before completion.)
> If `<method 'poll' of 'select.epoll' objects>` is sleeps, are we sleeping too much, considering that Cylc 7 was able to run it in half the wall clock time?
That's mostly `await asyncio.sleep`, which is to be expected. If the workflow takes longer to complete (which it did) there will be more iterations of the main loop, so more sleep between iterations, as you said. We can reduce the main loop sleep for more responsiveness; however, that raises CPU usage.
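To illustrate the point, a toy sketch (the interval and loop body are illustrative, not Cylc's actual values): `await asyncio.sleep` parks the event loop in the poller, which is why profilers attribute the idle time to `select.epoll`.

```python
import asyncio
import time

async def main_loop(interval=1.0, iterations=5):
    for _ in range(iterations):
        # ... process the task pool, queues and events here ...
        await asyncio.sleep(interval)  # shows up under select.epoll in profiles

start = time.perf_counter()
asyncio.run(main_loop())
print(f'{time.perf_counter() - start:.1f}s elapsed, almost all of it sleeping')
```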
Here's the complex workflow running with `--main-loop 'log memory'`:
The largest object, at ~12 MB, is the config, followed by the task pool at ~4 MB; nothing to explain the >100 MB reported, which suggests the remaining memory is perhaps not associated directly with the Scheduler object?

The orange line sloping upwards, which looks like a memory leak, is the data that generated this graph accumulating.

(Scheduler start is everything up to the start of the main loop.)
Looking into these reveals:
> P.S. An indication of how verbose the logging has become (at log level INFO)

That was deliberate (on my part anyway) to help with Cylc 8 development. I expected to pare it back a lot once things stabilized a bit.
SnakeViz for `cylc play datetime-cycling --profile -n`:
@MetRonnie - did you do this for Cylc 7 too?
No, Python 2 vs 3 might make it difficult, but I'll try.
The small datetime-cycling workflow run over dozens of cycles, with memory usage creeping up monotonically:
The Cylc 7 `cylc run --profile` didn't seem to produce a `.prof` file to use with SnakeViz. However, the top functions by cumulative time are (for the complex suite):
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.230 0.230 1513.650 1513.650 ~/github/cylc-7/lib/cylc/scheduler.py:1603(run)
1136 4.405 0.004 721.388 0.635 ~/github/cylc-7/lib/cylc/suite_db_mgr.py:187(process_queued_ops)
1135 0.009 0.000 721.269 0.635 ~/github/cylc-7/lib/cylc/scheduler.py:1275(process_suite_db_queue)
2272 2.048 0.001 674.076 0.297 ~/github/cylc-7/lib/cylc/rundb.py:401(execute_queued_items)
18876 0.160 0.000 453.895 0.024 ~/github/cylc-7/lib/cylc/rundb.py:454(_execute_stmt)
18876 433.598 0.023 433.598 0.023 {method 'executemany' of 'sqlite3.Connection' objects}
864 0.387 0.000 229.521 0.266 ~/github/cylc-7/lib/cylc/scheduler.py:1245(process_task_pool)
1942 212.488 0.109 212.488 0.109 {method 'commit' of 'sqlite3.Connection' objects}
1135 199.721 0.176 199.721 0.176 {time.sleep}
864 0.171 0.000 168.005 0.194 ~/github/cylc-7/lib/cylc/task_job_mgr.py:180(submit_task_jobs)
864 0.269 0.000 159.935 0.185 ~/github/cylc-7/lib/cylc/task_job_mgr.py:159(prep_submit_task_jobs)
3480 0.523 0.000 159.659 0.046 ~/github/cylc-7/lib/cylc/task_job_mgr.py:733(_prep_submit_task_job)
10841 0.091 0.000 144.661 0.013 /usr/lib64/python2.7/subprocess.py:475(_eintr_retry_call)
3480 8.404 0.002 127.785 0.037 ~/github/cylc-7/lib/cylc/job_file.py:44(write)
1136 0.738 0.001 120.515 0.106 ~/github/cylc-7/lib/cylc/scheduler.py:1669(update_state_summary)
5039 0.334 0.000 86.496 0.017 /usr/lib64/python2.7/subprocess.py:650(__init__)
5039 1.094 0.000 85.533 0.017 /usr/lib64/python2.7/subprocess.py:1188(_execute_child)
1136 0.061 0.000 83.884 0.074 ~/github/cylc-7/lib/cylc/subprocpool.py:172(process)
1134 0.048 0.000 76.739 0.068 ~/github/cylc-7/lib/cylc/scheduler.py:1590(update_profiler_logs)
398 0.009 0.000 75.389 0.189 ~/github/cylc-7/lib/cylc/subprocpool.py:158(_proc_exit)
398 0.012 0.000 75.338 0.189 ~/github/cylc-7/lib/cylc/subprocpool.py:358(_run_command_exit)
398 0.010 0.000 75.304 0.189 ~/github/cylc-7/lib/cylc/task_job_mgr.py:690(_submit_task_jobs_callback)
398 0.196 0.000 75.294 0.189 ~/github/cylc-7/lib/cylc/task_job_mgr.py:476(_manip_task_jobs_callback)
1134 0.127 0.000 74.937 0.066 ~/github/cylc-7/lib/cylc/scheduler.py:2031(_update_cpu_usage)
20930 0.114 0.000 73.991 0.004 /usr/lib64/python2.7/logging/__init__.py:1249(_log)
956 13.481 0.014 73.941 0.077 ~/github/cylc-7/lib/cylc/state_summary_mgr.py:50(update)
20930 0.069 0.000 72.642 0.003 /usr/lib64/python2.7/logging/__init__.py:1270(handle)
20930 0.171 0.000 72.549 0.003 /usr/lib64/python2.7/logging/__init__.py:1302(callHandlers)
41846 0.142 0.000 72.377 0.002 /usr/lib64/python2.7/logging/__init__.py:736(handle)
10440 0.271 0.000 72.236 0.007 ~/github/cylc-7/lib/cylc/task_events_mgr.py:286(process_message)
5534 71.470 0.013 71.470 0.013 {posix.read}
So a lot of CPU time is taken up by DB ops in Cylc 7 too.
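(For reference, a table like the one above can be printed straight from a cProfile dump without SnakeViz; a sketch, assuming the stats were saved to `profile.prof`:)

```python
import pstats

# Print the top 30 functions sorted by cumulative time.
stats = pstats.Stats('profile.prof')
stats.sort_stats('cumulative').print_stats(30)
```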
Not sure I understand the time difference: 0:25:18 vs 1:11:06 seems kinda impossible... It's almost as if the job communication isn't working and it's switched to polling mode...
Is the memory overhead the loaded libraries?
How can you tell if the job communication is happening via polling?
I'm getting

# CYLC TASK ENVIRONMENT:
export CYLC_TASK_COMMS_METHOD=zmq

in job scripts.
Example `job.status`:
CYLC_JOB_RUNNER_NAME=background
CYLC_JOB_ID=17391
CYLC_JOB_RUNNER_SUBMIT_TIME=2022-02-01T18:54:42Z
CYLC_JOB_PID=17391
CYLC_JOB_INIT_TIME=2022-02-01T18:54:49Z
CYLC_JOB_EXIT=SUCCEEDED
CYLC_JOB_EXIT_TIME=2022-02-01T18:54:54Z
which is submit --[7s]--> init --[5s]--> exit for a task with `script = sleep 1`.
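Checking those gaps from the `job.status` timestamps above:

```python
from datetime import datetime

def parse(ts):
    return datetime.strptime(ts, '%Y-%m-%dT%H:%M:%SZ')

submit = parse('2022-02-01T18:54:42Z')
init = parse('2022-02-01T18:54:49Z')
exit_ = parse('2022-02-01T18:54:54Z')
print(init - submit, exit_ - init)  # 0:00:07 0:00:05
```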
Let's say there are 5000 sequential tasks (maybe more in parallel); then if each job/task message takes 200 ms longer, with 3 messages per job, that's 3000 s or 50 min extra, which would account for the difference (I don't know the exact numbers; see the sketch below).

I'm, of course, assuming that what we found previously (the time spent loading network libraries, the auth, etc.) is the issue.

This is still a concern, as someone could use Cylc for running many small jobs in the same way... We need to find out exactly where.
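The back-of-envelope arithmetic from above (all of the numbers are assumed, as noted):

```python
tasks = 5000          # sequential tasks (assumed)
messages_per_job = 3  # e.g. started, succeeded and one output (assumed)
overhead = 0.2        # extra seconds per message (assumed)

extra = tasks * messages_per_job * overhead
print(f'{extra:.0f} s = {extra / 60:.0f} min extra')  # 3000 s = 50 min
```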
> How can you tell if the job communication is happening via polling?

Debug mode will tell you how the messages are being received (zmq or otherwise) in the workflow log.
> which is submit --[7s]--> init --[5s]--> exit for a task with `script = sleep 1`

What does Cylc 7 look like? It looks like that's it... If so, then at least it's not the Scheduler internals (I suppose).
> How can you tell if the job communication is happening via polling?

I think we now omit the contact file installation for polling platforms. You can also run the workflow in `--debug` mode and inspect the `job.err` log, which would record the ZMQ send operations for TCP comms.
> then if each job/task message takes 200ms longer with 3 messages each job

We run `cylc message` in a non-blocking way to avoid this issue, e.g.:

It does wait for the started message, but only at the end of the job.
Comparing the same tasks (all with 1 s sleeps, but probably with differing numbers of other tasks running at the same time):
`brief_observations_background_atmos`

| | Cylc 7 | Cylc 8 |
| --- | --- | --- |
| submit -> init | 1s | 4s |
| init -> exit | 1s | 3s |

`ensemble_res_create`

| | Cylc 7 | Cylc 8 |
| --- | --- | --- |
| submit -> init | 2s | 7s |
| init -> exit | 1s | 4s |

`long_assimilation_create_zap_larger`

| | Cylc 7 | Cylc 8 |
| --- | --- | --- |
| submit -> init | 9s | 29s |
| init -> exit | 4s | 23s |
We should create some sort of job profiler, because that's obviously where the issue is. We can find out exactly what's taking so long by fiddling with `cylc/flow/network/client.py` and maybe the job script.
The auth will be doing file reads, and we could print out timing about library loading, client/connection instantiation, and message send/receive.
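For example, a hypothetical timing probe (not existing Cylc instrumentation; the client import path is Cylc 8's) that prints the cost of the first two stages:

```python
import time

t0 = time.perf_counter()
import zmq  # library loading
t1 = time.perf_counter()
from cylc.flow.network.client import WorkflowRuntimeClient  # client import
t2 = time.perf_counter()

print(f'import zmq:    {t1 - t0:.3f}s')
print(f'import client: {t2 - t1:.3f}s')
# next: time client instantiation (auth file reads) and a message round trip
```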
I think the only reason going via the UIS could be faster is that it already has the libraries loaded and a client/connection instantiated.

If we find it is this, one option may be a long-lived client for the life of the job (which may mean having the job wrapper in Python)... But I don't know, that could introduce other delays I suppose...
Surely there's a way to slim down the client end with zmq. (again assuming this is the issue)
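For a sense of scale, a bare pyzmq REQ client is tiny; a sketch only, with a made-up endpoint and payload (real Cylc comms add authentication/encryption on top):

```python
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.REQ)
sock.setsockopt(zmq.RCVTIMEO, 5000)    # fail rather than hang
sock.setsockopt(zmq.LINGER, 0)         # drop queued messages on close
sock.connect('tcp://localhost:43001')  # made-up endpoint
sock.send_json({'command': 'ping'})    # made-up payload
try:
    print(sock.recv_json())
except zmq.error.Again:
    print('no reply (nothing listening)')
```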
(coming back to this issue after we have a release out)
Here's a basic auto-profiling script I wrote for the `increment_graph_window` investigation:
```bash
#!/usr/bin/env bash
set -eu

WORKFLOWS="$(cylc scan --states=all --format=name)"

checkout () {
    local branch="$1"
    ( cd "$HOME/cylc"; git checkout "$branch"; )
}

profile () {
    local branch="$1"
    local branch_path="$(sed 's|/|_|g' <<< "$branch")"
    local tasks="$2"
    local cycles="$3"
    local run_name="${branch_path}__tasks-${tasks}__cycles-${cycles}"
    local workflow_id="diamond/$run_name"
    if [[ "$WORKFLOWS" == *$workflow_id* ]]; then
        echo -e "\n\nUsing cached: branch=$branch tasks=$tasks cycles=$cycles\n\n"
        return
    else
        echo -e "\n\nRunning: branch=$branch tasks=$tasks cycles=$cycles\n\n"
    fi
    cylc clean "${workflow_id}" -y 2>/dev/null || true
    cylc install diamond --run-name="$run_name"

    # profile with cProfile (inner measurement)
    # cylc play "diamond/$run_name" \
    #     --profile \
    #     --no-detach \
    #     -s "TASKS=$tasks" \
    #     -s "CYCLES=$cycles"

    # profile with /usr/bin/time (outer measurement)
    /usr/bin/time \
        -v \
        -o results \
        cylc play \
            "diamond/$run_name" \
            --no-detach \
            -s "TASKS=$tasks" \
            -s "CYCLES=$cycles"
    mv results "$HOME/cylc-run/${workflow_id}/results"
}

run () {
    local branches=(
        'upstream/master'
        'ronnie-local/SPEED-POWER'
        'ronnie/tokens'
        'david/store-tproxy-efficiency'
        'store-tproxy-efficiency-2023-01-24'
        # 'store-tproxy-efficiency-w-platform-call'
        # 'avoid_creating_task_proxies'
        # 'tim/dont_get_platform_too_soon'
        # 'tim/cache_platform_copying'
    )
    for branch in "${branches[@]}"; do
        checkout "$branch"
        profile "$branch" 1 1
        profile "$branch" 5 1
        profile "$branch" 10 1
        profile "$branch" 15 1
        profile "$branch" 20 1
        profile "$branch" 25 1
        profile "$branch" 30 1
        profile "$branch" 35 1
        profile "$branch" 40 1
        profile "$branch" 45 1
        profile "$branch" 50 1
    done
}

run
```

`./report.py`
```python
from contextlib import suppress
from datetime import timedelta
from pathlib import Path

import matplotlib.pyplot as plt


def load(filename):
    """Parse a `/usr/bin/time -v` results file into a dict."""
    results = {}
    with open(filename, 'r') as f:
        for line in f:
            key, value = (i.strip() for i in line.rsplit(': ', 1))
            results[key] = value
    return derive(results)


def time_delta(value):
    """Convert an 'm:ss' string to seconds."""
    try:
        minutes, seconds = value.rsplit(':', 1)
        return timedelta(
            minutes=float(minutes),
            seconds=float(seconds),
        ).total_seconds()
    except Exception:
        raise ValueError(value)


CAST_METHODS = (
    float,
    int,
    time_delta,
)


def cast(value):
    for method in CAST_METHODS:
        with suppress(ValueError):
            return method(value)
    return value


def derive(results):
    _results = {
        key: cast(value)
        for key, value in results.items()
    }
    _results['CPU time (seconds)'] = (
        _results['User time (seconds)']
        + _results['System time (seconds)']
    )
    return _results


def plot(results, key, metric):
    for branch, _results in results.items():
        x = []
        y = []
        for args, __result in _results:
            x.append(args[key])
            y.append(__result[metric])
        plt.plot(x, y, label=branch)
    plt.legend()
    plt.xlabel(key)
    plt.ylabel(metric)
    plt.show()


def run():
    workflow = 'diamond'
    branches = [
        'upstream/master',
        'ronnie/tokens',
        'david/store-tproxy-efficiency',
        'ronnie-local/SPEED-POWER',
        # 'store-tproxy-efficiency-w-platform-call',
        # 'avoid_creating_task_proxies',
        # 'tim/dont_get_platform_too_soon',
        # 'tim/cache_platform_copying',
    ]
    args = [
        {'TASKS': tasks, 'CYCLES': 1}
        for tasks in (1, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50)
    ]
    results = {}
    for branch in branches:
        _results = []
        for argset in args:
            workflow_id = get_id(workflow, branch, **argset)
            try:
                __results = load(
                    Path(
                        '~osanders/cylc-run/', workflow_id, 'results'
                    ).expanduser()
                )
            except Exception as exc:
                print(f'WARNING: skipping {workflow_id}: {exc}')
                continue
            _results.append((argset, __results))
        results[branch] = _results
    plot(results, 'TASKS', 'Elapsed (wall clock) time (h:mm:ss or m:ss)')
    plot(results, 'TASKS', 'CPU time (seconds)')
    plot(results, 'TASKS', 'Maximum resident set size (kbytes)')


def get_id(workflow, branch, **args):
    ret = f'{workflow}/{branch.replace("/", "_")}'
    for key, value in args.items():
        ret += f'__{key.lower()}-{value}'
    return ret


if __name__ == '__main__':
    run()
```
Cylc 7 had an on-board automated profile battery which provided us with insights into Cylc's scaling characteristics.
This profile battery will need a total re-write for Cylc 8; there is an issue for that here: #38. However, this is a relatively large job and one for which we don't have time at the moment, so I suggest manually running some of the old scaling tests to give us an idea of how scaling has changed and how Cylc 8 compares against Cylc 7.
Quick overview of the main scaling dimensions of Cylc:
These dimensions are targeted by suites which were located in `etc/dev-suites` in the Cylc 7 source code.

For MO people, some (rather dated) results from the old automated profile battery can be found here: https://www-nwp/~osanders/profiling-results/2016/
To test scaling you will need to run the same experiment with different template variables.
To profile them I suggest using `/usr/bin/time`, as we have a good level of understanding of this command and results will be comparable to prior tests. In the output we are interested in:
For fair tests you need to run all experiments on the same hardware and to ensure the machine is idle. Running on a cluster may be a good idea but you may need to claim the whole node to meet these conditions (with the obvious resource wastage implications).
We don't have any up-to-date results for Cylc 7 so you will need to profile that too, however, you can use the automated profile battery there if you like. I think you run it like so: