Going through the logs for the first suite, it looks like a new event hit the suite every 2-5 seconds.
As discussed, maybe worth using Python's resource module to do some profiling. (Or just use the ps command-based profiling tool in cylc.scheduler.)
Will do.
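For reference, a minimal sketch of the kind of snapshot the resource module can give (illustrative only: the function and logger here are made up, this is not the existing ps-based tool in cylc.scheduler):

import resource
import time

def log_cpu_usage(log, interval=60.0):
    # Periodically record the daemon's own CPU time and peak memory
    # (ru_maxrss is in kilobytes on Linux) via the standard library.
    while True:
        usage = resource.getrusage(resource.RUSAGE_SELF)
        log.info("user=%.1fs sys=%.1fs maxrss=%dKB",
                 usage.ru_utime, usage.ru_stime, usage.ru_maxrss)
        time.sleep(interval)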
Irritatingly, this is a problem that seems to build up with time. It's taken roughly 12 hours for the first example suite above to build up to a consistently high CPU usage on my machine.
Analysis is still ongoing, but these increasingly frequent "succeeded with unreported outputs" entries in the logs are suggestive of some kind of unintended leak/backlog building up:
2016-03-09T10:25:58Z WARNING - [glm_um_recon.20280414T1200Z] -Assuming output completed:
glm_um_recon.20280414T1200Z submitted
cylc jobs-submit --debug -- /home/h06/aclark/cylc-run/RAPID/log/job 20280416T0000Z/retrieve/01
cylc jobs-submit --debug -- /home/h06/aclark/cylc-run/RAPID/log/job 20280414T1800Z/glm_um_recon/01
2016-03-09T10:26:18Z WARNING - [glm_um_recon.20280414T1800Z] -Succeeded with unreported outputs:
glm_um_recon.20280414T1800Z submitted
2016-03-09T10:26:18Z WARNING - [glm_um_recon.20280414T1800Z] -Assuming output completed:
glm_um_recon.20280414T1800Z submitted
cylc jobs-submit --debug -- /home/h06/aclark/cylc-run/RAPID/log/job 20280415T0000Z/glm_um_recon/01
2016-03-09T10:26:34Z WARNING - [glm_um_recon.20280415T0000Z] -Succeeded with unreported outputs:
glm_um_recon.20280415T0000Z submitted
2016-03-09T10:26:34Z WARNING - [glm_um_recon.20280415T0000Z] -Assuming output completed:
glm_um_recon.20280415T0000Z submitted
Initially I don't see any of these, but they become regular once the suite is at that high-CPU point.
Current diagnosis is that #1053 is to blame: evaluation of the next cycle point is expensive after a large number of previous cycles. This is more likely to be seen in a rapidly cycling suite, and is more of an issue there because the scheduler is nearly continuously carrying out those calculations, so what might normally appear as a periodic spike is seen as a continual load.
Keeping open pending #1053 to see if that fixes this or just moves the problem elsewhere.
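For illustration only (this is not the actual Cylc code, just the shape of the problem): if the next valid cycle point is found by stepping forward from the initial point, the cost of each evaluation grows with the number of cycles already elapsed.

def naive_next_point(initial, step, after):
    # Step forward from the initial cycle point until we pass 'after'.
    # The iteration count grows with (after - initial) / step, so a suite
    # that is thousands of cycles in pays thousands of steps per call,
    # and a rapidly cycling suite makes such calls almost continuously.
    point = initial
    while point <= after:
        point += step
    return point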
The underlying cause appears (as suspected) to be the number of cycle points into the suite; it's just that you're more likely to reach a stage where the issue is visible/noticeable in a simple rapidly cycling suite than in one of our more complex suites with clock triggers etc.
See https://github.com/cylc/cylc/pull/1850#issuecomment-221141258 for a good way of reproducing the problem without needing to run the suite for a long time to get to that point.
We know we want to do more profiling and @arjclark will do it at some point.
[meeting]
Question for @benfitzpatrick - can we save a recent valid point in the db to avoid the initial hit on restart?
I guess we already have a valid point in the form of the cycle point for the created task proxy on restart (or the minimum cycle point from tasks of the same name in the task pool on restart)? Presumably with a bit of black magic we could re-use that point.
I guess the only problem for that (or any caching) would be if the restart applies a change in graphing that means the existing sequence of dates used is no longer valid.
@hjoliver - do we still want to worry about the initial hit on a restart? Otherwise I think we're suitably performant at this point.
@arjclark - it depends on how big the hit is - have you got a handle on that? Maybe close this issue and put up a new one for the start-up fix. Your task-cycle-point-on-restart (or caching) idea above sounds good to me. Easy enough to check if the old point is still valid for the current sequence settings.
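Something along these lines perhaps (a sketch only, with made-up helper names): on restart, re-check the cached point against the current graph sequences before trusting it.

def cached_point_is_valid(point_string, sequences):
    # Hypothetical helper: True if the cycle point stored in the suite DB
    # still lies on at least one of the suite's current sequences, i.e. a
    # graphing change applied at restart has not invalidated it.
    point = get_point(point_string)  # assumed parser for the stored string
    return any(seq.is_on_sequence(point) for seq in sequences)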
have you got a handle on that?
Haven't got a real handle on it, no. I'll take a crack at using the cached point once the monster #2157 pull is merged.
An example:
[scheduling]
    initial cycle point = 2018
    max active cycle points = 96
    [[dependencies]]
        [[[T00, T01, T02, T03, T04, T05, T06, T07, T08, T09, T10, T11, T12, T13, T14, T15, T16, T17, T18, T19, T20, T21, T22, T23]]]
        #[[[PT1H]]]
        #[[[T-00]]]
            graph = 't1 => t2'
[runtime]
    [[t1, t2]]
        script = true
This suite uses 100% CPU in my environment while running, spending all its time parsing cycle points and comparing them. (Replacing the explicit sequences with PT1H results in a 75% reduction in cumulative CPU time.)
Ouch.
One underlying problem is that cycle point objects actually store the cycle point values as strings, so in order to compare 2 date-time cycle points, the strings have to be parsed back to TimePoint objects before comparison. Although the memoize function is supposed to cache the result of the comparison (by storing the 2 cycle point strings as keys), it does not appear to be effective in this case. (The TimePoint.__cmp__ method still gets called an excessive number of times.)
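For context, the memoize pattern in question is roughly this (a sketch, not the exact code in lib/cylc/cycling/iso8601.py): results are cached in a plain dict keyed on the string arguments, and once the cache is full an arbitrary entry is evicted with popitem().

def memoize(function, limit=10000):
    cache = {}
    def _wrapper(*args):
        # The (hashable) cycle point strings form the cache key.
        try:
            return cache[args]
        except KeyError:
            if len(cache) >= limit:
                cache.popitem()  # arbitrary eviction, not LRU
            result = function(*args)
            cache[args] = result
            return result
    return _wrapper

With a hot working set bigger than the limit, entries are evicted before they can be reused, which would be consistent with the excessive __cmp__ call counts.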
The logic is even stranger for integer cycle points - which are stored as strings for some reason - and we seem to have a whole industry converting the strings back and forth to integers in order to do some basic integer arithmetic.
I think the code should be modified so we actually store date-time cycle points and intervals as sub-classes of TimePoint and TimeInterval objects, and integer cycle points and intervals as sub-classes of int.
We can also optimise the comparison logic between 2 date-time cycle point objects. E.g. if 2 (non-truncated) time points are in the same time zone and have the same parse/dump format (which would normally be the case for cycle points in a suite), then it should be quickest to simply str compare them.
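A minimal sketch of that fast path (assuming basic-format point strings like 20180101T0600Z; _iso_point_cmp stands in for the existing full comparison):

def quick_point_cmp(p1, p2):
    if p1 == p2:
        return 0
    if len(p1) == len(p2) and p1.endswith("Z") and p2.endswith("Z"):
        # Same dump format and same (UTC) time zone designator, so
        # lexicographic order is chronological order.
        return cmp(p1, p2)
    # Otherwise fall back to parsing both strings to TimePoint objects.
    return _iso_point_cmp(p1, p2)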
I feel a bit guilty about this because I think I helped write the many-cycle-point suite in question... intended to get around weekend problems for hourly cycling.
Although the memoize function is supposed to cache the result of the comparison (by storing the 2 cycle point strings as keys), it does not appear to be effective in this case. (The TimePoint.__cmp__ method still gets called an excessive number of times.)
E.g. If 2 (non-truncated) time points are in the same time zone and have the same parse/dump format, (which would normally be the case for cycle points in a suite), then it should be quickest to simply str compare them.
The __cmp__ method does compare for string value equality before moving on to inequality, so that part of the check should be very fast. This was one of the reasons for storing them as strings in the first place, together with easy, cheap memoization of string inputs (the keys need to be hashable). The _iso_point_cmp is slow but memoized.
https://github.com/cylc/cylc/blob/01331bc/lib/cylc/cycling/iso8601.py#L104-L112
In terms of representing them as actual date-time objects - making the points based on isodatetime point representation might be pretty slow, since there is deliberately no underlying representation of the datetimes as e.g. Unix time, days since 1900, etc. Parsing string values is not too slow, and the other way around requires dumping. TimePoints as-is are not immutable, so the dump value can't just be cached.
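For what it's worth, if a point were effectively immutable once constructed, the dump value could be cached on the instance, e.g. (hypothetical wrapper, not the current isodatetime API):

class FrozenPoint(object):
    # Hypothetical: once the underlying TimePoint can no longer be
    # mutated, its dumped string can be computed once and reused for
    # cheap string comparisons.
    def __init__(self, time_point):
        self._point = time_point
        self._dumped = None

    def dumped(self):
        if self._dumped is None:
            self._dumped = str(self._point)
        return self._dumped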
The core problem for that suite looks like the (2x) sorting and get_next_point in release_runahead_tasks, which has to inter-compare ~96 cycle points. I think those parts themselves could be worked around. However, as-is this runs up against the memoize limit of 10000 in lib/cylc/cycling/iso8601.py, so a lot of results have to be recomputed and you see 90% of calls to __cmp__ ending up in _iso_point_cmp. If you set that limit to 100000 the suite above looks a lot healthier and only 10% of calls to __cmp__ have to invoke proper point comparison.
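As a back-of-envelope check on why the 10000 limit is marginal for this suite (rough numbers only, not measured):

active_points = 96
# Upper bound on distinct ordered (p1, p2) keys among the active cycle
# points alone; waiting/runahead points and other comparison sites in the
# main loop add more on top of this.
print(active_points * (active_points - 1))  # 9120, already close to 10000

Once the working set of keys exceeds the limit, popitem() throws out entries that are needed again on the next pass, so the cache churns instead of helping.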
current cylc master, suite stopped after ~60 seconds:
87840422 function calls (86986487 primitive calls) in 63.130 CPU seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.005 0.005 62.693 62.693 /opt/cylc/lib/cylc/scheduler.py:1381(run)
315707 0.665 0.000 51.348 0.000 /opt/cylc/lib/cylc/cycling/iso8601.py:104(__cmp__)
885365/317833 1.461 0.000 50.685 0.000 /opt/cylc/lib/cylc/cycling/iso8601.py:72(_wrapper)
283766 0.367 0.000 49.378 0.000 /opt/cylc/lib/cylc/cycling/iso8601.py:146(_iso_point_cmp)
25 0.019 0.001 48.376 1.935 /opt/cylc/lib/cylc/task_pool.py:231(release_runahead_tasks)
246 0.212 0.001 44.518 0.181 {sorted}
290988 2.205 0.000 36.942 0.000 /opt/cylc/lib/isodatetime/data.py:1294(__cmp__)
575590/291824 0.492 0.000 36.333 0.000 {cmp}
581976 8.103 0.000 19.292 0.000 /opt/cylc/lib/isodatetime/data.py:1284(get_props)
878200 7.698 0.000 17.875 0.000 /opt/cylc/lib/isodatetime/data.py:1276(copy)
2062066 2.290 0.000 14.042 0.000 /opt/cylc/lib/isodatetime/data.py:617(copy)
......
273867 0.058 0.000 0.058 0.000 {method 'popitem' of 'dict' objects}
(popitem is called within cylc for the memoize and Sequence-equivalent functions, although I expect elsewhere too.)
100,000 memoize limit, suite stopped after ~60 seconds:
56065626 function calls (55705236 primitive calls) in 60.908 CPU seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.014 0.014 60.053 60.053 /opt/cylc/lib/cylc/scheduler.py:1381(run)
73 0.067 0.001 25.189 0.345 /opt/cylc/lib/cylc/task_pool.py:231(release_runahead_tasks)
54024 0.226 0.000 23.882 0.000 /opt/cylc/lib/cylc/cycling/iso8601.py:510(get_next_point)
1170558 1.026 0.000 20.807 0.000 /opt/cylc/lib/cylc/cycling/iso8601.py:104(__cmp__)
74 20.537 0.278 20.537 0.278 {time.sleep}
1441767/1206161 1.164 0.000 19.803 0.000 /opt/cylc/lib/cylc/cycling/iso8601.py:72(_wrapper)
164093 1.230 0.000 19.133 0.000 /opt/cylc/lib/isodatetime/data.py:1294(__cmp__)
117803 0.153 0.000 18.758 0.000 /opt/cylc/lib/cylc/cycling/iso8601.py:146(_iso_point_cmp)
285339/167536 0.235 0.000 13.775 0.000 {cmp}
42801 0.212 0.000 12.323 0.000 /opt/cylc/lib/cylc/cycling/iso8601.py:562(get_next_point_on_sequence)
328186 3.913 0.000 9.937 0.000 /opt/cylc/lib/isodatetime/data.py:1284(get_props)
529110 3.762 0.000 9.511 0.000 /opt/cylc/lib/isodatetime/data.py:1276(copy)
...
21051 0.008 0.000 0.008 0.000 {method 'popitem' of 'dict' objects}
The popitem count reduces to 3000 for a 1,000,000 memoize limit, but this has little further effect on speed.
Basically, lifting the memoize limit in iso8601.py will help (also true to a lesser extent for the Sequence _MAX_CACHED_POINTS). Ultimately it's then the __cmp__ in isodatetime.data.TimePoint that really matters...
An attempt including isodatetime changes (which may not be optimal generally) gives:
21807510 function calls (21260213 primitive calls) in 60.862 CPU seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.023 0.023 60.537 60.537 /opt/cylc/lib/cylc/scheduler.py:1381(run)
116 37.587 0.324 37.587 0.324 {time.sleep}
2000623 1.478 0.000 9.700 0.000 /opt/cylc/lib/cylc/cycling/iso8601.py:104(__cmp__)
2280151/2012795 1.457 0.000 8.229 0.000 /opt/cylc/lib/cylc/cycling/iso8601.py:72(_wrapper)
115 0.009 0.000 7.699 0.067 /opt/cylc/lib/cylc/scheduler.py:1205(process_task_pool)
116 0.085 0.001 7.055 0.061 /opt/cylc/lib/cylc/task_pool.py:231(release_runahead_tasks)
133678 0.147 0.000 6.895 0.000 /opt/cylc/lib/cylc/cycling/iso8601.py:146(_iso_point_cmp)
284612 0.238 0.000 5.626 0.000 /opt/cylc/lib/cylc/cycling/iso8601.py:756(point_parse)
317846 2.175 0.000 5.612 0.000 /opt/cylc/lib/isodatetime/data.py:1286(copy)
1345 0.610 0.000 4.796 0.004 {sorted}
115 0.002 0.000 4.483 0.039 /opt/cylc/lib/cylc/task_pool.py:964(spawn_all_tasks)
74 0.001 0.000 4.480 0.061 /opt/cylc/lib/cylc/task_pool.py:951(force_spawn)
117 0.008 0.000 4.468 0.038 /opt/cylc/lib/cylc/suite_db_mgr.py:177(process_queued_ops)
using this diff:
diff --git a/lib/cylc/cycling/iso8601.py b/lib/cylc/cycling/iso8601.py
index 49f2e808f..19b58465f 100755
--- a/lib/cylc/cycling/iso8601.py
+++ b/lib/cylc/cycling/iso8601.py
@@ -34,7 +34,7 @@ from parsec.validate import IllegalValueError
CYCLER_TYPE_ISO8601 = "iso8601"
CYCLER_TYPE_SORT_KEY_ISO8601 = "b"
-MEMOIZE_LIMIT = 10000
+MEMOIZE_LIMIT = 100000
DATE_TIME_FORMAT = "CCYYMMDDThhmm"
EXPANDED_DATE_TIME_FORMAT = "+XCCYYMMDDThhmm"
@@ -321,7 +321,7 @@ class ISO8601Sequence(SequenceBase):
TYPE = CYCLER_TYPE_ISO8601
TYPE_SORT_KEY = CYCLER_TYPE_SORT_KEY_ISO8601
- _MAX_CACHED_POINTS = 100
+ _MAX_CACHED_POINTS = 1000
__slots__ = ('dep_section', 'context_start_point', 'context_end_point',
'offset', '_cached_first_point_values',
diff --git a/lib/isodatetime/data.py b/lib/isodatetime/data.py
index 539c070f5..eaecb0d3a 100644
--- a/lib/isodatetime/data.py
+++ b/lib/isodatetime/data.py
@@ -619,6 +619,15 @@ class TimeZone(Duration):
return TimeZone(hours=self.hours, minutes=self.minutes,
unknown=self.unknown)
+ def __cmp__(self, other):
+ if not isinstance(other, TimeZone):
+ raise TypeError(
+ "Invalid type for comparison: " +
+ "'%s' should be TimeZone." %
+ type(other).__name__
+ )
+ return cmp((self.hours, self.minutes), (other.hours, other.minutes))
+
def __str__(self):
if self.unknown:
return ""
@@ -1301,8 +1310,6 @@ class TimePoint(object):
raise TypeError(
"Cannot compare truncated to non-truncated " +
"TimePoint: %s, %s" % (self, other))
- if self.get_props() == other.get_props():
- return 0
if self.truncated:
for attribute in self.DATA_ATTRIBUTES:
other_attr = getattr(other, attribute)
@@ -1310,8 +1317,9 @@ class TimePoint(object):
if other_attr != self_attr:
return cmp(self_attr, other_attr)
return 0
- other = other.copy()
- other.set_time_zone(self.get_time_zone())
+ if other.get_time_zone() != self.get_time_zone():
+ other = other.copy()
+ other.set_time_zone(self.get_time_zone())
if self.get_is_calendar_date():
my_date = self.get_calendar_date()
other_date = other.get_calendar_date()
I think this would need profiling against other suites, particularly the removal of the get_props equality check. It does save quite a bit of time here though. Memory use might be an issue perhaps?
We can also optimise the comparison logic between 2 date-time cycle point objects. E.g. If 2 (non-truncated) time points are in the same time zone and have the same parse/dump format, (which would normally be the case for cycle points in a suite), then it should be quickest to simply str compare them.
👍
Memory use might be an issue perhaps?
That's certainly my recollection of why we ended up with the 10000 cap we did.
Another fun (hypothetical) use case for many cycle points would be an inside-out post-processing suite representing multiple incoming forecasts cycling on validity time, pushing model+cycle+leadtime into tasks under that cycle... wasteful but an interesting exercise!
This has never been reliably reproduced or observed again in real suites. Let's say it will be solved by that giant #3304, and close this down.
Oddly, some quite simple suites which cycle rapidly can result in high CPU usage.
For example, this suite:
was regularly using 100% CPU on one of our servers (N.B. there is no runtime, so this is all sleep 10 operations)
Similarly, this suite also results in high server cpu load:
I think these problems are a result of the rapid turnover of tasks, which means the cylc daemon is constantly doing something, but we should make sure we're not doing anything to make things worse than they need to be.