cylc / cylc-flow

Cylc: a workflow engine for cycling systems.
https://cylc.github.io
GNU General Public License v3.0

Suites a large number of cycles into their run can have high CPU usage #1751

Closed: arjclark closed this issue 5 years ago

arjclark commented 8 years ago

Oddly, some quite simple suites which cycle rapidly can result in high CPU usage.

For example, this suite:

[scheduling]
    initial cycle point = 2016-02-01
    [[dependencies]]
        [[[P2D]]]
            graph = """
                retrieve[-P2D] => glm_um_recon
                glm_um_recon[+P2D] => atmos_main
                atmos_main[-P2D] => atmos_main
                retrieve[-P2D] => retrieve
            """
        [[[PT6H]]]
            graph =   glm_um_recon

was regularly using 100% CPU on one of our servers (N.B. there is no [runtime] section, so the tasks are all default sleep 10 operations).

Similarly, this suite also results in high server CPU load:

[cylc]
    UTC mode = True
[scheduling]
    initial cycle point = 20100101T0000Z
    [[dependencies]]
        [[[T0030, T1230]]]
            graph = foo[-P1D] => foo => housekeep
[runtime]
    [[foo]]
        command scripting = sleep 5; echo a message; sleep 5
    [[housekeep]]
        command scripting = rose task-run

I think these problems are a result of the rapid turnover of tasks, which means the cylc daemon is constantly doing something, but we should make sure we're not doing anything to make things worse than they need to be.

arjclark commented 8 years ago

Going through the logs for the first suite, it looks like a new event hit the suite every 2-5 seconds.

matthewrmshin commented 8 years ago

As discussed, it may be worth using Python's resource module to do some profiling. (Or just use the ps-command-based profiling tool in cylc.scheduler.)
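Something along these lines would do for a first look (an illustrative sketch only, not cylc code; the loop below just stands in for wherever the scheduler would call it):

import resource
import time

def log_usage(label):
    # Print cumulative CPU time and peak RSS (kB on Linux) for this process.
    usage = resource.getrusage(resource.RUSAGE_SELF)
    print("%s: user=%.2fs sys=%.2fs maxrss=%dkB"
          % (label, usage.ru_utime, usage.ru_stime, usage.ru_maxrss))

if __name__ == "__main__":
    for i in range(3):
        time.sleep(1)  # stand-in for one pass of the scheduler main loop
        log_usage("iteration %d" % i)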

arjclark commented 8 years ago

Will do.

arjclark commented 8 years ago

Irritatingly, this is a problem that seems to build up with time. It's taken roughly 12 hours for the first example suite above to build up to consistently high CPU usage on my machine.

arjclark commented 8 years ago

Analysis is still ongoing, but these increasingly frequent "Succeeded with unreported outputs" entries in the logs are suggestive of some kind of unintended leak/backlog building up:

2016-03-09T10:25:58Z WARNING - [glm_um_recon.20280414T1200Z] -Assuming output completed:  
 glm_um_recon.20280414T1200Z submitted
cylc jobs-submit --debug -- /home/h06/aclark/cylc-run/RAPID/log/job 20280416T0000Z/retrieve/01
cylc jobs-submit --debug -- /home/h06/aclark/cylc-run/RAPID/log/job 20280414T1800Z/glm_um_recon/01
2016-03-09T10:26:18Z WARNING - [glm_um_recon.20280414T1800Z] -Succeeded with unreported outputs:
  glm_um_recon.20280414T1800Z submitted
2016-03-09T10:26:18Z WARNING - [glm_um_recon.20280414T1800Z] -Assuming output completed:  
 glm_um_recon.20280414T1800Z submitted
cylc jobs-submit --debug -- /home/h06/aclark/cylc-run/RAPID/log/job 20280415T0000Z/glm_um_recon/01
2016-03-09T10:26:34Z WARNING - [glm_um_recon.20280415T0000Z] -Succeeded with unreported outputs:
  glm_um_recon.20280415T0000Z submitted
2016-03-09T10:26:34Z WARNING - [glm_um_recon.20280415T0000Z] -Assuming output completed:  
 glm_um_recon.20280415T0000Z submitted

Initially I don't see any of these, but they become regular once the suite is at that high-CPU point.

arjclark commented 8 years ago

Current diagnosis is that #1053 is to blame: evaluating the next cycle point is expensive after a large number of previous cycles. This is more likely to be seen in a rapidly cycling suite, and more of an issue because the daemon is nearly continuously carrying out those calculations, so what might normally appear as a periodic spike shows up as a continual load.
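(To illustrate how that kind of cost can build up - this is a toy sketch, not cylc's actual implementation - a "next point" evaluation that walks the sequence from its context start point gets linearly slower the further into the run you are, whereas stepping from the current point does not:)

from datetime import datetime, timedelta

START = datetime(2016, 2, 1)   # initial cycle point of the example suite
STEP = timedelta(days=2)       # its P2D recurrence

def naive_next_point(point):
    # Walk from the sequence start until we pass `point`: O(elapsed cycles).
    candidate = START
    while candidate <= point:
        candidate += STEP
    return candidate

def stepped_next_point(point):
    # Step directly from the current point: O(1) per call.
    return point + STEP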

Keeping open pending #1053 to see if that fixes this or just moves the problem elsewhere.

arjclark commented 8 years ago

The underlying cause appears (as suspected) to be the number of cycle points into the suite; it's just that you're more likely to reach the stage where the issue is visible/noticeable in a simple rapidly cycling suite than in one of our more complex suites with clock triggers etc.

See https://github.com/cylc/cylc/pull/1850#issuecomment-221141258 for a good way of reproducing the problem without needing to run the suite for a long time to reach that point.

benfitzpatrick commented 8 years ago

We know we want to do more profiling and @arjclark will do it at some point.

hjoliver commented 8 years ago

[meeting]

Question for @benfitzpatrick - can we save a recent valid point in the db to avoid the initial hit on restart?

arjclark commented 8 years ago

Question for @benfitzpatrick - can we save a recent valid point in the db to avoid the initial hit on restart?

I guess we already have a valid point in the form of the cycle point of the task proxy created on restart (or the minimum cycle point among tasks of the same name in the task pool on restart)? Presumably with a bit of black magic we could re-use that point.

I guess the only problem with that (or any caching) would be if the restart applies a change in graphing that means the existing sequence of dates is no longer valid.
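(Roughly, the check on restart would be something like the sketch below - all names here are hypothetical stand-ins for whatever the restart code actually has to hand:)

def usable_cached_point(cached_point, sequences):
    # Re-use the cached point only if it still lies on every current sequence;
    # otherwise fall back to the usual full evaluation from the initial point.
    if cached_point is None:
        return None
    for sequence in sequences:
        if not sequence.is_on_sequence(cached_point):
            # Graphing has changed, so the old point may no longer be valid.
            return None
    return cached_point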

arjclark commented 7 years ago

@hjoliver - do we still want to worry about the initial hit on a restart? Otherwise I think we're suitably performant at this point.

hjoliver commented 7 years ago

@arjclark - it depends on how big the hit is - have you got a handle on that? Maybe close this issue and put up a new one for the start-up fix. Your task-cycle-point-on-restart (or caching) idea above sounds good to me. Easy enough to check if the old point is still valid for the current sequence settings.

arjclark commented 7 years ago

have you got a handle on that?

Haven't got a real handle on it, no. I'll take a crack at using the cached point once the monster #2157 pull is merged.

matthewrmshin commented 6 years ago

An example:

[scheduling]
    initial cycle point = 2018
    max active cycle points = 96
    [[dependencies]]
        [[[T00, T01, T02, T03, T04, T05, T06, T07, T08, T09, T10, T11, T12, T13, T14, T15, T16, T17, T18, T19, T20, T21, T22, T23]]]
        #[[[PT1H]]]
        #[[[T-00]]]
            graph = 't1 => t2'
[runtime]
    [[t1, t2]]
        script = true

This suite uses 100% CPU in my environment while running, spending all its time parsing cycle points and comparing them. (Replacing the explicit sequences with PT1H results in a 75% reduction in cumulative CPU time.)
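(As a rough analogy for the cost involved - using only the standard library, not cylc or isodatetime - parsing a timestamp string before every comparison is far slower than comparing the strings directly:)

import timeit
from datetime import datetime

A, B, FMT = "20180101T0500Z", "20180101T0600Z", "%Y%m%dT%H%MZ"

parse_cmp = timeit.timeit(
    lambda: datetime.strptime(A, FMT) < datetime.strptime(B, FMT),
    number=100000)
str_cmp = timeit.timeit(lambda: A < B, number=100000)

print("parse then compare: %.3fs   plain string compare: %.3fs"
      % (parse_cmp, str_cmp))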

hjoliver commented 6 years ago

Ouch.

matthewrmshin commented 6 years ago

One underlying problem is that cycle point objects actually store the cycle point values as strings, so in order to compare 2 date-time cycle points the code has to parse the cycle point strings back into TimePoint objects before comparing them. Although the memoize function is supposed to cache the result of the comparison (by storing the 2 cycle point strings as keys), it does not appear to be effective in this case. (The TimePoint.__cmp__ method still gets called an excessive number of times.)

The logic is even stranger for integer cycle points - which are stored as strings for some reason - and we seem to have a whole industry for converting the strings back and forth to integers in order to do some basic integer arithmetic.

I think the code should be modified so we actually store date-time cycle points and intervals as sub-classes of TimePoint and TimeInterval objects, and integer cycle points and intervals as sub-classes of int.
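(A minimal sketch of the integer case, with made-up class names - subclassing int gives comparison and arithmetic for free, with no string round-tripping:)

class IntegerPoint(int):
    # An integer cycle point that is directly comparable and addable.

    def __str__(self):
        # Cycle points are rendered as plain integers in task IDs.
        return "%d" % int(self)

p1, p2 = IntegerPoint(5), IntegerPoint(7)
assert p1 < p2
assert IntegerPoint(p1 + 2) == p2  # interval arithmetic is plain int maths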

We can also optimise the comparison logic between 2 date-time cycle point objects. E.g. if 2 (non-truncated) time points are in the same time zone and have the same parse/dump format (which would normally be the case for cycle points in a suite), then it should be quickest to simply str compare them.
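(The property being relied on here: fixed-width, zero-padded point strings in a single time zone and dump format sort lexicographically in the same order as chronologically, so a plain str compare is safe. For example:)

points = ["20100102T0000Z", "20100101T1230Z", "20091231T2359Z", "20100101T0030Z"]
assert sorted(points) == [
    "20091231T2359Z", "20100101T0030Z", "20100101T1230Z", "20100102T0000Z"]

# It breaks as soon as time zones (or formats) differ, hence the caveats above:
# "20100101T1200+0100" is 1100Z, i.e. earlier than "20100101T1130Z", yet it
# compares as the larger string.
assert "20100101T1200+0100" > "20100101T1130Z"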

benfitzpatrick commented 6 years ago

I feel a bit guilty about this because I think I helped write the many-cycle-point suite in question... intended to get around weekend problems for hourly cycling.

Although the memoize function is supposed to cache the result of the comparison (by storing the 2 cycle point strings as keys), it does not appear to be effective in this case. (The TimePoint.__cmp__ method still gets called an excessive number of times.)

E.g. if 2 (non-truncated) time points are in the same time zone and have the same parse/dump format (which would normally be the case for cycle points in a suite), then it should be quickest to simply str compare them.

The __cmp__ method does compare for string-value equality before moving on to inequality, so that part of the check should be very fast. This was one of the reasons for storing them as strings in the first place, together with easy, cheap memoization of string inputs (the keys need to be hashable). The _iso_point_cmp is slow but memoized.

https://github.com/cylc/cylc/blob/01331bc/lib/cylc/cycling/iso8601.py#L104-L112
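(Roughly the pattern involved - a simplified sketch, not the exact code at that link: results are cached against the string arguments, and once the cache hits its cap an arbitrary entry is evicted with dict.popitem, which is why popitem shows up in the profiles below and why a working set bigger than the cap means constant recomputation:)

MEMOIZE_LIMIT = 10000

def memoize(func):
    cache = {}

    def _wrapper(*args):
        try:
            return cache[args]
        except KeyError:
            if len(cache) >= MEMOIZE_LIMIT:
                cache.popitem()  # evict an arbitrary entry to stay under the cap
            result = cache[args] = func(*args)
            return result
    return _wrapper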

In terms of representing them as actual date-time objects: making the points based on the isodatetime point representation might be pretty slow, since there is deliberately no underlying representation of the datetimes as e.g. Unix time, days since 1900, etc. Parsing string values is not too slow, but going the other way requires dumping. TimePoints as-is are not immutable, so the dump value can't just be cached.

The core problem for that suite looks like the (2x) sorting and get_next_point in release_runahead_tasks, which has to inter-compare ~96 cycle points. I think those parts themselves could be worked around. However, as-is this runs up against the memoize limit of 10000 in lib/cylc/cycling/iso8601.py, so a lot of results need to be recomputed and you see 90% of calls to __cmp__ ending up as _iso_point_cmp. If you set that limit to 100000, the suite above looks a lot healthier and only 10% of calls to __cmp__ have to invoke proper point comparison.

current cylc master, suite stopped after ~60 seconds:

         87840422 function calls (86986487 primitive calls) in 63.130 CPU seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.005    0.005   62.693   62.693 /opt/cylc/lib/cylc/scheduler.py:1381(run)
   315707    0.665    0.000   51.348    0.000 /opt/cylc/lib/cylc/cycling/iso8601.py:104(__cmp__)
885365/317833    1.461    0.000   50.685    0.000 /opt/cylc/lib/cylc/cycling/iso8601.py:72(_wrapper)
   283766    0.367    0.000   49.378    0.000 /opt/cylc/lib/cylc/cycling/iso8601.py:146(_iso_point_cmp)
       25    0.019    0.001   48.376    1.935 /opt/cylc/lib/cylc/task_pool.py:231(release_runahead_tasks)
      246    0.212    0.001   44.518    0.181 {sorted}
   290988    2.205    0.000   36.942    0.000 /opt/cylc/lib/isodatetime/data.py:1294(__cmp__)
575590/291824    0.492    0.000   36.333    0.000 {cmp}
   581976    8.103    0.000   19.292    0.000 /opt/cylc/lib/isodatetime/data.py:1284(get_props)
   878200    7.698    0.000   17.875    0.000 /opt/cylc/lib/isodatetime/data.py:1276(copy)
  2062066    2.290    0.000   14.042    0.000 /opt/cylc/lib/isodatetime/data.py:617(copy)
......
   273867    0.058    0.000    0.058    0.000 {method 'popitem' of 'dict' objects}

(popitem is called within cylc by the memoize and Sequence-equivalent functions, although I expect it is called elsewhere too)

100,000 memoize limit, suite stopped after ~60 seconds:

         56065626 function calls (55705236 primitive calls) in 60.908 CPU seconds

   Ordered by: cumulative time
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.014    0.014   60.053   60.053 /opt/cylc/lib/cylc/scheduler.py:1381(run)
       73    0.067    0.001   25.189    0.345 /opt/cylc/lib/cylc/task_pool.py:231(release_runahead_tasks)
    54024    0.226    0.000   23.882    0.000 /opt/cylc/lib/cylc/cycling/iso8601.py:510(get_next_point)
  1170558    1.026    0.000   20.807    0.000 /opt/cylc/lib/cylc/cycling/iso8601.py:104(__cmp__)
       74   20.537    0.278   20.537    0.278 {time.sleep}
1441767/1206161    1.164    0.000   19.803    0.000 /opt/cylc/lib/cylc/cycling/iso8601.py:72(_wrapper)
   164093    1.230    0.000   19.133    0.000 /opt/cylc/lib/isodatetime/data.py:1294(__cmp__)
   117803    0.153    0.000   18.758    0.000 /opt/cylc/lib/cylc/cycling/iso8601.py:146(_iso_point_cmp)
285339/167536    0.235    0.000   13.775    0.000 {cmp}
    42801    0.212    0.000   12.323    0.000 /opt/cylc/lib/cylc/cycling/iso8601.py:562(get_next_point_on_sequence)
   328186    3.913    0.000    9.937    0.000 /opt/cylc/lib/isodatetime/data.py:1284(get_props)
   529110    3.762    0.000    9.511    0.000 /opt/cylc/lib/isodatetime/data.py:1276(copy)
...
    21051    0.008    0.000    0.008    0.000 {method 'popitem' of 'dict' objects}

The popitem count reduces to 3000 with a 1,000,000 memoize limit, but this has little further effect on speed.

Basically, lifting the memoize limit in iso8601.py will help (this is also true, to a lesser extent, of the Sequence _MAX_CACHED_POINTS). Ultimately it's then the __cmp__ in isodatetime.data.TimePoint that really matters...

An attempt including isodatetime changes (which may not be optimal generally) gives:

         21807510 function calls (21260213 primitive calls) in 60.862 CPU seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.023    0.023   60.537   60.537 /opt/cylc/lib/cylc/scheduler.py:1381(run)
      116   37.587    0.324   37.587    0.324 {time.sleep}
  2000623    1.478    0.000    9.700    0.000 /opt/cylc/lib/cylc/cycling/iso8601.py:104(__cmp__)
2280151/2012795    1.457    0.000    8.229    0.000 /opt/cylc/lib/cylc/cycling/iso8601.py:72(_wrapper)
      115    0.009    0.000    7.699    0.067 /opt/cylc/lib/cylc/scheduler.py:1205(process_task_pool)
      116    0.085    0.001    7.055    0.061 /opt/cylc/lib/cylc/task_pool.py:231(release_runahead_tasks)
   133678    0.147    0.000    6.895    0.000 /opt/cylc/lib/cylc/cycling/iso8601.py:146(_iso_point_cmp)
   284612    0.238    0.000    5.626    0.000 /opt/cylc/lib/cylc/cycling/iso8601.py:756(point_parse)
   317846    2.175    0.000    5.612    0.000 /opt/cylc/lib/isodatetime/data.py:1286(copy)
     1345    0.610    0.000    4.796    0.004 {sorted}
      115    0.002    0.000    4.483    0.039 /opt/cylc/lib/cylc/task_pool.py:964(spawn_all_tasks)
       74    0.001    0.000    4.480    0.061 /opt/cylc/lib/cylc/task_pool.py:951(force_spawn)
      117    0.008    0.000    4.468    0.038 /opt/cylc/lib/cylc/suite_db_mgr.py:177(process_queued_ops)

using this diff:

diff --git a/lib/cylc/cycling/iso8601.py b/lib/cylc/cycling/iso8601.py
index 49f2e808f..19b58465f 100755
--- a/lib/cylc/cycling/iso8601.py
+++ b/lib/cylc/cycling/iso8601.py
@@ -34,7 +34,7 @@ from parsec.validate import IllegalValueError
 CYCLER_TYPE_ISO8601 = "iso8601"
 CYCLER_TYPE_SORT_KEY_ISO8601 = "b"

-MEMOIZE_LIMIT = 10000
+MEMOIZE_LIMIT = 100000

 DATE_TIME_FORMAT = "CCYYMMDDThhmm"
 EXPANDED_DATE_TIME_FORMAT = "+XCCYYMMDDThhmm"
@@ -321,7 +321,7 @@ class ISO8601Sequence(SequenceBase):

     TYPE = CYCLER_TYPE_ISO8601
     TYPE_SORT_KEY = CYCLER_TYPE_SORT_KEY_ISO8601
-    _MAX_CACHED_POINTS = 100
+    _MAX_CACHED_POINTS = 1000

     __slots__ = ('dep_section', 'context_start_point', 'context_end_point',
                  'offset', '_cached_first_point_values',
diff --git a/lib/isodatetime/data.py b/lib/isodatetime/data.py
index 539c070f5..eaecb0d3a 100644
--- a/lib/isodatetime/data.py
+++ b/lib/isodatetime/data.py
@@ -619,6 +619,15 @@ class TimeZone(Duration):
         return TimeZone(hours=self.hours, minutes=self.minutes,
                         unknown=self.unknown)

+    def __cmp__(self, other):
+        if not isinstance(other, TimeZone):
+            raise TypeError(
+                "Invalid type for comparison: " +
+                "'%s' should be TimeZone." %
+                type(other).__name__
+            )
+        return cmp((self.hours, self.minutes), (other.hours, other.minutes))
+
     def __str__(self):
         if self.unknown:
             return ""
@@ -1301,8 +1310,6 @@ class TimePoint(object):
             raise TypeError(
                 "Cannot compare truncated to non-truncated " +
                 "TimePoint: %s, %s" % (self, other))
-        if self.get_props() == other.get_props():
-            return 0
         if self.truncated:
             for attribute in self.DATA_ATTRIBUTES:
                 other_attr = getattr(other, attribute)
@@ -1310,8 +1317,9 @@ class TimePoint(object):
                 if other_attr != self_attr:
                     return cmp(self_attr, other_attr)
             return 0
-        other = other.copy()
-        other.set_time_zone(self.get_time_zone())
+        if other.get_time_zone() != self.get_time_zone():
+            other = other.copy()
+            other.set_time_zone(self.get_time_zone())
         if self.get_is_calendar_date():
             my_date = self.get_calendar_date()
             other_date = other.get_calendar_date()

I think this would need profiling against other suites, particularly the removal of the get_props equality check. It does save quite a bit of time here though. Memory use might be an issue perhaps?

benfitzpatrick commented 6 years ago

We can also optimise the comparison logic between 2 date-time cycle point objects. E.g. if 2 (non-truncated) time points are in the same time zone and have the same parse/dump format (which would normally be the case for cycle points in a suite), then it should be quickest to simply str compare them.

👍

arjclark commented 6 years ago

Memory use might be an issue perhaps?

That's certainly my recollection of why we ended up with the 10000 cap we did.
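(A back-of-envelope check on that: each cached comparison is roughly a 2-string tuple key plus a small int result, so the cap size translates into memory along these lines - rough numbers only, ignoring dict overhead and any string sharing:)

import sys

key = ("20180101T0500Z", "20180101T0600Z")
per_entry = (sys.getsizeof(key)
             + sum(sys.getsizeof(s) for s in key)
             + sys.getsizeof(-1))
for limit in (10000, 100000, 1000000):
    print("%8d entries ~ %6.1f MB (excluding dict overhead)"
          % (limit, limit * per_entry / 1.0e6))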

benfitzpatrick commented 6 years ago

Another fun (hypothetical) use case for many cycle points would be an inside-out post-processing suite representing multiple incoming forecasts, cycling on validity time and pushing model+cycle+leadtime into tasks under that cycle... wasteful but an interesting exercise!

matthewrmshin commented 5 years ago

This has never been reliably repeated or observed again in real suites. Let's say it will be solved by that giant #3304, and close this down.