cylc / cylc-flow

Cylc: a workflow engine for cycling systems.
https://cylc.github.io
GNU General Public License v3.0
330 stars 93 forks source link

High CPU usage #3512

Open dpmatthews opened 4 years ago

dpmatthews commented 4 years ago

The workflow below is a much simplified version of a real case. Running this workflow results in the scheduler constantly using 100% CPU until stopped. Tested on 7.8.4 & master.

The CPU usage reduces if you reduce the number of DATES or the max active cylc points. The problem also seems to disappear if you remove the [[[R1/{{date}}+PT36H]]] graph section (which I find surprising).

#!jinja2
{% set DATES = ['20170712T0000Z', '20170725T0000Z', '20170729T0000Z', '20170804T0000Z', '20170809T0000Z', '20170812T0000Z', '20170818T0000Z', '20170823T0000Z', '20170902T0000Z', '20170919T0000Z', '20170923T0000Z', '20170930T0000Z', '20171011T0000Z', '20171020T0000Z', '20171028T0000Z', '20171101T0000Z', '20171104T0000Z', '20171115T0000Z', '20171120T0000Z', '20171129T0000Z', '20171207T0000Z', '20171215T0000Z', '20171220T0000Z', '20171224T0000Z', '20171227T0000Z', '20180104T0000Z', '20180109T0000Z', '20180120T0000Z', '20180129T0000Z', '20180206T0000Z', '20180209T0000Z', '20180214T0000Z', '20180217T0000Z', '20180302T0000Z', '20180310T0000Z', '20180313T0000Z', '20180318T0000Z', '20180322T0000Z', '20180325T0000Z', '20180328T0000Z', '20180402T0000Z', '20180408T0000Z', '20180411T0000Z', '20180415T0000Z', '20180418T0000Z', '20180421T0000Z', '20180426T0000Z', '20180429T0000Z', '20180503T0000Z', '20180506T0000Z', '20180510T0000Z', '20180513T0000Z', '20180519T0000Z', '20180522T0000Z', '20180527T0000Z', '20180603T0000Z', '20180607T0000Z', '20180610T0000Z', '20180613T0000Z', '20180619T0000Z', '20180706T0000Z', '20180711T0000Z', '20180714T0000Z', '20180718T0000Z', '20180721T0000Z', '20180726T0000Z', '20180802T0000Z', '20180809T0000Z', '20180813T0000Z', '20180816T0000Z', '20180821T0000Z', '20180829T0000Z', '20180904T0000Z', '20180907T0000Z', '20180910T0000Z', '20181004T0000Z', '20181013T0000Z', '20181019T0000Z', '20181023T0000Z', '20181027T0000Z', '20181030T0000Z', '20181104T0000Z', '20181107T0000Z', '20181110T0000Z', '20181115T0000Z', '20181118T0000Z', '20181122T0000Z', '20181128T0000Z', '20181203T0000Z', '20181206T0000Z', '20181210T0000Z', '20181215T0000Z', '20181219T0000Z', '20181224T0000Z', '20181228T0000Z', '20190102T0000Z', '20190109T0000Z', '20190112T0000Z', '20190119T0000Z', '20190124T0000Z', '20190131T0000Z', '20190208T0000Z', '20190211T0000Z', '20190214T0000Z', '20190217T0000Z', '20190222T0000Z', '20190226T0000Z', '20190301T0000Z', '20190308T0000Z', '20190316T0000Z', '20190325T0000Z', '20190329T0000Z', '20190404T0000Z', '20190408T0000Z', '20190411T0000Z', '20190419T0000Z', '20190422T0000Z', '20190426T0000Z', '20190429T0000Z', '20190503T0000Z', '20190507T0000Z', '20190510T0000Z', '20190513T0000Z', '20190521T0000Z', '20190524T0000Z', '20190527T0000Z', '20190530T0000Z', '20190602T0000Z', '20190605T0000Z', '20190608T0000Z', '20190612T0000Z', '20190617T0000Z', '20190623T0000Z', '20190626T0000Z', '20190629T0000Z', '20190702T0000Z', '20190709T0000Z'] %}
[cylc]
    UTC mode = True
[scheduling]
    initial cycle point = {{DATES|sort|first}}
    final cycle point = {{DATES|sort|last}}
    max active cycle points = 6
    [[dependencies]]
        [[[ R1 ]]]
            graph = "install_cold"
    {%- for date in DATES|sort : %}
        [[[ R1/{{date}}]]]
            graph = "install_cold[^] => fcst => process"
        [[[ R36/{{date}}+PT1H/PT1H ! (T01,T02,T03,T04,T05,T07,T08,T09,T10,T11,T13,T14,T15,T16,T17,T19,T20,T21,T22,T23)]]]
            graph = "process[-PT1H] => process"
    [[[ R36/{{date}}+PT1H/PT1H ! (T00, T06, T12, T18) ]]]
            graph = "process[-PT1H] => process"
    [[[ R1/{{date}}+PT36H ]]]
            graph = "process => housekeep"
    {%- endfor %}
[runtime]
    [[install_cold,fcst,process,housekeep]]
        script = "false"
dwsutherland commented 4 years ago

@dpmatthews: note; until we remove Cylc7 machinery (i.e. state_summary_mgr), or manage to disable, it we can’t gauge the extent of the load at Cylc8..

@hjoliver I wonder how SoD will handle this workflow.

hjoliver commented 4 years ago

This suite (with 2 tasks per cycle) also takes a long time to validate 15s on my laptop, vs 2s for the 1000 task per cycle complex suite).

Basically it generates a pathological number of graph recurrences (550?) that presumably require loads of ongoing isodatetime computation. It is odd that the wheels are still turning when the suite is stalled though, so probably worth profiling to see exactly what is going on there, in case we are doing any unnecessary computation.

@dwsutherland - I'd expect the load to be similar on master if this is all about the isodatetime computations. The old state_summary bits shouldn't make much difference. And I don't think SoD will make much difference here as there are very few tasks in the suite

dwsutherland commented 3 years ago

Is this still a problem?

hjoliver commented 3 years ago

It probably is still a problem. We need to do some profiling.

oliver-sanders commented 2 years ago

The workflow results in a lot of different recurrences (note each TXX exclusion is a separate sequence). These get iterated over a fair bit.

Cylc used to scale very badly to the number of recurrences, here's an old PR which improved the situation, the tests might be helpful: https://github.com/cylc/cylc-flow/pull/2322