cylc / cylc-ui

Web app for monitoring and controlling Cylc workflows
https://cylc.github.io
GNU General Public License v3.0

efficiency: investigate bottleneck #1614

Open oliver-sanders opened 6 months ago

oliver-sanders commented 6 months ago

See also: https://github.com/cylc/cylc-uiserver/issues/547

This workflow has proven to be remarkably difficult for the UIS & UI to handle:

#!Jinja2

{% set members = 10 %}
{% set hours = 100 %}

[scheduler]
    allow implicit tasks = True

[task parameters]
    member = 0..{{members}}
    fcsthr = 0..{{hours}}
  [[templates]]
    member = member%(member)03d
    fcsthr = _fcsthr%(fcsthr)03d

[scheduling]
  initial cycle point = 2000
  runahead limit = P3
  [[xtriggers]]
    start = wall_clock(offset=PT7H15M)
  [[graph]]
    T00,T06,T12,T18 = """
        @start & prune[-PT6H]:finish => prune & purge
        @start => sniffer:ready_<member,fcsthr> => <member,fcsthr>_process? => finish
        <member,fcsthr>_process:fail? => fault
      """

[runtime]
    [[sniffer]]
        [[[outputs]]]
{% for member in range(0, members + 1) %}
    {% for hour in range(0, hours + 1) %}
            ready_member{{ member | pad(3, 0) }}_fcsthr{{ hour | pad(3, 0) }} = {{ member }}{{ hour }}
    {% endfor %}
{% endfor %}

For more information see: https://cylc.discourse.group/t/slow-load-of-cylc-workflows-disconnects/823/19
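For a sense of scale, here is a back-of-the-envelope count of what the parameter ranges above expand to. This is only a sketch: it counts the parameterised `_process` tasks in a single cycle (the `sniffer` task has one custom output per such task) and ignores `sniffer`, `finish`, `prune`, `purge` and `fault`. The function name is made up for illustration.

```python
def parameterised_tasks_per_cycle(members: int, hours: int) -> int:
    """Count <member,fcsthr>_process tasks in one cycle.

    Cylc parameter ranges such as 0..N are inclusive, so each range
    contributes N + 1 values (the Jinja2 loops use range(0, N + 1) to match).
    """
    return (members + 1) * (hours + 1)


# Values from the workflow above.
full = parameterised_tasks_per_cycle(members=10, hours=100)
print(full)  # 1111

# "hours" turned down to 20, as in the profiling experiment later in the thread.
reduced = parameterised_tasks_per_cycle(members=10, hours=20)
print(reduced)  # 231
```

With four cycle points per day and several cycles active at once under the runahead limit, the number of task nodes the UI has to hold is a multiple of these figures.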

Investigation so far has confirmed:

This issue focuses on the UI side of things.

Suggested remediation (UI only, please update with new suggestions):

oliver-sanders commented 6 months ago

IMO, the UI side of this issue is more concerning than the UIS side: a delay in the UIS puts load on the server, whereas a delay in the UI hits the user's browser.

The bulk of the time is taken by the data store processing the deltas, so this should be the first target for improvement. Profiling is required to highlight the problem areas. Given that the table view is only slightly faster to load than the tree view, family tree computation is unlikely to be the cause.
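To make "the data store processing the deltas" a little more concrete, here is a minimal Python sketch of the general pattern: a store keyed by node ID that merges incoming added, updated and pruned entries. The function name, the delta shape and the field names are all assumptions for illustration; the real store is JavaScript inside cylc-ui and is organised differently.

```python
from typing import Any

Node = dict[str, Any]


def apply_delta(store: dict[str, Node], delta: dict[str, Any]) -> None:
    """Merge one delta into the store in place.

    Hypothetical delta shape (an assumption, not the cylc-ui format):
      {"added": [node, ...], "updated": [node, ...], "pruned": [id, ...]}
    where every node is a dict carrying an "id" key.
    """
    for node in delta.get("added", []):
        store[node["id"]] = dict(node)
    for node in delta.get("updated", []):
        # Updates carry only the changed fields, so merge rather than replace.
        store.setdefault(node["id"], {}).update(node)
    for node_id in delta.get("pruned", []):
        store.pop(node_id, None)


store: dict[str, Node] = {}
apply_delta(store, {"added": [{"id": "2000/a", "state": "waiting"}]})
apply_delta(store, {"updated": [{"id": "2000/a", "state": "running"}]})
print(store)  # {'2000/a': {'id': '2000/a', 'state': 'running'}}
apply_delta(store, {"pruned": ["2000/a"]})
print(store)  # {}
```

Every branch does work per node, so in a scheme like this the cost of a delta grows with the number of tasks it carries. That is one reason a profile of the real store would be needed before deciding where to optimise.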

oliver-sanders commented 5 months ago

Profiling Experiments

1 - JS Profiling

Profile the time it takes to load the tree view for the workflow in the OP with hours turned down to 20. The workflow is started in paused mode.

Results:

The remainder appears to be vuejs.

2 - View Load Time

Open a view, then measure the time it takes to open the same view in a new workspace tab.

Manual timings to the nearest second:

3 - Component Loading

Start with the "simple tree" view and add in the components used by the regular "tree" view one by one, measuring the impact on load time for each.

Using these timings to extract the cost per component:

Note: These costs are for 1'000 tasks, e.g. 0.004s per <Task /> icon.
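The "add one component at a time" method amounts to differencing successive load times. A small sketch of that arithmetic follows; the function name is hypothetical and the numbers in the example are placeholders, not measurements from this issue.

```python
def marginal_costs(timings: list[tuple[str, float]]) -> dict[str, float]:
    """Turn cumulative load times into a cost per added component.

    `timings` is ordered: the first entry is the baseline view, and each
    later entry is the load time after adding one more component.
    """
    costs = {}
    for (_, previous), (name, current) in zip(timings, timings[1:]):
        costs[name] = current - previous
    return costs


# Placeholder numbers for illustration only, NOT measurements from this issue.
example = [("baseline", 1.0), ("+ component A", 1.5), ("+ component B", 3.5)]
print(marginal_costs(example))
# {'+ component A': 0.5, '+ component B': 2.0}
```

This assumes the costs are roughly additive, which may not hold if components interact (for example, if one component triggers re-renders in another).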

Conclusions

  1. The store is a little sluggish; we should look into possible optimisations.
  2. The real killer is the component count in the Tree view.
  3. There is potential for easy gains by simplifying the expand/collapse system.

Remediation:

oliver-sanders commented 5 months ago

The three optimisations put up so far make a reasonable dent in the CPU time.

The time is going into two places:

The data store time is more concerning than the view time. Views can be optimised (e.g. the table view reduces the number of nodes on screen by pagination, and the tree view could use a virtual scroller in the future to similar effect), but the data store time will always remain, so the store should be the main target of optimisation.
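For readers unfamiliar with the virtual scroller idea, the core of it is working out which rows fall inside the viewport and rendering only those. A minimal sketch of that calculation, assuming fixed-height rows; the function, its parameters and the pixel values are hypothetical and do not describe any particular Vue library or the cylc-ui implementation.

```python
def visible_window(scroll_top: float, viewport_height: float,
                   row_height: float, total_rows: int,
                   overscan: int = 2) -> range:
    """Return the indices of the rows that need to be rendered.

    `overscan` adds a few extra rows above and below the viewport so
    that fast scrolling does not briefly show blank space.
    """
    first = int(scroll_top // row_height) - overscan
    last = int((scroll_top + viewport_height) // row_height) + overscan
    first = max(first, 0)
    last = min(last, total_rows - 1)
    return range(first, last + 1)


# Hypothetical layout: 25px rows in a 600px viewport, scrolled down 5000px.
rows = visible_window(scroll_top=5000, viewport_height=600,
                      row_height=25, total_rows=1111)
print(len(rows), rows[0], rows[-1])  # 29 198 226
```

In this sketch only 29 of 1111 rows would be mounted at any time, which is the same effect pagination has in the table view. The store still has to process every node, which is the point made above.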