StackStorm / orquesta

Orquesta is a graph-based workflow engine for StackStorm. Questions? https://github.com/StackStorm/st2/discussions
https://docs.stackstorm.com/orquesta/
Apache License 2.0

remove some deepcopy to speed up workflow conductor #256

Open guzzijones opened 1 year ago

guzzijones commented 1 year ago
  1. Remove deepcopy where we can.
  2. Where a copy is still needed, replace deepcopy with a custom copy that uses a shallow copy instead, to increase speed (see the sketch after this list).
  3. Add some data classes (per a TSC meeting comment).
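
Not the actual diff from this PR, just a rough sketch of the idea in item 2: a helper that shallow-copies the top level of a value and only falls back to copy.deepcopy() for anything it cannot safely share. The name cheap_copy() is made up for illustration.

import copy


def cheap_copy(value):
    # Shallow-copy containers at the top level; nested values stay shared by
    # reference, so callers must treat them as read-only.
    if isinstance(value, dict):
        return dict(value)
    if isinstance(value, list):
        return list(value)
    # Anything else (object instances, etc.) still gets a real deep copy.
    return copy.deepcopy(value)
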
guzzijones commented 1 year ago

The use of 'latest' appears to be causing my tox unit tests to fail:

  orquesta.exceptions.PluginFactoryError: Unable to load plugin orquesta.composers.native. No 'orquesta.composers' driver found, looking for 'native'

I also had to disable python 3.6 for now.

guzzijones commented 1 year ago

@m4dcoder can you please take a look at this failing python 3.8 test? It looks like stevedore is not able to find the native and mock plugins to run the tests in python 3.8.

guzzijones commented 1 year ago

I did some testing with st2 and this definitely breaks with-items: failures are no longer reported to the parent workflow.

(virtualenv) [vagrant@stackstorm]~/st2/tools$ st2 execution get 64adc15887cff69baae67294
id: 64adc15887cff69baae67294
action.ref: test_pack.test_fail_workflow
parameters:
  number: 5
status: running (635s elapsed)
start_timestamp: Tue, 11 Jul 2023 20:53:44 UTC
end_timestamp:
log:
  - status: requested
    timestamp: '2023-07-11T20:53:44.168000Z'
  - status: scheduled
    timestamp: '2023-07-11T20:53:44.245000Z'
  - status: running
    timestamp: '2023-07-11T20:53:44.317000Z'
result:
  output: null
+--------------------------+------------------------+-----------+---------------------+-------------------------------+
| id                       | status                 | task      | action              | start_timestamp               |
+--------------------------+------------------------+-----------+---------------------+-------------------------------+
| 64adc1581982b0a045f91e70 | succeeded (0s elapsed) | do_action | test_pack.test_fail | Tue, 11 Jul 2023 20:53:44 UTC |
| 64adc1591982b0a045f91e85 | succeeded (0s elapsed) | with_test | test_pack.test_fail | Tue, 11 Jul 2023 20:53:45 UTC |
| 64adc1591982b0a045f91e88 | succeeded (1s elapsed) | with_test | test_pack.test_fail | Tue, 11 Jul 2023 20:53:45 UTC |
| 64adc1591982b0a045f91e8b | failed (1s elapsed)    | with_test | test_pack.test_fail | Tue, 11 Jul 2023 20:53:45 UTC |
| 64adc1591982b0a045f91e8f | succeeded (1s elapsed) | with_test | test_pack.test_fail | Tue, 11 Jul 2023 20:53:45 UTC |
| 64adc1591982b0a045f91e92 | succeeded (1s elapsed) | with_test | test_pack.test_fail | Tue, 11 Jul 2023 20:53:45 UTC |
+--------------------------+------------------------+-----------+---------------------+-------------------------------+
guzzijones commented 1 year ago

Staged tasks need to be copied, or with-items tasks will remain stuck if one sub-action fails.
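
A minimal illustration of the aliasing hazard (using a made-up record shape, not orquesta's actual staged-task structure):

import copy

staged_task = {"id": "with_test", "item_status": "running"}

alias = staged_task                       # no copy, just another reference
alias["item_status"] = "failed"
assert staged_task["item_status"] == "failed"   # the shared record was mutated too

independent = copy.copy(staged_task)            # a flat dict only needs a shallow copy
independent["item_status"] = "running"
assert staged_task["item_status"] == "failed"   # the original is untouched this time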

guzzijones commented 1 year ago

I was actually able to get json_8mb.json through this workflow with this change, along with the zstandard compression changes over in the st2 PR queue.

st2 run test_pack.test_workflow @json=/home/vagrant/json_8mb.json

action


from st2common.runners.base_action import Action

class Process(Action):
    def run(self, *args, **kwargs):
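        # Return the (potentially large) json parameter unchanged as the action result.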
        return kwargs["json"]

workflow

version: 1.0

tasks:
  do_action:
    action: test_pack.test_action
    input:
      json: <% ctx().json %>
    next:
      - when: <% succeeded() %>
        publish:
          - bar: [1,2,3]
        do:
          - with_test
  with_test:
    with:
      items: <% ctx().bar %>
      concurrency: 1
    action: test_pack.test_action
    input:
      json: <% ctx().json %>
    next:
      - when: <% succeeded() %>
        publish:
          - with_output: <% result().result %>
input:
  - json

output:
  - with_output: <% ctx().with_output %>

output

+--------------------------+-------------------------+-----------+-----------------------+--------------------+
| id                       | status                  | task      | action                | start_timestamp    |
+--------------------------+-------------------------+-----------+-----------------------+--------------------+
| 64af43382c572b699fca6ea5 | succeeded (68s elapsed) | do_action | test_pack.test_action | Thu, 13 Jul 2023   |
|                          |                         |           |                       | 00:20:06 UTC       |
| 64af43d42c572b699fca6eb6 | succeeded (74s elapsed) | with_test | test_pack.test_action | Thu, 13 Jul 2023   |
|                          |                         |           |                       | 00:22:43 UTC       |
| 64af44722c572b699fca6ec7 | succeeded (79s elapsed) | with_test | test_pack.test_action | Thu, 13 Jul 2023   |
|                          |                         |           |                       | 00:25:20 UTC       |
| 64af45122c572b699fca6ed8 | succeeded (68s elapsed) | with_test | test_pack.test_action | Thu, 13 Jul 2023   |
|                          |                         |           |                       | 00:28:01 UTC       |
+--------------------------+-------------------------+-----------+-----------------------+--------------------+
guzzijones commented 1 year ago

I was able to get the st2 unit tests to also pass. I had to leave in the deep copy for the errors. I am testing this locally on our dev box.

Kami commented 1 year ago

Do we have any numbers on how much things are sped up by this change? Ideally, some micro + end-to-end benchmarks would be great (similar to what I've done for the DB layer changes).

When I was originally working on the database layer performance improvements and I implemented fast_deepcopy(), I also had a quick look if we could utilize it in Orquesta, but I found some blockers and Orquesta wasn't my priority back then so I didn't spend more time digging in.

IIRC, it was something related to complex types used by the state handling, since fast_deepcopy() only supports simple types, aka no object instances.
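
Roughly, the idea behind a fast_deepcopy-style helper is a JSON round trip with a deepcopy fallback. This sketch uses only the standard library and is not the actual st2 implementation:

import copy
import json


def fast_deepcopy(value):
    # JSON round trip: fast for plain dicts/lists/strings/numbers, but lossy or
    # impossible for tuples, sets, non-string keys, and object instances.
    try:
        return json.loads(json.dumps(value))
    except (TypeError, ValueError):
        # Complex types (e.g. the object instances in the conductor state)
        # still need a regular deep copy.
        return copy.deepcopy(value)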

If we can get those sorted out without any issues and the numbers indeed indicate performance improvements, then it would be a great change :)

Kami commented 1 year ago

On a related note to my comment above^ - I think it would be good to add a feature flag / config option for this functionality.

This will serve multiple purposes:

1) We will be able to use the feature flag to benchmark both scenarios - regular deepcopy vs fast_deepcopy.
2) In case a user encounters issues with fast_deepcopy() which we didn't catch here, they can easily opt out of this functionality and continue using StackStorm without needing to downgrade or similar (a sketch of such a config option follows below).
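
Purely as a hypothetical sketch of what that could look like on the st2 side, using oslo.config (the option name and group below are made up and do not exist in st2 today):

from oslo_config import cfg

# Hypothetical feature flag; neither the option nor this wiring ships with st2.
fast_copy_opts = [
    cfg.BoolOpt(
        "use_fast_deepcopy",
        default=False,
        help="Use the faster copy strategy in the workflow conductor "
        "instead of copy.deepcopy().",
    ),
]

cfg.CONF.register_opts(fast_copy_opts, group="workflow_engine")

# Call sites could then pick the copier based on the flag, e.g.:
# copier = fast_deepcopy if cfg.CONF.workflow_engine.use_fast_deepcopy else copy.deepcopy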

guzzijones commented 1 year ago

We won't be able to add a feature flag without a change in ST2. Orquesta does not have the StackStorm config object afaik. I will put some benchmarks in the PR and also make sure we double check that errors is also copied.

Kami commented 1 year ago

> I will put some benchmarks in the PR and also make sure we double check that errors is also copied.

Thanks.

> We won't be able to add a feature flag without a change in ST2. Orquesta does not have the StackStorm config object afaik.

Yeah, I assume it would probably also require a change on the StackStorm side. If you have cycles, I still think that would be a good idea.

Still much easier to flip the feature flag / config option in case something goes wrong versus needing to downgrade to a previous release.

Especially if it's combined with a breaking backward incompatible change in StackStorm core (live action FK change - there is only a forward migration in that PR, but not a backward one).

rebrowning commented 9 months ago

I've been looking into some slowness for workflows with a large number of items inside a with-items step. I saw this PR and I'm curious: which parts are we actually looking to gate behind a feature flag in this PR?

guzzijones commented 9 months ago

The task render speed gain from removing deep copies of large parameters.

guzzijones commented 9 months ago

I added a micro benchmark to show how much faster it is to NOT do a deep copy. I could add one for serialize and deserialize, but the results would be the same. Copying large objects is costly.
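
Something along these lines (not the benchmark added in this PR) shows the gap on a large payload:

import copy
import timeit

# A large, nested payload roughly standing in for a big workflow context.
payload = {"items": [{"index": i, "data": "x" * 1024} for i in range(5000)]}

deep = timeit.timeit(lambda: copy.deepcopy(payload), number=10)
shallow = timeit.timeit(lambda: dict(payload), number=10)

print("deepcopy:      %.3fs for 10 copies" % deep)
print("shallow copy:  %.6fs for 10 copies" % shallow)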

guzzijones commented 9 months ago

All the python 3.9 tests passed with these updates (minus the lint checks). The RPM is here if @rebrowning wants to test it out.

guzzijones commented 9 months ago

I am running this additional speed-up on our internal fork of st2.
I will report back if I see any issues.

It looks like CI/CD failed because black's formatting changed with their latest release.

rebrowning commented 7 months ago

@Kami @guzzijones are we requiring the feature flag to switch between deepcopy and fast_deepcopy for this to move forward?

FileMagic commented 3 months ago

Note: this will help with the "with items" speed issues.

guzzijones commented 1 month ago

@nzlosh @cognifloyd please take a peek at this. It adds significant speed improvements by eliminating some deep copies. This helps with larger context objects.