boomerang-io / community

The Boomerang community, roadmap, planning, and architecture repository. The central place for information on joining, contributing, and governance.

https://useboomerang.io

Apache License 2.0

11 stars 0 forks source link

Produce events for status changes of a workflow activity #326

Open corcoja opened 2 years ago

corcoja commented 2 years ago

Produce events for status changes of a workflow activity

Feature request

Description

As part of eventing, Flow will produce CloudEvents when the status of a workflow activity changes. These CloudEvents will be published to either a dedicated NATS Jetstream Stream or in the same Stream as for workflow triggers and WFE events (see suggestion below).

Solution

NATS Jetstream properties

Flow is configured to use a Stream with a name that gets its value from a Spring property eventing.jetstream.stream.name and a subject with a value from eventing.jetstream.stream.subject. These two properties are defined in the configmap of the Helm Chart as follows:

eventing.jetstream.stream.name={{ .Release.Name }}-event-stream
eventing.jetstream.stream.subject={{ .Release.Name }}.event.cloudevent

Event categories

One suggestion would be to use the same Stream for storing Flow-related CloudEvents, but to classify the types of CloudEvents into 3 categories (with the possibility to add more categories as needed):

{{ .Release.Name }}.trigger - subject for trigger events (for listener service)
{{ .Release.Name }}.wfe - subject for WFE-related events
{{ .Release.Name }}.activity.statuschange - subject for workflow activity status change events
- This can be potentially extended to a subject-per-status naming convention, e. g.: {{ .Release.Name }}.activity.statuschange.success for workflows that succeeded, {{ .Release.Name }}.activity.statuschange.inprogress for workflows that are in progress, etc.

With the above subject categories, there will be 2 required NATS Jetstream Consumers for Flow and as many as needed Consumers for external services that would like to consume CloudEvents when the status of a workflow activity changes:

Required Consumer for consuming trigger events with a filter on {{ .Release.Name }}.trigger.> (see https://docs.nats.io/nats-concepts/subjects#wildcards for more information on > sign and wildcards)
Required Consumer for consuming WFE-related events with a filter on {{ .Release.Name }}.wfe.>
Optional one/many Consumers for consuming activity status change events with a filter on {{ .Release.Name }}.activity.statuschange.>

NOTE: Classifying event types into multiple categories (as described above) will not be backwards compatible with current version of eventing in Flow.

Workflow activity statuses

Status update CloudEvents will be produced for the following activity statuses (please correct if any statuses are missing):

In Progress
Completed
Failure
Invalid
Waiting
Cancelled

CloudEvent for activity status change

When the status of a workflow changes, the published message to the NATS Jetstream will be a serialized UTF-8 encoded JSON text representing a CloudEvent (identically to trigger and WFE events).

Status change CloudEvent attributes (as recommended by NATS Protocol Binding for CloudEvents):

Attribute Name	Attribute Description
id	Event unique identifier
specversion	Specification version
subject	Message subject (can be also retrieved by the Consumer)
type	The type/category of the event (can also be derived from the subject)
source	Source of the event (can be set to the name of service that generated this event)
time	Time when the event was produced
datacontenttype	Media-type expression
data	Additional data that is encoded as declared by the `datacontenttype` attribute

Inside the data attribute we can add information about the workflow, execution and the new status. Example:

{
  "id": "db36ed0c-4cc6-44c6-86da-87bfe822bd00",
  "specversion": "1.0",
  "subject": "bmrg-flow.activity.statuschange.609be8a58ac40f7b1bcb425e",
  "type": "activity-status-change",
  "source": "service.controller",
  "datacontenttype": "application/json",
  "time": "2022-01-26T21:43:00Z",
  "data": {
    "workflowId": "609be8a58ac40f7b1bcb425e",
    "activityId": "61f1c4638dc16068c15c5ef4",
    "status": "success",
    "labels": {
      ...omitted...
    }
  }
}

Benefits

By providing CloudEvents whenever the status of a workflow changes, external services like Core can consume these kind of messages to be aware when a certain workflow has either completed its execution or has failed. A more specific example is Request Processing that is part of Admin – certain requests are backed by workflows. Once the execution of these workflows has been completed, Admin service must be notified in order to continue with the request.

Besides all above, I think it's just a cool feature to have! 😎

tlawrie commented 2 years ago

Hi @corcoja thanks for this enhancement. It definitely is a cool feature. Being able to produce events for changes has been an item @marcusdroy and I have wanted for a long time.

Let me dive into the detail over the next few days and come back with some questions. I also want to see if we can learn anything from the way that Tekton publishes its events.

corcoja commented 2 years ago

Hey @tlawrie, a few concerns that @popagruia outlined:

Publishing activity status changes might potentially present a security leak, since these messages can be consumed by anyone with access to NATS. This won't be an issue if NATS will be used internally only, but if externalized, then we probably need to look into ways to segregate activity status changes CE (e. g. have dedicated streams for each team).
If a certain service wants to consume messages just for one particular execution (activityId) of a workflow (workflowId), with the above design that would mean that it needs to consume all the messages until it gets to the ones with the activityId. This is computationally heavy and unnecessary. One way to solve this is to add the workflowId and/or activityId to the subject of the published messages:
- bmrg-flow-activity.statuschange.609be8a58ac40f7b1bcb425e.61f1c4638dc16068c15c5ef4. This way, it is possible to create consumers with a filter to consume only messages that one is interested in.

corcoja commented 2 years ago

@tlawrie, a couple more suggestions from @marcusdroy:

For each workflow add the option to either enable or disable eventing. By enabling, Flow it will start generating CE for activity status changes just for that specific workflow. It would limit the number of CE generated but won't prevent these to be consumed by anyone with access to NATS. I think it would be acceptable to go this way (at least for the first version of this feature) as long as we add a disclaimer that the generated CE are public and can be consumed by anyone with access to NATS.
Instead of adding an eventing option on workflow configuration, add it to the execution. This option can be enabled either from UI of the app or sent as an attribute to the webhook/listener. When enabled, Flow will generate activity status changes just for this workflow execution.

Thoughts?

tlawrie commented 2 years ago

Hi @corcoja just closing out Flow 3.6.1 and then ill be able to make a proper review. Sorry for the delay.

corcoja commented 2 years ago

Hey @tlawrie! Have you had the time to look into the request and make a review?

tlawrie commented 2 years ago

Hi @corcoja I've delved into this PR and I apologise for the latest and if this is a long review. Also happy to have a call about this or discuss on the community working group

To start with, I wholeheartedly agree with this comment 😃

Besides all above, I think it's just a cool feature to have! 😎

Notes Of first note, depending on how much this changes the incoming event format and backwards compatibility, we may have to alpha this feature or release as part of 4.0.0 which currently has broader re-architecture plans as well.

Secondly, I did a bit of research using Tekton, Jenkins, Knative, Google, Kubernetes, and CloudEvents as my benchmark.

Comments

With respect to producing events that are consumable by anyone and may pose a security risk, I propose that w.r.t getting this feature to beta, for now, we have this as a System Setting for Administrators. This is similar to what other systems have done such as TektonCD and Jenkins.
When we add to System Settings as a feature flag we also add in the type of events. See the Jenkins example of selecting the events to produce.
Propose creating a linked backlog task for investigating a design for future team-based streams etc (potentially allow configuration on a workflow for a specific sink and that way it's up to the end system to configure the sink URL per workflow or team (preference). And maybe this is on top of a system-wide sink that can send all events.
Happy for Workflows to 'opt-in' as to whether to produce events, however, I am not sure whether end users will know what this means, especially as with (1) and (2) above it's becoming more of an administration/system based functionality for now. I also think adding the alternative option of a task that does the producing, however this moves away from how other systems are implementing it. Regardless, I suggest we create a new task in the backlog for future consideration for how Workflow Specific Event Producing is handled and put both options in that task to be discussed in future.
In terms of the structure, there are a few caveats/updates that we should consider here and I think our baseline is the CloudEvent spec but also standard practice.
- Type: "SHOULD be prefixed with a reverse-DNS name". I propose we use io.boomerang.events.xyz or io.boomerang.event.xyz (plural vs non) which is different to what we currently consume of io.boomerang.eventing.xyz
- Subject: we already have a format of /workflowId/topic or /<workflowId>/<workflowActivityId>/<topic> which I think still works here. Although happy to change this to the style Azure Event Grid uses which would be /workflow/<workflowId>/activity/<workflowActivityId>/topic/<topic>
- Source: we follow what Kubernetes/Tekton do and make it specific to the object type or path for an API driven system i.e. /apis/v1/activity or /apis/v1/events or even more granular /apis/v1/events/workflow/activity/status etc
- Can we filter on type instead of the subject? as it's more common to use the type and subject as we do above.
```
{
"specversion": "1.0",
"type": "io.boomerang.event.workflow.activity.status",
"subject": "/workflow/5f7e2c8969f04975a0fff357/activity/6010b41bbadf2e7743e03324",
"source": "/apis/v1/activities",
"id": "C234-1234-1234",
"time": "2018-04-05T17:31:00Z",
"datacontenttype": "application/json",
"data": {
...make same as object we use in the APIs whether workflow or activity etc...
}
}
```
In terms of types of events we want to produce, I propose we keep them to a minimal subset for now as follows and then we can add more objects or types other than status changes.
- Workflow Status Change: Created, Updated, Deleted
- Workflow Activity Status Change: NotStarted (we should change this to queued), In Progress, Waiting, Cancelled, Failed, Completed (this means succeeded), Invalid.
We could optional extend the ce fields with a labels field (both for binary and structured) although again i think this is a future enhancement

Summary

Creation of a system-wide event producing in raw direct to NATS (no sink URL needed that handles the post)
Start with producing workflow and workflow activity status change events
Future consideration of Team or Workflow based event producing to a configurable sink separate to the system-wide configuration
Future consider of adding different types of endpoints to receive the system-wide events i.e. Sink URL and Sink Type (NATS|HTTP|KAFKA|etc)

Backwards compatability To be able to release this in alpha format prior to 4.0.0 I suggest we:

Use a feature flag
Support type of io.boomerang.event and io.boomerang.eventing until 4.0.0 but mark io.boomerang.eventing deprecated and to be removed in 4.0.0. This change is for both consuming and producing events.
Support subject of /workflow/123/activity/123 and /123/123 until 4.0.0 but mark /123/123 as deprecated and to be removed in 4.0.0. This change is for both consuming and producing events.
I don't think there is harm or backwards compatibility in changing the source field

tlawrie commented 2 years ago

Also for future testing, it would be interesting to see if we can use this for visual look of the events -> https://github.com/ruromero/cloudevents-player

corcoja commented 2 years ago

Hi @tlawrie, thank you for the thorough analysis. I like the idea of having the eventing option added to the System Setting for administrators, so let's go this way.

Feedback on your comments

With respect to producing events that are consumable by anyone and may pose a security risk, I propose that w.r.t getting this feature to beta, for now, we have this as a System Setting for Administrators. This is similar to what other systems have done such as TektonCD and Jenkins.

Having administrators to enable or disable eventing, as well as what kind of events are generated is great! My only comment here is that we should be really specific in the docs or popup boxes what eventing does, how it works and how it should be configured for another systems to communicate with Flow through CE in NATS.

When we add to System Settings as a feature flag we also add in the type of events. See the Jenkins example of selecting the events to produce.

👍🏻

Propose creating a linked backlog task for investigating a design for future team-based streams etc (potentially allow configuration on a workflow for a specific sink and that way it's up to the end system to configure the sink URL per workflow or team (preference). And maybe this is on top of a system-wide sink that can send all events.

Sounds good. I might look into that once we finish with the first iteration of events for workflow status updates.

Happy for Workflows to 'opt-in' as to whether to produce events, however, I am not sure whether end users will know what this means, especially as with (1) and (2) above it's becoming more of an administration/system based functionality for now. I also think adding the alternative option of a task that does the producing, however this moves away from how other systems are implementing it. Regardless, I suggest we create a new task in the backlog for future consideration for how Workflow Specific Event Producing is handled and put both options in that task to be discussed in future.

I would proceed with eventing being managed from the administrator's System Settings. For future releases we can look further into a more granular CE generation.

In terms of the structure, there are a few caveats/updates that we should consider here and I think our baseline is the CloudEvent spec but also standard practice.

Type: "SHOULD be prefixed with a reverse-DNS name". I propose we use io.boomerang.events.xyz or io.boomerang.event.xyz (plural vs non) which is different to what we currently consume of io.boomerang.eventing.xyz.

Yes, this does make more sense. Happy to make the type as suggested with the non-plural variant.

Source: we follow what Kubernetes/Tekton do and make it specific to the object type or path for an API driven system i.e. /apis/v1/activity or /apis/v1/events or even more granular /apis/v1/events/workflow/activity/status etc.

/apis/v1/activity is too specific to activities. Published and consumed CEs by Flow will include triggers as well WFE, so /apis/v1/events I think fits best for our scenario. The more granular option contains too much redundant information, since you can derive a certain CE is for an activity status change by the looking at the subject.

Subject: we already have a format of /workflowId/topic or /<workflowId>/<workflowActivityId>/<topic> which I think still works here. Although happy to change this to the style Azure Event Grid uses which would be /workflow/<workflowId>/activity/<workflowActivityId>/topic/<topic>.

Can we filter on type instead of the subject? as it's more common to use the type and subject as we do above.

The reason I proposed 3 subject categories is that we need a logic when CE messages are stored into a Jetstream Stream. I agree to keep the subject field within the CE payload as-is, however we need to decide on the NATS message subject, which is different (https://docs.nats.io/nats-concepts/subjects).

For filtering messages, NATS offers only subject-based filtering. There is no possibility to filter messages based on the information inside the CE payload, as this requires unmarshalling the message, looking up fields of interest and then comparing the results – too CPU intensive. We could of course leave the CE payload-based filtering on the client side (that is interested in consuming messages), but this would be non-optimal for them if, for example, they are interested only in completed activity status events, since this would require to read all the messages generated by Flow just to find the ones they need.

I do like the topic-based filtering though, so we could use this as the NATS message subject (notice I'm using the . as per NATS naming convention):

For trigger/webhook: io.boomerang.event.workflow.trigger
For WFE: io.boomerang.event.workflow.wfe
For activity status change: io.boomerang.event.workflow.activity.status

Flow Eventing - NATS - 3 Streams

Notice that in the above diagram, COMPLETED-STATUS and MONITOR Consumers are created by the client. These are put there just for demonstration purposes on how a certain client could consume messages from Flow.

Two more things to make filtering even better:

Add the new status at the end of the subject, e. g.: io.boomerang.event.workflow.activity.status.completed
If a workflow was triggered by a CE in NATS, i. e. someone published a CE with subject io.boomerang.event.workflow.trigger, get the suffix from that subject and append it at the end of all the produced CE for status updates. Example:
1. A client is interested only in consuming CE for completed workflows that were triggered by him.
2. The client publishes a CE with the following subject: io.boomerang.event.workflow.trigger.cool.client.
3. The client also creates a Jetstream push-based consumer connected to STATUS_UPDATE stream with subject filter set to: io.boomerang.event.workflow.activity.status.completed.cool.client.
4. When the workflow that was triggered by the client changes its status to completed, the CE associated to this event will be pushed to the client. No other completed status updates CE will be pushed to the this client unless these will have the prefix cool.client.

In terms of types of events we want to produce, I propose we keep them to a minimal subset for now as follows and then we can add more objects or types other than status changes.

Workflow Status Change: Created, Updated, Deleted

Workflow Activity Status Change: NotStarted (we should change this to queued), In Progress, Waiting, Cancelled, Failed, Completed (this means succeeded), Invalid.

Agreed for both. As for the first iteration, our focus will be on Workflow Activity Status Change since this is a feature requested by Core. Afterwards, we can also implement the Workflow Status Change.

Backwards compatibility

Changing the payload indeed won't affect much the implementation of eventing as-is.
Changing the NATS message subjects as described above won't be compatible with the current implementation of eventing, unless we create another stream and consumer to support current functionality.

timrbula commented 2 years ago

WIP but getting close to being done
Existing and future PRs to be merged that impact multiple services

Update from @corcoja

timrbula commented 2 years ago

@corcoja is this corrected scheduled for release in Flow 4.0.0?

corcoja commented 2 years ago

@timrbula, yes, the target release for Eventing is 4.0.0. I have completed the work on the main codebase for eventing and it can be already released as a beta feature. There are a few more outstanding items that I'm still working on:

Control Eventing from Settings (on/off toggle).
Update the models in service.listener.
Small improvements that @AndreiOla is working on.
More testing.
Docs.

tlawrie commented 2 years ago

I'm not sure if this would be in 4.0.0, as that was meant to be the re-architecture release and the eventing changes were going to be backwards compatible and could go into a 3.x.x release.

Is that still the case? or are the changes looking rather far reaching?

corcoja commented 2 years ago

Is that still the case? or are the changes looking rather far reaching?

Since the upgrade of CloudEvents to v3, the changes are not backwards compatible. We are making use of CloudEvents extensions which are crucial for Eventing functioning.

tlawrie commented 2 years ago

HI @corcoja what part of the event is not backwards compatible though? they should still be able to be processed without the extensions when coming in through the listener? And the event structure shouldn't have changed even with the upgrade to v3.

corcoja commented 2 years ago

And the event structure shouldn't have changed even with the upgrade to v3.

We did make significant changes to the structure since the information provided by CE is richer and contains space for future expansions.

The idea was to split CE into two dedicated categories (for now at least, other categories can be added):

Action CE.
Status updates CE.

The category and type of the CE can be identified by parsing the subject field from the payload. We followed the CE guidelines and added just the bare minimum required for new version of Eventing.

Action category

Trigger

This CE is consumed by Flow and will trigger a workflow if the payload is correct and data is valid.

{
    "id": "552c2076-e481-418f-ac5c-cbe4cbdf4cc1",
    "type": "io.boomerang.eventing.action.trigger",
    "source": "/postman/local",
    "specversion": "1.0",
    "datacontenttype": "application/json",
    "subject": "/<workflowId>/<topic>",
    "token": "<workflow_token>",
    "time": "2021-07-30T22:34:26.799Z",
    "initiator_id": "core",
    "initiator_context": "L2FjdGlvbl9pZC9hY3Rpb25fdHlwZS9jYXRhbG9nX2l0ZW1faWQ=",
    "data": {
      "prop1": 2,
      "prop2": "Text!"
    }
}

Wait for event

Also consumed by Flow and will complete/end a WFE task with the provided status only if the payload is correct and data is valid.

{
    "id": "552c2076-e481-418f-ac5c-cbe4cbdf4cc1",
    "type": "io.boomerang.eventing.action.wfe",
    "source": "/postman/local",
    "specversion": "1.0",
    "subject": "/<workflowId>/<workflowActivityId>/TestWFEheartbeat",
    "token":"<workflow_token>",
    "status": "success",
    "time": "2021-07-30T22:34:26.799Z"
}

Cancel

Consumed by Flow, will cancel a running Workflow only if the payload is correct and data is valid.

{
    "id": "552c2076-e481-418f-ac5c-cbe4cbdf4cc1",
    "type": "io.boomerang.eventing.action.cancel",
    "source": "/postman/local",
    "specversion": "1.0",
    "subject": "/<workflowId>/<workflowActivityId>",
    "token":"<workflow_token>",
    "time": "2021-07-30T22:34:26.799Z"
}

Status update category

Workflow status update

These types of CE will be produced by Flow when the status of a Workflow changes. External clients will have the possibility to create consumers (with or without filters) and consume these type of messages.

{
  "specversion": "1.0",
  "id": "b1e6b2f0-5514-4ade-874e-a6c8802cad5c",
  "source": "io.boomerang.flow",
  "type": "io.boomerang.eventing.status_update.workflow",
  "datacontenttype": "application/json; charset=UTF-8",
  "subject": "/<workflowId>/<workflowActivityId>/<status>",
  "time": "2022-06-07T15:27:48.488Z",
  "data": {
    "workflowid": "<workflowId>",
    "workflowactivityid": "<workflowActivityId>",
    "status": "<status>",
    "outputProperties": [
      {
        "task1.key1": "value1"
      },
      {
        "task1.key2": "value2"
      },
      {
        "task2.key3": "value3"
      }
    ]
  }
}

Status updates CE for a workflow will be produced only when the workflow status changes to one of these:

In progress
Invalid
Completed
Failure
Cancelled

Task status update (not implemented yet)

As proposed in https://github.com/boomerang-io/flow.service.workflow/pull/209#issuecomment-1187312052. These types of CE will be produced by Flow when the status of a Task changes. Same as above, external clients will have the possibility to create consumers and consume these type of messages as needed.

{
  "specversion": "1.0",
  "id": "b1e6b2f0-5514-4ade-874e-a6c8802cad5c",
  "source": "io.boomerang.flow",
  "type": "io.boomerang.eventing.status_update.task",
  "datacontenttype": "application/json; charset=UTF-8",
  "subject": "/<workflowId>/<workflowActivityId>/<taskId>/<status>",
  "time": "2022-06-07T15:27:48.488Z",
  "data": {
    "workflowid": "<workflowId>",
    "workflowactivityid": "<workflowActivityId>",
    "taskid": "<taskId>",
    "status": "<status>",
    "outputProperties": {
      "task1.key1": "value1"
    }
  }
}

Status updates CE for a task will be produced only when the task status changes to one of these:

In progress
Invalid
Completed
Failure
Cancelled
Waiting

gchickma commented 2 years ago

@corcoja Is the intent to control the proposed Task status update process by task type, i.e. a new attribute on the task template? Or would all tasks types generate, at a minimum, In progress and Completed CE's? If there's no concern wrt the volume of messages then I think this proposal will satisfy our internal requirements.

corcoja commented 2 years ago

@gchickma The intent is to produce CE for all types of tasks with at least In Progress and Completed/Failure statuses.

If there's no concern wrt the volume of messages then I think this proposal will satisfy our internal requirements.

I don't see any concerns here – NATS Streams will be configured to hold a certain amount of messages and once full, oldest messages will be discarded. The average size of trigger CE with 3 properties is around 560 characters, i. e. 560 bytes. Adding additional NATS message metadata, let's assume that the size will increase to 1 KB. Having a space limit for a Stream set to 1 GB, we could store at most 1.000.000 messages. If within a day Flow will produce approx. 10.000 messages, the oldest message will be discarded after 100 days, which is more than enough time for external clients to consume this message.

gchickma commented 2 years ago

@corcoja Ok, so then I tender my support for the new Task status update proposal

@marcusdroy @tlawrie Thoughts, concerns?

corcoja commented 2 years ago

@tlawrie @marcusdroy @gchickma

We also want to introduce with Eventing the possibility to control from the Settings screen what type of messages will be produced by Flow, see screenshot:

Enable eventing – this toggle will control whether eventing is enabled or disabled globally. Disabling it will also disable the communication between service.workflow and service.listener via NATS server and fallback to HTTP calls.
Produce status update events for workflows – if enabled, Flow will produce status update CEs for workflows. If disabled, no status updates will be produced, but Flow will still consume CEs from action category.
Produce status update events for tasks – idem, but for tasks.

Please not that if "Enable eventing" is disabled, the toggle status of "Produce status update events for workflows" and "Produce status update events for tasks" will be ignored and considered as disabled.

gchickma commented 2 years ago

@corcoja I think the above proposed Settings make sense. Perhaps in future iterations we may need a more granular approach to enabling messages for each type (Workflow or Task) by Status type (Not Started, In Progress, etc) but for a first iteration I agree with this proposal.

tlawrie commented 2 years ago

Hi @corcoja some great updates here and thank you for the detailed breakdown. It helps with context and understanding of the incentive from IBM.

✅ Status CE I think Status Events are a great addition and always part of the roadmap.

✅ Feature Flags Sounds good. I think long term we want to be able to have the CE Events be redirected to different Syncs that teams might set up. However, feature flags as a start are great.

I don't think having the flag to disable the eventing between Workflow and Listener is a problem. Longer term, we want to merge Listener into the API and separate it from Workflow into a new service.

❓ Event Type Prefix Based on prior discussion, I believe we are moving to the non-plural io.boomerang.event.xyz from io.boomerang.eventing

❓ Event Type Categories / Groupings I am not entirely certain if we need the Action category. Is the main aim to group trigger and WFE? problem is that Action is not terminology in Flow other than Manual Actions. And its a loaded term in various workflow solutions.

Can we leave them as Trigger (currently we use Custom) and WFE as they are the different parts of the system? and WFE a Workflow concept? I guess that leaves Cancel without a real grouping 🤔

I propose that we go io.boomerang.event.workflow.cancel

Type	Description
`io.boomerang.event.workflow.cancel`	Cancels a running Workflow
`io.boomerang.event.workflow.trigger`	Triggers a Workflow
`io.boomerang.event.workflow.wfe`	Sends a Wait for Event

❓ Event Type Status Updates Following on from the above, previously we had documented io.boomerang.event.workflow.activity.status as the type. I note above its shown as io.boomerang.eventing.status_update.task

The reference format from Tekton is dev.tekton.event.taskrun.successful would that also work here?

We would then have io.boomerang.event.workflow.status.<status> io.boomerang.event.task.status.<status>

The above is still reasonably filterable. The only major difference is that we use the terminology 'activity' instead of 'run' so I guess it could also be io.boomerang.event.activity.workflow.status.<status> or io.boomerang.event.workflow.activity.status.<status> io.boomerang.event.activity.task.status.<status> or io.boomerang.event.task.activity.status.<status>

My preference is the io.boomerang.event.workflow.status.<status> format

Type	Description
`io.boomerang.event.workflow.status.<status>`	Emits workflow status updates.
`io.boomerang.event.task.status.<status>`	Emits task status updates.

❓ Event Subject Changes Currently we have the /workflow/5f7e2c8969f04975a0fff357/activity/6010b41bbadf2e7743e03324 as the subject that includes a workflowId and activityId, is this new proposal to adjust this to /workflow/5f7e2c8969f04975a0fff357/6010b41bbadf2e7743e03324.

I think the previous decision was:

Subject: we already have a format of /workflowId/topic or /// which I think still works here. Although happy to change this to the style Azure Event Grid uses which would be /workflow//activity//topic/

❓ Status Event Source Field In one of the above examples its show as "source": "io.boomerang.flow"

I think from our previous decisions its to be URI path based

Source: we follow what Kubernetes/Tekton do and make it specific to the object type or path for an API driven system i.e. /apis/v1/activity or /apis/v1/events or even more granular /apis/v1/events/workflow/activity/status etc

With the decision being /apis/v1/events with more context as needed.

✅ Cancel Event This looks like a good addition. Only dependent on the Type Grouping decision.

❓ Token Element Is this meant to be the AuthN/AuthZ token? if so, isn't that part of the header or URL params? as access_token or Authorization

❓ Backwards Compatability What are the major differences in the Trigger and WFE payloads? What is deprecated or changed? I am open to not bothering with backwards compatibility for this change.

❓ Timeline / Change Radius With my time working with the Tekton team on TEPs, the ultimate aim is to always

Identifying the smallest change that can be made that sets it up / allows the future work - landing this smallest changes helps get tangible feedback to influence the direction of future work
Finding a way to experiment with the different options, if it's not too costly, instead of weighing them theoretically - Feature Flags

With that in mind, I am wondering, rather than big bang, whether we can start getting the Workflow CE changes in (once we all come to agreement on the changes) and then release it as an alpha flag? Or what timeframe are we looking at? I am just concerned at the big bang nature of the change.

[New] API Context Change I propose that as part of this change, we merge events into /apis/v1/events (which in itself should be /api - that we would definitely make backwards compatible).

I've been wondering when we make this change, as ultimately we would want the API and Front End services separate from the future combined Workflow Execution code + Controller. Ultimately the next gen architecture was still two microservices

API + Front End supporting services
Controller + Execution

I think we could start on this with the removal of the listener and merging that into the Workflow API service. And then eventually shuffle execution into controller.

[New] Emitting Events Non Blocking When we emit the events are we using threads? Essentially in a parallel routine to allow for retries without blocking the execution.

tlawrie commented 2 years ago

Hi @corcoja checking in on the above thoughts.

corcoja commented 2 years ago

Hey @tlawrie, see my comments:

❓ Event Type Prefix

Based on prior discussion, I believe we are moving to the non-plural io.boomerang.event.xyz from io.boomerang.eventing.

Not an issue, this can be easily changed, however, the feature itself is still named Eventing right? (for docs purposes)

❓ Event Type Categories / Groupings

I'm with you on event types. To summarize, the updated eventing will have the following event types:

Type	Description
`io.boomerang.event.workflow.cancel`	Cancels a running Workflow
`io.boomerang.event.workflow.trigger`	Triggers a Workflow
`io.boomerang.event.workflow.wfe`	Sends a Wait for Event
`io.boomerang.event.workflow.status.<status>`	Emits workflow status updates
`io.boomerang.event.task.status.<status>`	Emits task status updates

Now, my reasoning behind splitting events into two dedicated categories is that the first 3 types of messages (trigger, wfe and cancel) will be consumed only by Flow and will perform some action. The other two on the other hand, will be produced by Flow (and consumed by external services) and will acknowledge on some changes, i. e. status update.

However, I do agree with your suggestion and we can keep them together without the separation into multiple categories.

❓ Event Subject Changes

Currently we have the /workflow/5f7e2c8969f04975a0fff357/activity/6010b41bbadf2e7743e03324 as the subject that includes a workflowId and activityId, is this new proposal to adjust this to /workflow/5f7e2c8969f04975a0fff357/6010b41bbadf2e7743e03324.

Previous implementation had the format /<workflowId>/<topic> for trigger events, so I kept it as-is and adjusted the new types of events to match this format. Happy to change it though. My proposal:

CloudEvent Subject	Description
`/workflow/<workflow_id>/topic/<ce_topic>`	Trigger a workflow
`/workflow/<workflow_id>/activity/<activity_id>/topic/<ce_topic>`	Wait For Event
`/workflow/<workflow_id>/activity/<activity_id>`	Cancel a running workflow
`/workflow/<workflow_id>/activity/<activity_id>/status/<status>`	Workflow status update
`/workflow/<workflow_id>/activity/<activity_id>/task/<task_id>/status/<status>`	Task status update

❓ Status Event Source Field

In one of the above examples its show as "source": "io.boomerang.flow"

I think from our previous decisions its to be URI path based

Source: we follow what Kubernetes/Tekton do and make it specific to the object type or path for an API driven system i.e. /apis/v1/activity or /apis/v1/events or even more granular /apis/v1/events/workflow/activity/status etc

With the decision being /apis/v1/events with more context as needed.

Okay, agreed. Will change it to /apis/v1/events for events produced by Flow.

❓ Token Element

Is this meant to be the AuthN/AuthZ token? if so, isn't that part of the header or URL params? as access_token or Authorization.

Nein, this is the token that is configured on the workflow and is intended to check that the person triggering the workflow does indeed own this workflow. This is to prevent the ability for anyone to trigger any workflow.

❓ Backwards Compatability

What are the major differences in the Trigger and WFE payloads? What is deprecated or changed? I am open to not bothering with backwards compatibility for this change.

Given the above, these are the changes in the payload.

Event type
Event subject
Source
Token

The data field is still reserved for workflow parameters, so no changes there.

❓ Timeline / Change Radius

Good news is that most of the work is already completed. There are still some small changes left (including the above) in the workflow service and a small refactoring for listener service to make it compatible with new Eventing. And obviously testing and documentation.

The bad news is the mostly me and @AndreiOla are working on this, and right now we're focused on some other projects. I will resume my work on Flow in a week (or so). Not sure about @AndreiOla though.

tlawrie commented 2 years ago

Hi @corcoja that's fantastic. Thanks for the update. One further question.

❓ Token Element I'm not sure I follow this one. I think you can use the Workflow token as part of the authorization system. In that, it takes it from either the 'Authorization' header or 'access_token' URL parameter and validates that its correct for the Workflow

Are you saying, we need to do a further check as the event gets passed through?

corcoja commented 2 years ago

❓ Token Element I'm not sure I follow this one. I think you can use the Workflow token as part of the authorization system. In that, it takes it from either the 'Authorization' header or 'access_token' URL parameter and validates that its correct for the Workflow

Are you saying, we need to do a further check as the event gets passed through?

This is true for events passed through API calls – the token will be retrieved from Authorization header, access_token URL param or the token field inside CE’s payload. But when Flow consumes the CE from NATS, token will be retrieved from the token field only, so in this case it is mandatory to be present and valid.

tlawrie commented 2 years ago

Hi @corcoja that makes a lot of sense. We will have to make sure that's in the doco 👍

Let me know how you go. Do we put off the refactoring of the listener service into API until you are done?