Adding debugging information to help app developers trace broadcast storms

julianna-langston commented 5 months ago

Enhancement Request

Use Case:

I’m building FDC3-enabled apps for demo purposes. For the sake of argument, let’s say I have a Price App that receives symbol contexts, and broadcasts price information. I also have an Order App that listens for price information, and sometimes broadcasts symbols. As you can imagine, the Price App and Order App can easily get into a situation where they bounce off of each other infinitely. (This is similar to the “broadcast storm” problem that rises in the Networking field).

Reference

flowchart LR;
    A[type = fdc3.instrument, symbol = MSFT]-->B([Price App]);
    B-->C[type = fdc3.valuation, price = 100];
    C-->D([Order App]);
    D-->A;

This was the kind of bug that was obvious in retrospect, but took some time to find. Since I was working on both the Price App and the Order App, I wound up adding metadata to the price and symbol contexts that were being broadcast. I added the metadata {threadId: “my-uuid-1"} to the price information. Then, when the Order App was “responding” to a context, it would pass along the threadId of the context it was “responding” to. Using this metadata, I was able to trace these psuedo-threads, and figure out where my storms were generating from.

It strikes me that I’m probably not the first person (and certainly not the last) who’s going to encounter these kinds of problems. The standard feels like the ideal place to lay out groundwork for how this situation would best be handled.

Summary: When 2+, independent apps respond to a series contexts/intents, it's possible to generate an infinite loop. What tools can we provide to help debug and/or prevent these infinite loops?

Proposal

Desktop Agents would add metadata to contexts and intents if that metadata doesn’t already exist. That metadata would have the property traceId, with a value in the style of a uuid v4.

When app developers listen for contexts/intents and “responding” to them, they would pass along the traceId of the context/intent that they’re “responding” to.

The Desktop Agent can track the number of messages using the same traceId, determine if a storm is brewing, and put a lid on it. I defer to the desktop agents themselves about how they would detect storms, and what they would do to stop them.

Workflow Example

Here is an example of how the proposal could work.

Price App broadcasts the following code:

{
  "type": "fdc3.instrument",
  "symbol": "MSFT"
}

The Desktop Agent injects a traceId into it, so that Order App receives:

{
  "type": "fdc3.instrument",
  "symbol": "MSFT",
  "traceId": "699381f3-284f-4014-a476-e114ef951cef"
}

The Order App will then send off price information about MSFT. Since it's responding to a context, it will include the traceId of the context it's responding to.

{
  "type": "fdc3.valuation",
  "price": 100,
  "CURRENCY_ISOCODE": "USD",
  "traceId": "699381f3-284f-4014-a476-e114ef951cef"
}

Then, when the price app receives this fdc3.valuation context, it will continue to forward the same traceId.

Let's say the desktop agent will allow the thread to have a maximum of 100 messages in a single minute. If these messages start to generate an infinite loop, the desktop agent can detect it and respond accordingly... perhaps, the DA will stop sending the messages, and maybe log an error or warning message. I defer to any individual DA for how they'd want to handle the situation.

This proposal would involve changing the FDC3 Context base object to include:	Property	Type	Required	Example Value
traceId	uuid	No	`"699381f3-284f-4014-a476-e114ef951cef"`

robmoffat commented 4 months ago

thanks for the idea, @julianna-langston! This makes a lot of sense, although I think from my perspective it's more in implementation land than standards side. What do the other maintainers think?

Personally, I've solved this problem by only broadcasting when the context changes rather than every time it gets set. i.e. keeping tabs on the latest version of each type of context.

kriswest commented 4 months ago

This makes a lot of sense, although I think from my perspective it's more in implementation land than standards side. What do the other maintainers think?

@robmoffat indeed, it'd be up to an implementation (either desktop agent or app) to populate the field, but as we standardize the context objects it'd be on the standard to provide a standard spot for the info. In the base Context schema would make sense (making it available to all sub-types, standardized or otherwise).

One other solution to consider is adding something to the ContextMetadata that is provided to ContextHandler and IntentHandler. However, the problem there is how do we let an app pass it in, which would require additional optional arguments on the broadcast, raiseIntent and raiseIntentForContext API calls. It'd be easy for a Desktop Agent implementation to add a threadId without that, but on rebroadcasting something or responding to something you need to be able to pass in the right value via an argument.

Finally, I'm not in love with adding a metadata element to Context, as we already provide one alongside it (ContextMetadata). Hence, there are two possible solutions (and I'm not sure which I prefer):

Add a threadId (or similar) to Context
Add an optional threadId argument to the broadcast, raiseIntent and raiseIntentForContext API calls and use it to populate an optional threadId field in ContextMetadata

julianna-ciq commented 4 months ago

Takeaways from the February 22 WG meeting:

The property shouldn't be called threadId. Current working property name: traceId
Let's put the property on the base context object
Proposed to make traceId a SHOULD in the current major version, and update to a MUST for the next major version
This could work with OpenTelementry. Should we include examples for the use case?

I will update the issue to reflect these changes.

rikoe commented 4 months ago

I am not sure this is a good idea for adding to the standard. It feels similar to the HTTP standard demanding that all HTTP requests must or should include the user-agent header (or any other example of HTTP header metadata).

But this feels slightly worse than that, in that it actually leaks random local implementation data in a publicly visible message. The particular uuid string is not useful to anyone else other than the application it originated from, so why should it be shared as part of a public message.

FDC3 context types sent as part of intents are meant to represent business workflows, not as a way to aid developers for sending extra data between applications (there are other ways of doing that).

In my experience building FDC3 applications, it is easy enough to get past the "circular intent / circular context broadcast" issue, because each app knows the intents and context broadcasts it is transmitting, and handling circular messages can be easily excluded with simple if statements. After all, the intents handling API call already includes both IntentMetadata and ContextMetadata that could be used to filter a message with if statements based on e.g. its AppIdentifier, i.e. source.

If you really do need to send some custom identifier for tracing to another app, the API has been designed to be extensible so that you can add any key-value pair to your message, and any app delivering an intent should include the whole context payload, including any extra included properties and values, so that it is not necessary to formalise these custom key-value pairs.

But to make this part of the standard just feels wrong to me, and seems a roundabout way to resolve a development issue that could be easily resolved in other ways as well.

nkolba commented 4 months ago

Completely agree with @rikoe here. If the goal is to avoid circular workflows in your implementation. Seems like the best place to this would be behind the scenes in the desktop container/message broker rather than leaking this into the context data definition. Either way, this could be done today in the specific implementation without changing the standard.

kriswest commented 4 months ago

@rikoe @nkolba The discussion of this proposal at the Standards Working Group was interesting and isn't yet fully represented here (nor are the meeting minutes posted yet, I'm still working on those). A number of participating firms picked up the discussion and related some past experiences as well as interest in making some form of enhancement to support, with another maintainer suggesting that it should become required in future (not sure I wholly agree, but we're not finished discussing the topic as this was the last issue in the meeting and we ran out of time).

The suggestion was to add a standard place that an app could put information that would help to relate different API calls/context messages back to the same interaction. That has multiple applications, including debugging and (as we heard from a couple of participants) tracking/analytics on those interactions wherever they involve multiple exchanges or hops. As those interactions may span multiple applications and hence multiple developers or firms, standardization can help by ensuring that the information is put in a consistent place. That's why the particular UUID could actually be useful to the desktop owner (for analytical use) or other app developers (replicate the same ID if responding to a particular message or continuing a particular workflow).

Adding a recommended spot in the context schema would also mean that location is consistent across different broadcasts, raised intents etc., which makes it more useful. I personally think it should be optional and (as I commented in the meeting) it cannot be required until at least an FDC3 3.0 version as it would be a breaking change (at least the way I interpreted the comment at the meeting - see below).

There are alternatives to a consistent location in the context schema that are possible, for example broadcast and raiseIntent (all version in each case) could both pick up an additional optional argument, which would allow the id to be placed in the ContextMetadata object by the Desktop Agent instead. That id might carry over different intents/broadcasts of course.

Either way, this could be done today in the specific implementation without changing the standard.

True, but there are advantages to the standard taking a view and, if the API approach were chosen, encouraging adding arguments to a standardized API isn't generally a good thing to encourage as it breaks that same standardization. The discussion covered the fact that anyone could add fields to contexts to do this, but it was also specifically about the standard lending a hand by recommending a standard way to do it that would then be more often usable across multiple apps/vendors/steps without prior coordination (that being what we try to achieve via standardization). I suspect the comment about this being required may have related to using the proposed location for the property rather than always filling it in - but I'm just guessing and will leave it to the commenter to clarify.

The topic will be only the SWG meeting agenda to discuss next month - if my explanation of the discussion hasn't helped, come along to discuss it further.

P.S. I didn't quite understand the concern about "revealing random local implementation data". Are you worried about the UUID generation in some way?

nkolba commented 4 months ago

It seems like the intended consumer of the traceId is the desktop agent. Is there some reason that apps need to read this field? Is there a reason a desktop container wouldn't already have access to this information in its internals?

kriswest commented 4 months ago

It seems like the intended consumer of the traceId is the desktop agent

Only in one of the two use cases proposed (and then only to facilitate passing onwards to an analytics process).

The relationship between two different API calls, made by two different apps that would be tied together by using the same id, is not necessarily accessible to the desktop agent or container. Actions in a particular workflow may be distributed in time, interleaved with/overlap with actions from a different workflow or adjacent to actions from a different workflow that they need to be differentiated from. This can happen in even the simplest of use cases as broadcasts and raised intents are processed within an application before new calls outward are generated.

FYI, this is not my proposal, I'm just filling in what I know of it from the meeting discussion.

kriswest commented 4 months ago

For the tracking use case, I actually wonder whether a single id field is sufficient... Whether the additional metadata goes into the context object or ContextMetadata, it then starts to remind me of this issue and use case, which also wants to supply additional (optional) metadata alongside the context message:

829

With several use cases that all want some form of additional metadata to be passable alongside the context message, I think there's a conversation worth having.

kriswest commented 5 days ago

@julianna-ciq we're overdue to update this issue to reflect the more important use cases and to incorporate the proposed changes at meeting #1196 (see minutes posted at bottom of that issue for details)

finos / FDC3