App Liveness Check - Githubissues

robmoffat commented 2 months ago

Minor Issue

As expected, it seems like there is no way to tell when an app has died if it is using the ParentWindow approach to MessagePorts. Works fine with IFrames as you can use webSockets etc. but this is a problem for when there isn't an iframe.

I propose we have a call/response from the agent, so the agent can ping the app with "are you alive?" and the app must respond "yes" in some way..

kriswest commented 2 months ago

So... it appears that there is an onclose() handler on a MessagePort: https://html.spec.whatwg.org/multipage/web-messaging.html#handler-messageport-onclose Although for some reason MDN doesn't mention it. That should fire when the port is explicitly closed with close() or if the port is garbage collected. That should catch the majority of cases but needs testing - there is no canIuse page for the onclose handler: https://caniuse.com/?search=MessagePort%20API

The client code could/should also use onbeforeunload to send a WCP6Goodbye message when it detects closing to catch more cases - but there will always be the render crash case to handle... A heartbeat message is a common way to solve for that and it is what we've discussed before. Perhaps I should rename WCP6Goodbye to WCP7Goodbye and introduce a WCP6Heartbeat and WCP6HeartbeatResponse? Or this could be communication protocol message... maybe thats better.

I would have suggested we don't need a response if postmessage throws an error when it can't send - but I don't think it does. The messageError event seems to be about receiving messages you can't deserialize so is no help.

Let me know your preference for WCP or comms protocol message and I'll add something.

robmoffat commented 2 months ago

I don't know if this will work. The message port is created on the agent side and sent to the client. Potentially, it could be sent to lots of different places... so there's no way for the port itself to know when to fire a close.

Similarly, onbeforeunload might not occur if the window dies or hangs.

Postmessage I'm not worried about - this is a request/response process with a timeout so it's not important there. This should be part of the main protocol, not the connection protocol

kriswest commented 2 months ago

Either end can close their port and the event fires on the other port .

The close event will be fired even if the port is not explicitly closed. The cases where this event is dispatched are:

the close() method was called; the Document was destroyed; or the MessagePort was garbage collected. We only dispatch the event on otherPort because initiatorPort explicitly triggered the close, its Document no longer exists, or it was already garbage collected, as described above.

https://html.spec.whatwg.org/multipage/web-messaging.html#handler-messageport-onclose

So the port itself does know on an explicit port.close() call (which you can make in an onbeforeunload handler), the document is destroyed (on navigate/window close) or the port garbage collected. Hence, the edge case we're talking about is on render thread crash (window dies or hangs) - and only if the browser doesn't handle that and still cause close. You can simulate that case by killing the thread in the chrome taskmanager activity monitor (use taskmanager to get PID).

Do you wan to have a play with port.onclose() before I add a heartbeat message, or do we just go for it?

robmoffat commented 2 months ago

I've just tried adding an onclose - it doesn't seem to work for me.

kriswest commented 2 months ago

In which case? the other side calling close(), closing the window? Force kill the render thread?

Also in what browser? Chrome, Safari, FF? It looks like a later addition to the spec...

Hmm, they were implementing it back in Nov 2022, then there was some problem with a security review: https://github.com/whatwg/html/issues/1766#issuecomment-1958782062 There's an open issue to add it to MDN: https://github.com/mdn/mdn/issues/517 THeres a propsal to restrict when it fires to deal with the CHrome teams's security feedback (its exposing when GC occurs, which is a problem for reasons...): https://github.com/whatwg/html/issues/10201

If this isn't usable, we'll need to look at a heartbeat...

robmoffat commented 2 months ago

I tried with chrome 127 on Mac, just closing the tab to see what happens.

Although FDC3 for the web is all about MessagePort (which seems like it may or may not have onclose depending on the browser), I don't think we want to tie the protocol to this specifically.

I think we should add this in so that we are effectively agnostic on the underlying communication mechanism.

Also, I think it might be better to stick to the request/response design you've applied to the rest of the architecture, so DA requests a response from the app and marks it as dead if something doesn't come back in a certain time.

kriswest commented 2 months ago

@robmoffat I've added a heartbeatEvent message sent from the DA (containing a timestamp as its only data), and a heartResponseRequest for the app to send back (quoting the original timestamp from the heartbeatEvent in case anyone needs to relate them at some point). See: 0dbdf24737172ea8a00b52384b2be6a9c619c3e8

kriswest commented 2 months ago

N.b. heartbeat messages can be used for adHoc testing of whether an app is alive - for example they could be used to check intent resolver options and to discard/disconnect dead apps that should not appear.

kriswest commented 2 months ago

Notes from meeting #1321 and subsequent research, and continuing discussion from #1263 (where we added WCP6Goodbye):

WCP6Goodbye should be sent when the page is closing
- We'd discussed using beforeunload to send the goodbye, but thats the wrong event as it allows cancellation of the close and a different handler could cancel closing.
- unload is the event that is called afterwards when actually closing and would be appropriate - except that its been deprecated in the HTML specification - in part due to issues we're already aware of (it doesn't always fire), but also some implementation troubles. We're already planning to handle it not firing with a periodic heartbeat.
- The heartbeat can be used both periodically and for adHoc checks on whether an application is alive (e.g. check all options shown in the intent resolver and remove those that are dead).
- However, we also identified that there might be problems with heartbeats where apps are 'frozen' to preserve memory
- The correct alternative is likely to be using the Page Life Cycle API, which encompasses and extends the existing events:https://developer.chrome.com/docs/web-platform/page-lifecycle-api
- Use pagehide to detect windows closing (they are closing if the persisted flag is false)
- the freeze and resume events can be used to detect if the page is being frozen (which may mean it stops replying to heartbeats.

Correctly implementing heartbeat behaviour on page freeze is likely to require further exploration. There are some useful details on testing etc. in the FAQ on the Page LifeCycle API page: https://developer.chrome.com/docs/web-platform/page-lifecycle-api#faqs

For now, I recommend implementing the WCP6Goodbye message on pagehide with persisted false and exploring periodic use of a heartbeat in the implementation, before adding recommendations to the standard (to use the heartbeat periodically and how to deal with frozen pages). We should still add the message exchange and explain usage until this has been explored by someone.

@robmoffat @novavi

kriswest commented 2 months ago

@robmoffat I was briefly not sure we need the incidental use of heartbeatEvent and heartbeatAcknowledgement as a DA can check the closed property of a WindowProxy whenever it wants - this was going to be the original approach to liveness checks, with periodic polling as well. closed will be true if the window was closed or has navigated cross-domain.

However, I suppose there is an edge case in that window might navigate same domain, but no longer be an FDC3 app (e.g. a link that rendered a PDF or other doc)... I guess we do need the message exchange.

kriswest commented 2 months ago

I've added a write on disconnects to the Browser-Resident Desktop Agent specs: preview link: https://deploy-preview-1191--fdc3.netlify.app/docs/next/api/specs/browserDesktopAgents#disconnects

robmoffat commented 2 months ago

closed will be true if the window was closed or has navigated cross-domain.

Yes, it's the same security issue. I think you might want to strengthen your wording to avoid people using this by mistake, as it won't have the intended effect.

Hence, if an equivalent WindowProxy object is received from a different application the DA should consider the original application using that WindowProxy to have closed.

I didn't understand this part.

kriswest commented 2 months ago

closed will be true if the window was closed or has navigated cross-domain.

Yes, it's the same security issue. I think you might want to strengthen your wording to avoid people using this by mistake, as it won't have the intended effect.

Same security issue? In this case, navigating cross-domain does set the closed flag - there's just no event. I think the security review issue on the event is because its an event and it can be used to target an attack as the GC is running. Whereas, closed on WIndowProxy is a field you have to check and can't be used to time an attack.

Hence, if an equivalent WindowProxy object is received from a different application the DA should consider the original application using that WindowProxy to have closed.

I didn't understand this part.

I'll clarify this then - basically WindowProxy objects can be compared with ==. If they relate to the same window they are equal/equivalent. Hence, if you retain one, you can use it to check if other messages come from the same source.

kriswest commented 2 months ago

Clarified both sentences and added content on retaining instance details for closed instances (as they can come back from the dead / briefly die during navigation) a161739175f61bae42a9968907635dce50a1d65a

kriswest commented 2 months ago

Closed as completed

finos / FDC3

App Liveness Check #1320

Minor Issue