WICG / pending-beacon

A better beaconing API

Request for comment on API shapes #9

Closed fergald closed 11 months ago

fergald commented 2 years ago

We are considering 3 different APIs and would appreciate feedback. I will include some pros and cons, but in order for us to weigh these correctly, please comment even if a point has already been called out as a pro or con and it's important to you.

This issue focuses on how to set new data on the beacon and deal with a beacon that has already sent its data.

Low-level APIs

These are 2 versions of the API in the explainer. Sync vs async is about the API shape: a sync API does not mean that operations will block waiting for external events; rather, it means that the API does not use Promises and that state cannot spontaneously change mid-task.

Sync

We have setData and isPending; if isPending returns true, the beacon has not been sent yet and setData will succeed. We could also remove isPending and have setData throw an exception, but that's not fundamentally different.
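A minimal sketch of the sync shape, assuming a PendingBeacon with setData() and isPending() as described above (constructor arguments are illustrative):

```js
const beacon = new PendingBeacon('/analytics', { method: 'POST' });

function report(payload) {
  if (beacon.isPending()) {
    // Not sent yet: the queued data can still be overwritten.
    beacon.setData(JSON.stringify(payload));
  } else {
    // Already sent: the page must create a replacement beacon; how
    // that is coordinated is part of what this RFC is asking about.
  }
}
```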

Pros

Cons

Async

We have only setData and no isPending. setData returns a Promise that will resolve if the beacon has not been sent yet and the data was successfully set. It will reject if the data could not be set.

The reason we drop isPending is that its result could be invalid by the time we try to act on it.
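A minimal sketch of the async shape, where setData() returns a Promise that rejects if the data could not be set (e.g. because the beacon was already sent); names are illustrative:

```js
async function report(beacon, payload) {
  try {
    await beacon.setData(JSON.stringify(payload));
  } catch (e) {
    // The data was not set; the caller must coordinate creating a
    // single replacement beacon even if several calls reject at once.
  }
}
```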

Pros

Cons

High-level API

There are two straightforward use cases for the beacon that suggest higher-level APIs. Both of these could be implemented using the low-level API above. The real question is whether these 2 high-level APIs are enough or whether we also need to expose the low-level API.

In both of these, there is no isPending or even a way to tell if data has been sent already.

Appending data

The beacon accumulates data and batches it up for sending. Policies like timeouts control how batching occurs (some data may be sent before the page is discarded). It guarantees (to the extent possible) that all appended data will eventually be sent.

The page never needs to check if the beacon has already sent some intermediate batch, it just keeps appending data.
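A minimal sketch of the appending shape; AppendableBeacon and appendData() are hypothetical names used later in this thread, and the timeout option is illustrative:

```js
const beacon = new AppendableBeacon('/events', { timeout: 60 * 1000 });

addEventListener('click', (event) => {
  // Every appended record is (eventually) sent; the page never needs
  // to check whether an earlier batch already went out.
  beacon.appendData(JSON.stringify({ type: 'click', ts: Date.now() }));
});
```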

Replacing data

The beacon's data is replaced by calls to setData. It doesn't matter whether the beacon has already sent data; it can always be replaced. Again, policies like timeouts control when sending occurs, with a guarantee that the last value set will definitely be sent.

An example use case is reporting LCP values. The page just keeps setting the latest observed LCP, perhaps with a policy that says "don't leave data sitting around unsent for more than 5 minutes".
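A minimal sketch of the replacing shape for that LCP case; ReplaceableBeacon and replaceData() are hypothetical names, and the 5-minute policy is expressed as an illustrative timeout option:

```js
const beacon = new ReplaceableBeacon('/lcp', { timeout: 5 * 60 * 1000 });

new PerformanceObserver((list) => {
  const entries = list.getEntries();
  const latest = entries[entries.length - 1];
  // Each new candidate overwrites the previous payload; the last
  // value set is (to the extent possible) the one that gets sent.
  beacon.replaceData(JSON.stringify({ lcp: latest.startTime }));
}).observe({ type: 'largest-contentful-paint', buffered: true });
```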

Discussion

An example of where these APIs might not work well is where the page would like to merge 2 metrics into 1 beacon if possible. With the low-level API, it would check whether the beacon had been sent already and, if not, replace the data with the combined data. This could reduce network traffic (although arguably connection reuse and header compression make that a small benefit). It could also reduce processing cost by delivering related data already joined.
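A minimal sketch of that merge pattern under the low-level sync shape; createBeacon() is a hypothetical helper standing in for however a replacement beacon gets constructed:

```js
let beacon = createBeacon();
let queued = {};

function recordMetric(name, value) {
  if (!beacon.isPending()) {
    // The previous payload already went out; start a fresh one.
    beacon = createBeacon();
    queued = {};
  }
  // Merge the new metric into the pending payload so related data
  // arrives at the backend already joined, in a single beacon.
  queued[name] = value;
  beacon.setData(JSON.stringify(queued));
}
```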

It may be that these APIs are capable of doing everything that's needed but impose costs on the backend.

It may also be that there are use-cases that simply cannot be met with these APIs.

Please let us know.

yoavweiss commented 2 years ago

/cc @nicjansma @cliffcrocker @andydavies @philipwalton

yutakahirano commented 2 years ago

> dealing with rejected calls to setData is tricky, especially if multiple calls are in flight, full example in the explainer.

> The call to setData does not block and so there may be multiple outstanding calls to setData; now their catch code has to be coordinated so that only one replacement beacon is created and the latest data is set on the beacon (and setting that latest data will be async and subject to the same problems).

I still don't understand this. Are you thinking about different parts of the application accessing the same beacon simultaneously? If so, that sounds like an application problem rather than an API problem. PendingBeacon is a relatively precious resource, and if there are many more requests (say, 1M) than beacons, the application needs to deal with that by queueing, filtering, or merging requests, for example.

fergald commented 2 years ago

Since the answer is long and detailed and might generate more long and detailed answers, I've replied to @yutakahirano in a new issue (#10). I'd like to keep this issue for user feedback on the API.

philipwalton commented 2 years ago

I've given this a fair amount of thought, and generally speaking I think there are two primary use cases for this API. To answer the question posed in this issue, it's worth evaluating how well the options address each of them:

(There may be other use cases I'm not thinking of; please respond if so.)

Saving user state

For the "saving user state" case, the goal would be for an app to be able to restore the user's last state the next time they visit. This is often done via client-side storage, but there could be cases where an app wants to preserve this state across devices or browsers for a given user.

For this use case, the "Replace data" high-level API works well, as you only ever want to send the most up-to-date user data; it never matters what the previous state is. Also in the rare case where the state data fails to be sent, it's only a minor inconvenience to the user, so it's well suited for this API.

That being said, the ability to overwrite data without ever having to check isPending and recreate the Beacon() is only slightly more ergonomic than doing that manually.
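As a concrete illustration, a minimal sketch of the state-saving pattern, assuming the hypothetical ReplaceableBeacon shape (the endpoint and payload are illustrative):

```js
const stateBeacon = new ReplaceableBeacon('/user-state');

function saveState(appState) {
  // Only the latest state matters: each call overwrites the previous
  // payload, and whatever was set last is sent when the page goes away.
  stateBeacon.replaceData(JSON.stringify(appState));
}
```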

Sending analytics data

Most analytics providers operate using an event model, where consumers of their product can report events based on user interactions (or other triggers) and then the analytics code manages sending that data to their backend servers.

Because analytics providers tend to have lots of customers and those customers often have lots of users, the amount of data sent to their backends can be large, so anything that limits both the number of events sent as well as the size of those events can have a big impact.

For this use case, I do not think the "Replace data" high-level API works well, because there's a tradeoff: you gain API ergonomics at the expense of sometimes over-sending data or sending it more frequently than you would otherwise need to.

Let me outline the scenario where that would happen:

  1. User visits a page and makes several critical interactions that the site wants to track via their analytics service
  2. The analytics service SDK adds that event data to the Beacon object with a pageHide timeout of 1 minute.
  3. The user navigates away and then back.
  4. At this point the analytics library doesn't know whether or not the Beacon data was sent, so it has to assume it wasn't and replace the Beacon data with all of the events it has stored in memory.

This is not ideal as it means that the backend now has to add logic to dedupe these events, and if you're an analytics provider servicing millions or billions of user visits, that can be a ton of processing cost.

The "Append data" high-level API would work well for some analytics use cases (e.g. if the event data never changes), but for use cases like RUM analytics where a performance metric value does change as the user is interacting with the page, the "Append data" API is also not ideal because (again) you'd have to write logic on your backend to dedupe that data, which can be expensive.

The "Low-level" sync API handles both of these cases nicely because it can make the determination for itself whether to add new data or replace existing data, and if it needs to replace data it can use the isPending flag to determine how to replace that data to minimize what is sent.

> This could reduce network traffic (although arguably connection reuse and header compression make that a small benefit).

I think this argument is valid for an individual user (e.g. the difference will have little impact on them, their experience, or their data usage), but I don't think it's valid for an analytics provider who has to pay the network and processing cost of every beacon it receives.

nicjansma commented 2 years ago

Appreciate you soliciting feedback on this @fergald, and I agree with @philipwalton's assessment.

Thinking about this question from the perspective of a RUM analytics provider, the low-level sync API feels the most flexible and natural.

In our RUM processing pipeline, the browser (e.g. via our boomerang.js RUM library) will send 1-n RUM beacons to the backend, generally aligned with major "events" from the user. We always want to capture the Page Load's data, but will also send beacons for subsequent in-page interactions, SPA soft navigations, significant errors on the page, etc. We aim to keep each of those beacons as small as possible and the beacon payload "localized" to the most recent event, meaning its payload covers the current event back to the most recent beacon. In other words, after a beacon goes out, we start with "fresh data" for the next beacon.

This allows our back-end processing pipeline to analyze individual beacons without needing to save or restore context from other (possibly zero) beacons. The data that is duplicated on each beacon is general dimensional data (e.g. what the browser is, location, etc). Timers, metrics, and log-style data are generally limited to the most-recent-thing being measured.

As you and @philipwalton mention above, RUM can measure specific data points (that may still change over time), as well as events/logs (that can grow in number of entries over time). Some practical examples of both:

For both of these, being able to replace the current pending beacon's data is critical for us.

The way I envision boomerang.js using this API is along the lines of the proposed low-level sync API. We'd like to continue sending beacons at our existing schedule of major events (Page Load, SPA Soft Nav, etc) with the ability to still "queue" data (for the most recent event) in case the page unloads itself.

For example, on a classic MPA (non-SPA) site:

  1. During Page Load, we prepare a PendingBeacon() and our own JavaScript var beaconData with skeleton data so something gets sent even if an abandon happens (e.g. dimensional data plus Page URL)
    • We'd probably set pageHideTimeout to ~1 minute or so
  2. As the Page Load hits onload we may queue additional data like the Page Load Timers, FCP, etc
    • With beacon.setData(beaconData); calls after updating the beaconData object
  3. We'd continue to add/update timers/metrics to the beacon, e.g. LCP candidates, JavaScript errors, etc
    • More beacon.setData(beaconData); calls
  4. We may have our own internal "timeout" by which we'd .sendNow(), say 5 minutes, so if the user keeps their browser open for 6 hours and doesn't interact with it, we can still give our customers a "real time" view of their visitors by logging those interactions within 5 minutes
  5. As the visitor navigates away, the browser would magically send this full PendingBeacon payload
    • And this would solve our current headaches of having to send a Page Load beacon, plus an Unload beacon and stitch them together
  6. If we come back from a pagehide / BFCache restore, we'd probably flush the last beacon with .sendNow() and start a new PendingBeacon for whatever happens next

If you add a SPA site into the mix (a sketch of the combined flow follows the list):

  1. If a new SPA route is triggered, we'd call .sendNow() to flush out the last event, re-skeletonize our var beaconData object, and create a new PendingBeacon that will track the SPA route data
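Putting the MPA and SPA steps together, a minimal sketch of this flow, assuming the low-level sync API; the endpoint, the pageHideTimeout option, and the helper functions are all illustrative:

```js
let beaconData = { url: location.href /* plus dimensional data */ };
let beacon = new PendingBeacon('/rum', { pageHideTimeout: 60 * 1000 });
beacon.setData(JSON.stringify(beaconData)); // skeleton, in case of abandon

addEventListener('load', () => {
  beaconData.timers = collectLoadTimers(); // hypothetical helper
  beacon.setData(JSON.stringify(beaconData));
});

// Internal flush so customers still get a "real time" view.
setInterval(() => beacon.sendNow(), 5 * 60 * 1000);

// SPA route change or bfcache restore: flush the old event, start fresh.
function startNextEvent(url) {
  beacon.sendNow();
  beaconData = { url };
  beacon = new PendingBeacon('/rum', { pageHideTimeout: 60 * 1000 });
  beacon.setData(JSON.stringify(beaconData));
}

addEventListener('pageshow', (e) => {
  if (e.persisted) startNextEvent(location.href);
});
```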

Given our use case, the low-level sync API seems the most natural.

For the high-level APIs, I think ReplaceableBeacon would satisfy our needs? We would .replaceData(data) and .sendNow() as needed?

I have a couple questions, to make sure I understand though:

  1. It seems like a lot of the tradeoffs mentioned above by both Fergal and Philip are around needing to see the .isPending state, but I'm actually struggling to think of a case where we'd ever check .isPending. isPending would only change "unexpectedly" (outside of a .sendNow()) after a pagehide, right? I think our logic is always either forcing a send (due to a SPA navigation, restore, etc) or waiting for the data to be sent after unload, so I don't think we'd need to check .isPending anywhere... is that right? In Philip's step 4 of coming back from a restore, I think we'd just .sendNow() and start fresh regardless.
  2. For AppendableBeacon, what does appendData(data) do? If data is an object, is it merging in the properties of the last call to appendData(data) with the new properties of this data?
fergald commented 2 years ago

@nicjansma Thanks for the detailed feedback.

For your questions

  1. It does seem like you would want to use isPending() after coming back from BFCache, to check whether the pageHideTimeout kicked in or not and whether you need to start a new beacon or can just replace the data on the existing one.
  2. It's not a concrete plan, so whatever would work. I would imagine that we'd be gluing the datas together somehow to create a single final payload in some encoding (for a POST beacon, maybe multipart form encoding).

The thing that would make the replace/append API insufficient is if there are cases where you would replace only some of the payload with an update, e.g. a single beacon carrying Page Load Timers, LCP and CLS all at once. Then you would need to know whether something had already been sent, e.g. you don't want to send the PLT a second time if it's already been sent once. To use replace/append for that, you'd probably want to put them on different beacons: PLT set only once, but LCP and CLS just keep getting updated, and now you have 3 beacons instead of one.
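To illustrate, a minimal sketch of that three-beacon workaround; AppendableBeacon, ReplaceableBeacon, and the helper callbacks are all hypothetical:

```js
const pltBeacon = new AppendableBeacon('/plt');
const lcpBeacon = new ReplaceableBeacon('/lcp');
const clsBeacon = new ReplaceableBeacon('/cls');

// Page Load Timers are set exactly once and never updated...
addEventListener('load', () => {
  pltBeacon.appendData(JSON.stringify(collectPageLoadTimers())); // hypothetical helper
});

// ...while LCP and CLS keep overwriting their own beacons, so the last
// observed values ship. One logical payload now rides on three beacons.
onLcpCandidate((lcp) => lcpBeacon.replaceData(JSON.stringify({ lcp }))); // hypothetical
onClsChange((cls) => clsBeacon.replaceData(JSON.stringify({ cls }))); // hypothetical
```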

philipwalton commented 2 years ago

@nicjansma

> 1. It seems like a lot of the tradeoffs mentioned above by both Fergal and Philip are around needing to see the .isPending state, but I'm actually struggling to think of a case where we'd ever check .isPending.

It sounds like the main reason you don't see a use for isPending is that you plan to always send after a bfcache restore. However, if the API were to change from pageHideTimeout to a more general background timeout (as discussed here), then I imagine you would want to check isPending and not send the queued data if you didn't need to, correct?

nicjansma commented 1 year ago

@philipwalton apologies for the late reply, but that sounds correct!

mingyc commented 11 months ago

The API shape has evolved into the fetchLater() API after #70 and https://github.com/whatwg/fetch/pull/1647. Closing this for now.