Reporting and Top-level Execution Timeouts

JacobGo commented 10 months ago

Prior proposals introduced seller-controlled timeouts for bidding and scoring logic (#293). It appears that Chrome applies a different, hardcoded 50ms timeout for reporting functions. After observing persistent flakiness in our automated tests due to reportWin and reportResult timeouts, we believe that 50ms is too tight for resource constrained environments and may also result in losing production billable events after rendering Protected Audience creatives.

Given reporting (1) occurs out of band from rendering the creative and (2) incurs a smaller constant time complexity (one winner and one or two sellers), we’re not sure there’s a need for such a strict timeout. We would like sellers to have the ability to control reporting timeouts, or for Chrome to perhaps lift the threshold to some number of seconds?

Furthermore, we would like to better understand the top-level execution timed out error we also observe in automated testing. Given that buyers and sellers most likely do not intend to do much work in the top-level scope, we would like to better understand which timer/timeout manages top-level execution and how we might debug these timeouts further (e.g. where should we look in the performance timeline to profile top-level execution).

dmdabbs commented 9 months ago

may also result in losing production billable events after rendering Protected Audience creatives.

Checking in. Is this something the Chrome team is considering?

rdgordon-index commented 9 months ago

https://github.com/WICG/turtledove/pull/906 has some useful insight regarding timeouts.

dmdabbs commented 9 months ago

Speaking of #906, it is still not committed.

dmdabbs commented 9 months ago

@JensenPaul , after today's WICG discussion can we 'elevate' this desire for reporting worklet timeout control to a "feature request?"

dmdabbs commented 9 months ago

@JensenPaul or @JacobGo, I returned to the discussion notes to see whether, in addition to raising the reporting functions timeout ceiling, the seller would be afforded a configuration setting, but the notes don't capture this.

My recollection is that you were open to introducing a seller-controlled value and raising the ceiling. No separate buyer/seller reporting timeout was proposed. KISS.

Would you kindly please confirm that? Also if this qualifies as a Non-breaking Feature Request would you also please apply that label?

Thank you.

JacobGo commented 9 months ago

Yes, seller defined reporting timeout controls would remedy the issue, although reusing the existing sellerTimeout and perBuyer(Cumulative)Timeouts auction controls would also suffice and avoid further bloating the auction config, at the cost of losing granularity to account for the smaller time complexity for reporting winners.

Making this a seller control would enable sellers to fine tune auction performance against page speed and measure the impact of the timeouts in production, although, as mentioned on the call, there is a strong incentive for all participants to guarantee delivery of the reporting ping as a billable event. If every seller and buyer decides to max out the timeout to Chrome's desired ceiling, shouldn't Chrome just raise the 50ms constant to this ceiling directly?

JensenPaul commented 8 months ago

I'm not a huge fan of reusing sellerTimeout and perBuyerTimeout as these are timeouts for things that are expected to happen many times, versus reporting scripts which are invoked only once. I'm also not a huge fan of reusing perBuyerCumulativeTimeout as it's something meant to apply to both network and compute time, and it's unclear if we'd want to make reporting time part of the cumulative time (which wouldn't work well for a buyer who used all the cumulative timeout to make their bids and then had no time to report) or not part of the cumulative time (which would be an odd doubling of timeouts that aren't particularly related). I'd caution against over-optimizing before we see how folks use this setting. I'd also caution against micro-optimizations like trying to reuse other values for this setting. I think it might be simplest to add one or two new timeout controls to cover reportWin() and reportResult() and then adjusting later if they're found to be redundant.

JensenPaul commented 8 months ago

We discussed this API change in yesterday's WICG Protected Audience meeting. There was feedback that there are other Web APIs that similarly control giving APIs time to finish up activity from a web page, namely Fetch's keepalive setting and ARA's similar functionality. There was feedback that per-buyer reporting timeouts didn't seem necessary.

Seems like a simple reportingTimeout field in the auction config that specifies the milliseconds that reporting scripts are allowed to run, perhaps capped at 5s, would address the needs.

dmdabbs commented 7 months ago

Yes, that sounds good, @JensenPaul

laurb9 commented 7 months ago

Our observation is that the successful execution of reportResult with respect to the timeout is adversely impacted by larger js file size when reportResult size remains the same, so it seems the timeout covers more than just the execution.

It was mentioned in the meeting that the js is recompiled before running reportResult, and that this time is included in the timeout. So other than tightening the timeout scope or reusing frozen context, splitting the reporting script into its own resource could also address this issue.

dmdabbs commented 7 months ago

To the discussion today about separating buyer bidding and reporting code into two scripts, with the former not subject to k-anon, @thegreatfatzby's suggestion for this that @michaelkleber recalled is here.

dmdabbs commented 7 months ago

Speaking of timeouts...https://github.com/WICG/turtledove/issues/1083

MattMenke2 commented 7 months ago

Our observation is that the successful execution of reportResult() with respect to the timeout is adversely impacted by larger js file size when reportResult size remains the same, so it seems the timeout covers more than just the execution.

So to runReportResult() or scoreAd(), there are 4 phases (ignoring trusted signals, WASM scripts, waiting for input to be ready, waiting for process quota, getting/creating a process)

1) Download Javascript file. 2) Parse Javascript file. 3) Create a fresh Javascript context and run the Javascript file. This runs the top-level code. It runs code at global scope, sets up scoreAd() and reportResult() (Which are both in the same file), etc. 4) Run the scoreAd() / reportResult() method.

The script timeouts affect 3) and 4), which we always do each time we run a seller script. 1) and 2) are not counted.

Note that for top-level seller scripts, we reuse the results of 1) and 2) between scoreAd() and reportResult(), but component seller scripts we often do not. We don't know which component seller will win, so we free up the component seller's resources as soon as its done scoring ads to free up resources and stop taking up some of Chrome's global seller worklet process quota.

MattMenke2 commented 7 months ago

An issue came up during review about what we should do in the case of component auctions.

There are 3 reasonable behaviors for component sellers + bidders that occur to me: 1) Take the min of top-level auction value and component auction value. 2) Use the component auction value if present, otherwise take the top-level value, if present. Otherwise, use the default. 3) Only use the component auction value, if present. Top-level value only applies to the top-level auction.

All the other timeouts work like 3). I think we should stick with 3) for this, until/unless we reach consensus on moving all other timeouts to another model, or decide we specifically do want different things for different timeouts. I filed issue 1094 specifically to discuss this.

Note that the CL that was sent to me did 2) - I think once we implement a new behavior, it becomes harder to change course, and I don't think 2) is obviously the right solution, so think it's best to be consistent for now.

dmdabbs commented 7 months ago

All the other timeouts work like 3). I think we should stick with 3) for this, until/unless we reach consensus on moving all other timeouts to another model, or decide we specifically do want different things for different timeouts. I filed issue 1094 specifically to discuss this.

Agree 👍

JacobGo commented 7 months ago

Staying consistent with (3) sounds reasonable to us given there's no dependency from top-level reporting on component reporting IIUC.

qingxinwu commented 4 months ago

It was rolled out to 1% stable on June 05, and will hopefully go to Stable 100% this week

dmdabbs commented 4 months ago

It was rolled out to 1% stable on June 05, and will hopefully go to Stable 100% this week

Thank you for the update. But there is a lag in the feature detection, at least on a web page:

const reportingTimeoutEnabled = navigator.protectedAudience ?
    navigator.protectedAudience.queryFeatureSupport("reportingTimeout") : false;

Since queryFeatureSupport remains unavailable beyond Canary:

navigator.protectedAudience.queryFeatureSupport
VM125:1 Uncaught TypeError: Cannot read properties of undefined (reading 'queryFeatureSupport')
    at <anonymous>:1:2

morlovich commented 4 months ago

navigator.protectedAudience should show up if any feature you can detect via queryFeatureSupport exists, which includes features that have been launched recently (but which you may not have had at the time of your comment); it certainly shows up in the stable version I am typing this on.

Please notice, however, that the guard is on navigator.protectedAudience, not navigator.protectedAudience.queryFeatureSupport --- it's the former that may be missing depending on the flags; if it's there than the latter is always available.

qingxinwu commented 4 months ago

rolled out to 100% stable yesterday (June 26)

WICG / turtledove

Reporting and Top-level Execution Timeouts #959