abrik0131 opened 1 year ago
Hi Alex, I'd like to pull apart the different types of events you'd like to monitor. Taking them out of order, if you don't mind:
- crashes that occur in Fenced Frames as they run rendering JS code
For Fenced Frames that have network access, nothing new is needed for this, right? It's just like everything in the browser today?
- Failures on URL fetches, including timeouts
In previous FLEDGE calls we had discussed this kind of reporting using https://github.com/WICG/turtledove/blob/main/FLEDGE_extended_PA_reporting.md. Are you saying that aggregated reports like this are not good enough?
- crashes that occur in seller auction worklets
- crashes that occur in buyer bidding worklets
These are the tricky ones: if they were to include stack traces, then the worklets could use the error reporting mechanism to smuggle out all the information that we want to avoid leaking, including e.g. joined publisher-site and advertiser-site identities. Have you thought about ways this could be made privacy-safe?
In previous FLEDGE calls we had discussed this kind of reporting using https://github.com/WICG/turtledove/blob/main/FLEDGE_extended_PA_reporting.md. Are you saying that aggregated reports like this are not good enough?
Aggregated reports could be used in principle. However, we are concerned that aggregated reports will not satisfy our delay requirement: SRE would need access to the monitoring time series data in real time, with a maximal delay of approximately 5 minutes for critical metrics during emergency response. If the delay is too large, incident detection time increases and we may incur significant revenue loss by the time we detect that there is an incident. In addition, a larger delay in the monitoring data would make it harder for us to verify whether a mitigation is effective and would delay the time required to mitigate outages.
These are the tricky ones: if they were to include stack traces, then the worklets could use the error reporting mechanism to smuggle out all the information that we want to avoid leaking, including e.g. joined publisher-site and advertiser-site identities. Have you thought about ways this could be made privacy-safe?
I understand the privacy concern. Could stack traces be available temporarily? If not, could we get some information about a crash type?
@abrik0131 in theory, if the aggregated report was available immediately with a data freshness that's within that 5 minute delay SLA, does that satisfy the requirement? Are you asking us to only solve for the delay and not the granularity of the data?
if the aggregated report was available immediately with a data freshness that's within that 5 minute delay SLA, does that satisfy the requirement?
Yes, that satisfies the requirement.
Are you asking us to only solve for the delay and not the granularity of the data?
Yes, we are interested in the total number of errors/reports. The main issue is getting the data quickly enough.
@abrik0131 is it possible to understand a bit more about the impact of having this data not at 5 minutes, but at, say, 120 minutes (2 hours)? And which intermediate intervals (15 minutes, 30 minutes, 60 minutes) would you find acceptable?
The error reporting API is intended to quickly detect the most severe types of outages, where in the worst-case 100% of remarketing queries would crash or return errors. This would both impact our revenue and impact publishers which would not receive remarketing ad revenue for the duration of the outage.
The longer the delay is, the longer it would take for us to apply mitigations or press red buttons which could mitigate the outage impact. If there was say a 1 hour delay, then in the worst-case 100% of remarketing queries would fail for an extra hour before we could detect the outage.
The monitoring delay would also impact the time to mitigate an outage in the case where pressing our red buttons fails to mitigate the outage. In this case, our debugging workflow involves making changes to the serving configuration and then waiting for a signal on whether the change was effective or not, and a longer monitoring delay would increase the time required for each iteration. In this case, an extra 1 hour of monitoring delay would correspond to several hours of outage impact. For example, if we have 4 emergency red buttons (config changes, rollbacks, etc.) to press, we would need to press one button at a time and wait 1 hour to get a signal to see if the mitigation attempt was successful. This could bring the time to mitigation to 5 hours after detection or possibly more if other mitigation mechanisms are required.
As to whether a longer delay is acceptable, this would depend on what fallback mechanisms we have available in FLEDGE in the case of crashes or errors. For example, if there was already a fallback in the browser to show the contextual ad, this would reduce the impact compared to the browser showing no ad in the case of a crash.
In general, we would strongly prefer the reporting delay to be 5 minutes or less.
@abrik0131 another question: can you please help us understand, from the above summary, which events and data requested can be addressed by the `forDebuggingOnly` function? For details, see the explainer section on this loss reporting function here: https://github.com/WICG/turtledove/blob/main/Proposed_First_FLEDGE_OT_Details.md#reporting. We suspect the answer is "none" but it's important to clarify, as `forDebuggingOnly` needs to be retired by 3PCD.
To monitor bidding and scoring JavaScript crashes, is it sufficient for the JavaScript to catch the exceptions and report them via a sampled reporting API like this?
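To make the question concrete, here is a minimal sketch of what catch-and-report could look like. `privateAggregation` is stubbed so the sketch runs outside a worklet, and the bucket number and wrapper name are invented for illustration; this is not the actual Protected Audience API surface.

```javascript
// Stub of the browser-provided Private Aggregation API, so this sketch
// can run outside a worklet environment.
const privateAggregation = {
  contributions: [],
  contributeToHistogram({ bucket, value }) {
    this.contributions.push({ bucket, value });
  },
};

// Hypothetical bucket meaning "bidding logic threw an exception".
const BIDDING_ERROR_BUCKET = 42n;

// Wraps arbitrary bidding logic; on an exception, reports only WHICH
// bucket failed (no stack trace or other details) and returns no bid.
function generateBidSafely(bidLogic, ...args) {
  try {
    return bidLogic(...args);
  } catch (e) {
    privateAggregation.contributeToHistogram({
      bucket: BIDDING_ERROR_BUCKET,
      value: 1,
    });
    return null; // no bid on error
  }
}
```

Note this pattern only covers exceptions the JavaScript itself can catch; crashes of the worklet process or timeouts enforced from outside would not be observable this way.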
We wanted to repost with our updated understanding of our requirements since the original post is now over nine months old and a lot has changed including the introduction of a proposal for sampled debug reporting.
Real-time monitoring is intended for speedy detection of issues from histograms and time series where we largely don’t care about individual data points in isolation but where we do care about sudden changes in aggregated metrics across all auction invocations (e.g. increased latency or crash rate or bid anomalies to name a few). We would like to emphasize that it’s not limited to JavaScript and worklet crashes, which was the focus of our previous discussions.
Our requirements are:
Something roughly shaped like the Private Aggregation API gives us most of our mileage, provided it can guarantee O(minutes) delay for a conceivably very large stream of events.
We believe the requirements are complementary to those of Issue 162’s proposal for sampled debug reporting since for monitoring we want full browser/auction coverage with lower entropy and without locking out browsers. This is in contrast to the new debug sampling proposal’s strengths for root cause determination where we will likely need only a small number of samples with significantly higher entropy.
Do you expect buyers to want their own monitoring, independent from permitting sellers to see their failures?
(Also note that sampling, the way the debugging API does it, probably makes it easier to do things in real time, privacy-wise.)
An addition to the types of events we would like to monitor:
5. crashes that occur in buyer reporting worklets
While not real time, this "dashboard" is linked somewhere from DevRel docs: https://pscs.glitch.me/ Could all Privacy Sandbox-related UMA/UKM metrics that Chrome captures be accessible for all Sandbox stakeholders/collaborators?
The Privacy Sandbox team is looking into this issue and seeks feedback from adtechs on the following:
We look forward to feedback from buyers and sellers. Please let us know if you have any questions.
It's mentioned indirectly above, but can we simply borrow the logic of the `forDebuggingOnly`-style endpoints -- i.e. `ReportAdAuctionTimeout` and `ReportAdAuctionError` -- and that way there is less friction to adoption of these key events, while you work out a more robust, long-term solution? IMHO we need access to this information today, as event-level reporting, without having to look towards the aggregated/noised APIs in the interim.
@rdgordon-index, `forDebuggingOnly` is available and can be used to report many of the events that the API being discussed here can. `forDebuggingOnly` however addresses substantially different reporting needs: `forDebuggingOnly` is meant for root-cause analysis which requires sending much more information (e.g. a stack trace or internal state) and permits no noising, versus this API which is meant for real-time monitoring where the goal is to quickly detect a problem which requires only sending which bucket the failure is in but no other information and permits high levels of noising. These significant differences in use cases and information sent dictate significantly different privacy restrictions: `forDebuggingOnly` downsampling has long cooldown and lockout periods while the API being discussed here might potentially be used in most auctions.

`forDebuggingOnly` is useful in cases where you want to report back highly valuable information that can be used to debug critical issues (e.g. a bidding script producing aberrant bids). The Real Time Monitoring API can be used for monitoring the health of on-device activities where some background level of error reports is expected (e.g. auction timeouts caused by devices in bad/overloaded states).
@rushilw-google An additional API for real-time monitoring sounds useful to us at RTB House. We are evaluating the questions you posted, but in the meantime I wanted to ask: what would be the ETA for the new API to be available? Whether this API is available before the 3PC phase-out is a crucial factor in prioritizing on our end.
versus this API which is meant for real-time monitoring where the goal is to quickly detect a problem which requires only sending which bucket the failure is in but no other information and permits high levels of noising.
Understood @JensenPaul -- but to @jonasz's point, my concern is regarding the timeline; a net-new API seems further out than potentially re-purposing the existing debugging framework, which is not being downsampled pre-3PCD.
What is the desired SLA for issue detection (expressed in O mins)? By SLA, we mean the time between (a) an issue starting to occur on the client (browser) and (b) the adtech being able to detect the issue.
Low single-digit minutes -- we're extremely sensitive to any potential interruptions.
What are the critical metrics that require detection within the desired SLA window and can not be detected with event-level win reporting?
Anything that prevents `scoreAd` and `reportResult` (and `sendReportTo`) from completing correctly -- exceptions and timeouts, for example -- that are invisible to the existing event-level reporting.
Thanks @jonasz and @rdgordon-index for your comments. We recognize the tight timeline to 3PC phase-out and we are working on publishing an explainer as soon as possible, where we’ll publish the timeline after due consideration. To reduce the effort at the time of adoption, this API will likely share similar ergonomics to those of Private Aggregation API, with some differences that allow the SLA to be near real time. We encourage adtechs to continue sharing requirements at this stage as inputs to the design.
I don't think it should be limited to crashes, after all, anyone can have logic errors and those could be even more expensive than crashes.
Hello @rushilw-google ,
Thank you for drafting the explainer. It looks very promising!
We wanted to raise a certain doubt regarding opt-in: in our understanding, opt-in should be independent of the auction configuration (provided by the seller). A much more convenient solution would be if the buyer could indicate that they want to receive reports - this could be done at the time of attestation or through a separate HTTP request.
Thanks, Michal
Hi @michal-kalisz, thanks for your note! Regarding opt-in, we found it helpful to consider four aspects of the options for the opt-in mechanism: privacy implications, performance implications, build complexity/time, and adtech preference beyond these reasons.
With that lens, sharing our thoughts on the options below:
1) Opt-in through buyer-seller coordination in auctionConfig: This is the proposed option and has privacy characteristics similar to trusted signals fetches, is quickest to implement, and has minimal performance implications. We recognize that buyer-seller coordination would be required for this option but consider this feasible as some existing features already necessitate coordination, as in the case of `perBuyerSignals`.
2) Opt-in through attestation: This option is under evaluation for feasibility as attestations were not originally designed for this purpose. We are also considering the performance implications and build complexity. We will post an update once this evaluation is completed.
3) Opt-in through HTTP request: On the performance side, added cost and latency are the primary concerns with creating a new HTTP request for opt-in declaration. To mitigate these concerns, adtechs could include opt-in declaration in the HTTP response header of the bidding or scoring script fetch requests, but if there are errors in fetching the script or downstream from that, Real Time Monitoring would not be able to send any reports for those errors. This does not seem preferable for adtechs.
We plan to launch Real Time Monitoring with the buyer-seller coordination mechanism while continuing the discussion on other mechanisms, if you have any further thoughts on those.
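To illustrate the buyer-seller coordination mechanism, here is a sketch of what the opt-in could look like in the auction config, based on our reading of the explainer; the exact field names and shape are assumptions and may differ in the shipped API.

```javascript
// Sketch only: field names (sellerRealTimeReportingConfig,
// perBuyerRealTimeReportingConfig, 'default-local-reporting') are
// assumptions based on the Real Time Monitoring explainer.
const auctionConfig = {
  seller: 'https://seller.example',
  decisionLogicUrl: 'https://seller.example/decision-logic.js',
  interestGroupBuyers: ['https://buyer.example'],
  // The seller opts itself in to real-time reports:
  sellerRealTimeReportingConfig: { type: 'default-local-reporting' },
  // The seller opts in each coordinating buyer, keyed by buyer origin:
  perBuyerRealTimeReportingConfig: {
    'https://buyer.example': { type: 'default-local-reporting' },
  },
};
// In a page context this object would be passed to
// navigator.runAdAuction(auctionConfig).
```

This is why buyer-seller coordination is required: the buyer's opt-in only takes effect if the seller includes it in the config it passes to the auction.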
Much of the proposal has been implemented in Chrome and can be tested in Chrome Canary today using the `--enable-features=FledgeRealTimeReporting` command line flag.
Hello,
If much of the proposal has been implemented, does it mean that if real-time reports are emitted, we should be able to receive them on `.well-known/interest-group/real-time-report` already? If so, how are they serialized? It is not explicit yet in the explainer.
The serialization format of the report is TBD
Has the format been decided yet? Is it a raw vector of 1024 bytes with value 0 or 1, or is it more elaborate?
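Since the serialization was still TBD at this point in the thread, the following is a hypothetical illustration only: it shows the difference between the two formats asked about, i.e. one byte per bucket (1024 bytes of 0/1 values) versus a packed bit vector (128 bytes for 1024 one-bit buckets). It is not the actual wire format.

```javascript
// Hypothetical only: the report serialization was TBD at this point.
// Packs an array of 0/1 bucket values into a byte array, LSB-first
// within each byte, and unpacks it back.
function packBits(bits) {
  const out = new Uint8Array(Math.ceil(bits.length / 8));
  bits.forEach((b, i) => {
    if (b) out[i >> 3] |= 1 << (i & 7);
  });
  return out;
}

function unpackBits(bytes, n) {
  const bits = [];
  for (let i = 0; i < n; i++) {
    bits.push((bytes[i >> 3] >> (i & 7)) & 1);
  }
  return bits;
}
```

Packed this way, a 1024-bucket histogram of one-bit values fits in 128 bytes instead of 1024.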
The Real Time Monitoring API should now be available for testing in 50% of Chrome Canary and Dev channel traffic.
we should be able to receive them on the .well-known/interest-group/real-time-report already?
Yes, if the API is enabled via the command line flag or as part of testing on pre-stable Chrome traffic, and your origin is opted in via the auction config.
If so, how are they serialized?
Running Canary with the cmdline flag, I confirm the POST requests are sent:

```
t=136077 [st=170] HTTP_TRANSACTION_SEND_REQUEST_HEADERS
  --> POST /.well-known/interest-group/real-time-report HTTP/1.1
  ....
  User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36
  ....
```
If you are also interested in receiving low entropy UACH so that you have the option to differentiate Mobile, &c. real-time events, note your interest here.
The Real Time Monitoring API should now be available for testing in 50% Beta channel traffic.
Hello @qingxinwu. The `FledgeRealTimeReporting` cmdline flag is (for me) no longer enabling the feature in Canary.
The Real Time Monitoring API should now be available for testing in 1% Stable channel traffic.
@JensenPaul, @qingxinwu I figured out why the `FledgeRealTimeReporting` cmdline flag would not enable the feature; my setup was also configuring Chrome into one of the facilitated testing labels.
Thanks for the update @JensenPaul !
Just to clarify, is the `FledgeRealTimeReporting` flag necessary to maybe be part of the 1% traffic on which the feature is enabled? Or is the 1% stable traffic independent of the flag? Or does having the flag on mean a report will be sent to the buyer for 100% of the auctions for which the buyer opted in to the API seller-side?
No, the flag is not needed to be part of the 1% traffic. But if your browser is not in the group of clients that have this feature enabled (e.g., the 1% stable), you can launch your browser with the flag to manually enable the feature.
Also see the real time reporting explainer about which reports are sent when the feature is enabled.
Hi @qingxinwu,
From this comment, I understand the API is enabled on non-labeled traffic, is that the case? Was this in any documentation?
I also notice in my browser that when I force a label in my browser, no report is sent, while they do get sent when I don't force a label.
I am asking because, we have everything in place on our side to receive reports, we have a couple of SSPs configured to send us reports, yet we still don't receive any pings. If the API is only enabled on non-labeled traffic, this could be a problem for us, as we can only participate in Protected Audience auctions on labeled traffic.
Hi @ccharnay67, yes, you are correct that this Real Time Monitoring feature was not enabled in Mode A and Mode B facilitated testing labeled traffic.
If the API is only enabled on non-labeled traffic, this could be a problem for us, as we can only participate in Protected Audience auctions on labeled traffic.
@ajvelasquez-privacy-sandbox, it is my understanding that we are beyond the CMA-aligned/mandated market testing period. Is the pattern of new PA features being excluded from labeled instances a CMA mandate?
See also `deprecatedRenderURLReplacements`, which has a similar holdback on labeled traffic: https://github.com/WICG/turtledove/issues/286#issuecomment-2120319985
Thank you for your answers. Would you consider enabling the feature for mode A or B, or maybe even just on one label?
The comment you mention @JacobGo states that the other feature was held back on labelled traffic "to avoid disrupting ongoing experiments and testing". But this was in May, close to the CMA test, and we believe this argument may not be as strong now.
Furthermore, one of the goals of the Real-Time Monitoring API is to be useful when doing experiments and testing, it should allow us to detect issues in such settings. So, in our opinion, it makes a lot of sense to enable it on labelled traffic.
Thank you for bringing this to our attention. We've been considering this and will post an update when we've made a decision.
The Real Time Monitoring API should now be enabled in 100% Stable channel traffic, starting from 129.0.6668.77
The Real Time Monitoring API should now be enabled in 100% Stable channel traffic
Hi @qingxinwu, could you please clarify whether it is still limited to non-labeled traffic? We are still not getting any reports on our end.
Yes, it's still limited to non-labeled traffic at this moment. We'll post an update when a decision about labeled traffic is made.
Hello, we wanted to give an update to the ecosystem on how we plan to prepare the Real-Time Monitoring API for an environment in which some users choose to allow 3PCs and other users do not, following our July 2024 announcement of a new path for Privacy Sandbox. For traffic in which 3PCs are allowed, there is no additional privacy risk from the browser sending un-noised Real Time Monitoring reports compared with 3PC-based debugging. This means that for users who have allowed 3PCs, it is possible for the Real Time Monitoring API to remain un-noised and so provide additional precision without compromising privacy.
Therefore we propose the following changes using what is already published in our Real-Time Monitoring API explainer as the starting baseline:
When the user chooses to allow 3PCs on the current impression site, we will not proceed with the noising algorithm.
In order for adtechs to understand whether a particular contribution is noised or un-noised, we would need to add a flag to each Real Time Monitoring report indicating whether the report is noised. We see adding a floating point field called something like “flipProbability” as a way to do this. If the contribution is not noised, the value would be 0. If the contribution is noised, we would output the actual probability, which with our current config of epsilon = 1 would be around 0.378.
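For intuition, the quoted ~0.378 is consistent with a randomized-response flip probability of 1/(1 + e^(ε/2)). Assuming that formula (an assumption inferred from the quoted figure, not taken from the explainer), the flip probability and an unbiased de-noised estimate can be computed as follows:

```javascript
// Assumption: a randomized-response scheme where each bucket bit is
// flipped with probability p = 1 / (1 + e^(epsilon / 2)). This formula
// is inferred from the ~0.378 figure quoted above (epsilon = 1 gives
// p ≈ 0.3775); the explainer defines the authoritative scheme.
function flipProbability(epsilon) {
  return 1 / (1 + Math.exp(epsilon / 2));
}

// De-noising: with `totalReports` reports and `observed` set bits in a
// bucket, an unbiased estimate of the true count is
// (observed - totalReports * p) / (1 - 2 * p).
function estimateTrueCount(observed, totalReports, epsilon) {
  const p = flipProbability(epsilon);
  return (observed - totalReports * p) / (1 - 2 * p);
}
```

Under this scheme, a report carrying `flipProbability = 0` could be counted as-is, while noised reports would be aggregated and de-noised as above.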
We welcome the ecosystem comments on this proposal!
`flipProbability` sounds helpful.
My feeling is that it would actually be simpler (from our perspective) to always noise, since that would give us one less possibility to deal with. The noise shouldn't be particularly bad in any case.
Hello @ajvelasquez-privacy-sandbox,
Thank you for this proposal, we would be happy with it. Based on our current observations of Real Time Monitoring, the noise makes it difficult to observe errors which are present at low levels, even after de-noising. The `flipProbability` field makes it easy to distinguish both cases, and the un-noised reports would allow us to have some sanity checks, or at least a baseline to look at in addition to the noised reports.
A couple more comments:
- Given this is not the only proposal around deactivating noise for users who choose to allow 3PCs, would you consider deactivating it more extensively for such users, e.g. for `reportWin` fields subject to a noising scheme (`joinCount`, `recency`, `modelingSignals`)?
- More specifically, for Real Time Monitoring, we understand that if a browser is trying to send multiple contributions for a given auction to a given buyer, one would still be randomly picked to be sent. Could you consider sending all contributions for users who opt in to 3PCs? Or at least one per interest group?
Again, thank you for the proposal, it sounds very useful to us.
Client side ad auctions described in FLEDGE create new challenges for detecting failures in real time and reacting to them in order to avoid expensive outages. The main factor contributing to these new challenges is decreased visibility of client side failures. To improve visibility of client side failures we propose the following extension of FLEDGE whose purpose is to support effective monitoring of client side auctions.
Events to be monitored
There are several types of events that we would like to monitor. These are:
1. crashes that occur in seller auction worklets
2. crashes that occur in buyer bidding worklets
3. crashes that occur in Fenced Frames as they run rendering JS code
4. failures on URL fetches, including timeouts
For the crashes, i.e. events (1)-(3), we would like Chrome to send the following types of data:
- the type of the event
- the worklet code version
- the stack trace
- the Chrome version

For the failures on URL fetches we would like Chrome to send the same data as above, plus the URL causing the failure.
Registering monitoring URLs
Reporting URLs will be provided in the auction config:
- The seller's reporting URL will be provided in the `sellerErrorReportingUrl` param.
- Buyers' reporting URLs will be provided in the `perBuyerErrorReportingUrl` param.

Chrome will report:
1. crashes in seller auction worklets to `sellerErrorReportingUrl`.
2. crashes in buyer bidding worklets to `perBuyerErrorReportingUrl` for the appropriate buyer.
3. crashes in Fenced Frames to `perBuyerErrorReportingUrl` for the appropriate buyer.
4. failures on fetches of:
   a. `biddingLogicUrl`, `dailyUpdateUrl`, `trustedBiddingSignalsUrl`, `renderUrl` to `perBuyerErrorReportingUrl` for the appropriate buyer.
   b. `trustedScoringSignalsUrl` to `sellerErrorReportingUrl`.
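For illustration, the proposed params might appear in an auction config roughly as follows. This is a sketch only: the proposal does not specify the exact shape, and representing `perBuyerErrorReportingUrl` as a map keyed by buyer origin is an assumption modeled on `perBuyerSignals`.

```javascript
// Sketch of the proposed (not shipped) error reporting params in an
// auction config. Origins and paths are invented; the per-buyer map
// shape is an assumption modeled on perBuyerSignals.
const auctionConfig = {
  seller: 'https://seller.example',
  decisionLogicUrl: 'https://seller.example/score-ad.js',
  interestGroupBuyers: ['https://buyer.example'],
  // Seller's crash/fetch-failure reports would go here:
  sellerErrorReportingUrl: 'https://seller.example/error-report',
  // Each buyer's reports would go to its own URL:
  perBuyerErrorReportingUrl: {
    'https://buyer.example': 'https://buyer.example/error-report',
  },
};
```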
Sending monitoring notifications
Upon each event, i.e. a crash in the seller or buyer worklets, or a crash in Fenced Frames, or a timeout, Chrome will send a notification to the registered URL, with the following URL params:
- `eventType` containing the type of the event. For sellers, `eventType` will be omitted. For buyers, three values are possible: `'bidding'`, `'fencedframes'`, or `'timeout'`.
- `workletVersion` containing the worklet code version.
- `stackTrace` containing the stack trace.
- `chromeVersion` containing the Chrome version.

In case of a failure on a URL fetch, an additional param containing the URL causing the failure will be added:
- `fetchFailureURL` containing the timed-out URL.
containing the timed out URL.The notifications will look as follows. For the crash in the seller worklet
For the crash in the buyer worklet
For the crash in the fenced frames
For the failure on URL fetch
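The original example notifications did not survive in this copy of the post. The following hypothetical URLs (origins, versions, and param values are all invented) illustrate what notifications consistent with the params above could look like, one per case:

```
https://seller.example/error-report?workletVersion=123&stackTrace=...&chromeVersion=101.0.0.0
https://buyer.example/error-report?eventType=bidding&workletVersion=456&stackTrace=...&chromeVersion=101.0.0.0
https://buyer.example/error-report?eventType=fencedframes&workletVersion=456&stackTrace=...&chromeVersion=101.0.0.0
https://buyer.example/error-report?eventType=timeout&fetchFailureURL=https%3A%2F%2Fbuyer.example%2Fbid.js&chromeVersion=101.0.0.0
```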