WICG / turtledove

TURTLEDOVE
https://wicg.github.io/turtledove/

Real Time Monitoring API for FLEDGE #430

Open abrik0131 opened 1 year ago

abrik0131 commented 1 year ago

Client side ad auctions described in FLEDGE create new challenges for detecting failures in real time and reacting to them in order to avoid expensive outages. The main factor contributing to these new challenges is decreased visibility of client side failures. To improve visibility of client side failures we propose the following extension of FLEDGE whose purpose is to support effective monitoring of client side auctions.

Events to be monitored

There are several types of events that we would like to monitor. These are:

  1. crashes that occur in seller auction worklets
  2. crashes that occur in buyer bidding worklets
  3. crashes that occur in Fenced Frames as they run rendering JS code
  4. Failures on URL fetches, including timeouts

For the crashes, i.e. events (1)-(3), we would like Chrome to send the following types of data:

  1. worklet code version
  2. stack trace of the crash
  3. chrome version

For the failures on URL fetches we would like Chrome to send the following types of data:

  1. worklet code version
  2. chrome version
  3. URL causing the failure
  4. Error code that can distinguish between timeouts and other types of failures

Registering monitoring URLs

Reporting URLs will be provided in the auction config.

  1. Seller monitoring URL via sellerErrorReportingUrl param
  2. Buyer monitoring URL via perBuyerErrorReportingUrl param
const myAuctionConfig = {
  'sellerErrorReportingUrl': 'https://www.seller.com/monitoring',
  'perBuyerErrorReportingUrl': {
    'https://www.example-dsp.com/': 'https://www.example-dsp.com/monitoring',
    'https://www.another-buyer.com/': 'https://www.another-buyer.com/monitoring',
    ...
  },
  …
};
const auctionResultPromise = navigator.runAdAuction(myAuctionConfig);
  1. Info about crashes in the seller auction worklets will be reported to sellerErrorReportingUrl.
  2. Info about crashes in the buyer auction worklets will be reported to perBuyerErrorReportingUrl for the appropriate buyer.
  3. Info about crashes in the fenced frames will be reported to perBuyerErrorReportingUrl for the appropriate buyer.
  4. Info about failures on URL fetches will be reported as follows:
     a. biddingLogicUrl, dailyUpdateUrl, trustedBiddingSignalsUrl, renderUrl to perBuyerErrorReportingUrl for the appropriate buyer.
     b. trustedScoringSignalsUrl to sellerErrorReportingUrl.
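
The routing rules above can be sketched as a small lookup. This is only an illustration of the proposal, not an implementation: `pickReportingUrl`, the `event` object shape, and the event-type strings are hypothetical names, and the config shape simply follows the example in this post.

```javascript
// Hypothetical helper: choose the monitoring URL for an event, following the
// routing rules listed above. All names here are illustrative assumptions.
const BUYER_FETCH_FIELDS = new Set([
  'biddingLogicUrl', 'dailyUpdateUrl', 'trustedBiddingSignalsUrl', 'renderUrl',
]);

function pickReportingUrl(auctionConfig, event) {
  const buyerUrls = auctionConfig.perBuyerErrorReportingUrl || {};
  switch (event.type) {
    case 'sellerWorkletCrash':
      return auctionConfig.sellerErrorReportingUrl;
    case 'buyerWorkletCrash':
    case 'fencedFrameCrash':
      return buyerUrls[event.buyer];
    case 'fetchFailure':
      // trustedScoringSignalsUrl failures go to the seller, the rest to the buyer.
      if (event.field === 'trustedScoringSignalsUrl') {
        return auctionConfig.sellerErrorReportingUrl;
      }
      return BUYER_FETCH_FIELDS.has(event.field) ? buyerUrls[event.buyer] : undefined;
    default:
      return undefined;
  }
}

const exampleConfig = {
  sellerErrorReportingUrl: 'https://www.seller.com/monitoring',
  perBuyerErrorReportingUrl: {
    'https://www.example-dsp.com/': 'https://www.example-dsp.com/monitoring',
  },
};

const fencedFrameTarget = pickReportingUrl(exampleConfig, {
  type: 'fencedFrameCrash',
  buyer: 'https://www.example-dsp.com/',
});
```

Events for a buyer with no registered URL resolve to `undefined` in this sketch, which a caller would treat as "send nothing".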

Sending monitoring notifications

Upon each event, i.e. a crash in the seller or buyer worklets, a crash in Fenced Frames, or a failed URL fetch (including a timeout), Chrome will send a notification to the registered URL, with the following URL params:

  1. eventType containing the type of the event. For sellers eventType will be omitted. For buyers, three values are possible: 'bidding', 'fencedframes', or 'fetchfailure'.
  2. workletVersion containing the worklet code version.
  3. stackTrace containing stack trace.
  4. chromeVersion containing Chrome version.

In case of a failure on a URL fetch, two additional params will be added: fetchFailureURL, containing the URL whose fetch failed, and errorCode, containing an error code that distinguishes timeouts from other types of failures.

The notifications will look as follows. For the crash in the seller worklet

https://www.seller.com/monitoring?workletVersion=<worklet_version>&stackTrace=<stack_trace>&chromeVersion=<chrome_version>

For the crash in the buyer worklet

https://www.example-dsp.com/monitoring?eventType='bidding'&workletVersion=<worklet_version>&stackTrace=<stack_trace>&chromeVersion=<chrome_version>

For the crash in the fenced frames

https://www.example-dsp.com/monitoring?eventType='fencedframes'&workletVersion=<worklet_version>&stackTrace=<stack_trace>&chromeVersion=<chrome_version>

For the failure on URL fetch

https://www.example-dsp.com/monitoring?eventType='fetchfailure'&workletVersion=<worklet_version>&chromeVersion=<chrome_version>&fetchFailureURL=<URL>&errorCode=<error>
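
As a rough sketch of how such a notification URL might be assembled, assuming standard query-string encoding (a `?` before the first parameter and percent-encoded values): `buildNotificationUrl` is a hypothetical helper, and the parameter values used here (`v42`, `timeout`, the script URL) are placeholders rather than anything defined by the proposal.

```javascript
// Hypothetical sketch: assemble a notification URL from the params proposed
// above. Uses the standard URL API, which percent-encodes each value.
function buildNotificationUrl(baseUrl, params) {
  const url = new URL(baseUrl);
  for (const [name, value] of Object.entries(params)) {
    // Skip params that do not apply to this event (e.g. no stackTrace).
    if (value !== undefined) url.searchParams.set(name, value);
  }
  return url.toString();
}

// Placeholder values for illustration only.
const fetchFailureNotification = buildNotificationUrl(
  'https://www.example-dsp.com/monitoring',
  {
    eventType: 'fetchfailure',
    workletVersion: 'v42',
    chromeVersion: '128.0.0.0',
    fetchFailureURL: 'https://www.example-dsp.com/bidding.js',
    errorCode: 'timeout',
  },
);
```

Encoding matters most for stackTrace and fetchFailureURL, whose raw values would otherwise contain characters such as `&` or `/` that corrupt the query string.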
michaelkleber commented 1 year ago

Hi Alex, I'd like to pull apart the different types of events you'd like to monitor. Taking them out of order, if you don't mind:

  3. crashes that occur in Fenced Frames as they run rendering JS code

For Fenced Frames that have network access, nothing new is needed for this, right? It's just like everything in the browser today?

  4. Failures on URL fetches, including timeouts

In previous FLEDGE calls we had discussed this kind of reporting using https://github.com/WICG/turtledove/blob/main/FLEDGE_extended_PA_reporting.md. Are you saying that aggregated reports like this are not good enough?

  1. crashes that occur in seller auction worklets
  2. crashes that occur in buyer bidding worklets

These are the tricky ones: if they were to include stack traces, then the worklets could use the error reporting mechanism to smuggle out all the information that we want to avoid leaking, including e.g. joined publisher-site and advertiser-site identities. Have you thought about ways this could be made privacy-safe?

abrik0131 commented 1 year ago

In previous FLEDGE calls we had discussed this kind of reporting using https://github.com/WICG/turtledove/blob/main/FLEDGE_extended_PA_reporting.md. Are you saying that aggregated reports like this are not good enough?

Aggregated reports could be used in principle. However, we are concerned that the aggregated reports will not satisfy our delay requirements of approximately 5 minutes. SRE would need access to the monitoring time series data in real-time, with a maximal delay of approximately 5 minutes for critical metrics during emergency response. If the delay is too large, this would increase the incident detection time and we may incur significant revenue loss by the time we detect that there is an incident. In addition, a larger delay in the monitoring data would make it harder for us to verify whether a mitigation is effective and would delay the time required to mitigate outages.

These are the tricky ones: if they were to include stack traces, then the worklets could use the error reporting mechanism to smuggle out all the information that we want to avoid leaking, including e.g. joined publisher-site and advertiser-site identities. Have you thought about ways this could be made privacy-safe?

I understand the privacy concern. Could stack traces be available temporarily? If not, could we get some information about a crash type?

ajvelasquezgoog commented 1 year ago

@abrik0131 in theory, if the aggregated report was available immediately with a data freshness that's within that 5 minute delay SLA, does that satisfy the requirement? Are you asking us to only solve for the delay and not the granularity of the data?

abrik0131 commented 1 year ago

if the aggregated report was available immediately with a data freshness that's within that 5 minute delay SLA, does that satisfy the requirement?

Yes, that satisfies the requirement.

Are you asking us to only solve for the delay and not the granularity of the data?

Yes, we are interested in the total number of errors/reports. The main issue is getting the data quickly enough.

ajvelasquezgoog commented 1 year ago

@abrik0131 is it possible to understand a bit more about the impact of having this data not at 5 minutes, but, let's say, at 120 minutes (2 hours), or at any intermediate interval (15 minutes, 30 minutes, 60 minutes) that you would find acceptable?

abrik0131 commented 1 year ago

The error reporting API is intended to quickly detect the most severe types of outages, where in the worst-case 100% of remarketing queries would crash or return errors. This would both impact our revenue and impact publishers which would not receive remarketing ad revenue for the duration of the outage.

The longer the delay is, the longer it would take for us to apply mitigations or press red buttons which could mitigate the outage impact. If there was say a 1 hour delay, then in the worst-case 100% of remarketing queries would fail for an extra hour before we could detect the outage.

The monitoring delay would also impact the time to mitigate an outage in the case where pressing our red buttons fails to mitigate the outage. In this case, our debugging workflow involves making changes to the serving configuration and then waiting for a signal on whether the change was effective or not, and a longer monitoring delay would increase the time required for each iteration. In this case, an extra 1 hour of monitoring delay would correspond to several hours of outage impact. For example, if we have 4 emergency red buttons (config changes, rollbacks, etc.) to press, we would need to press one button at a time and wait 1 hour to get a signal to see if the mitigation attempt was successful. This could bring the time to mitigation to 5 hours after detection or possibly more if other mitigation mechanisms are required.
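
The arithmetic in this example can be written out explicitly. The sketch below assumes a simple model in which detecting the outage costs one monitoring delay and each mitigation attempt costs one more; under that assumption, four attempts with a 1 hour delay give the 5 hours quoted above. The function name is hypothetical.

```javascript
// Worst-case outage time attributable to monitoring delay, assuming one delay
// period to detect the outage plus one per mitigation attempt (an assumption
// of this sketch, not a statement from the proposal).
function worstCaseOutageMinutes(monitoringDelayMinutes, mitigationAttempts) {
  return monitoringDelayMinutes * (1 + mitigationAttempts);
}

const withOneHourDelay = worstCaseOutageMinutes(60, 4);   // 300 minutes, i.e. 5 hours
const withFiveMinuteDelay = worstCaseOutageMinutes(5, 4); // 25 minutes
```

Under the same model, a 5 minute delay keeps the worst case for four attempts under half an hour.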

As to whether a longer delay is acceptable, this would depend on what fallback mechanisms we have available in FLEDGE in the case of crashes or errors. For example, if there was already a fallback in the browser to show the contextual ad, this would reduce the impact compared to the browser showing no ad in the case of a crash.

In general, we would strongly prefer the reporting delay to be 5 minutes or less.

ajvelasquezgoog commented 1 year ago

@abrik0131 another question, can you please help us understand, from the above summary which events and data requested are:

  1. Available today via the forDebuggingOnly function? For details, see the explainer section on this loss reporting function here: https://github.com/WICG/turtledove/blob/main/Proposed_First_FLEDGE_OT_Details.md#reporting AND
  2. NOT available via the Extended Private Aggregation reporting functions? https://github.com/WICG/turtledove/blob/main/FLEDGE_extended_PA_reporting.md

We suspect the answer is "none" but it's important to clarify, as forDebuggingOnly needs to be retired by 3PCD

JensenPaul commented 11 months ago

To monitor bidding and scoring JavaScript crashes, is it sufficient for the JavaScript to catch the exceptions and report them via a sampled reporting API like this?

abrik0131 commented 10 months ago

We wanted to repost with our updated understanding of our requirements since the original post is now over nine months old and a lot has changed including the introduction of a proposal for sampled debug reporting.

Real-time monitoring is intended for speedy detection of issues from histograms and time series where we largely don’t care about individual data points in isolation but where we do care about sudden changes in aggregated metrics across all auction invocations (e.g. increased latency or crash rate or bid anomalies to name a few). We would like to emphasize that it’s not limited to JavaScript and worklet crashes, which was the focus of our previous discussions.

Our requirements are:

  1. To be able to build histograms and time series for monitoring and alerting
  2. In near real-time: No more than O(single digit) minutes of delay
  3. On all invocations of the PA auction

Something roughly shaped like the private aggregation API gives us most of our mileage should we be able to guarantee O(minutes) delay for conceivably a very large stream of events.

We believe the requirements are complementary to those of Issue 162’s proposal for sampled debug reporting since for monitoring we want full browser/auction coverage with lower entropy and without locking out browsers. This is in contrast to the new debug sampling proposal’s strengths for root cause determination where we will likely need only a small number of samples with significantly higher entropy.

morlovich commented 9 months ago

Do you expect buyers to want their own monitoring, independent from permitting sellers to see their failures?

(Also note that the sampling, the way the debugging API does, probably makes it easier to do things realtime, privacy-wise).

abrik0131 commented 9 months ago

An addition to the types of events we would like to monitor:

  5. crashes that occur in buyer reporting worklets
  6. crashes that occur in seller reporting worklets
dmdabbs commented 9 months ago

👍 5. crashes that occur in buyer reporting worklets

While not real time, this "dashboard" is linked somewhere from DevRel docs: https://pscs.glitch.me/ Could all Privacy Sandbox-related UMA/UKM metrics that Chrome captures be accessible for all Sandbox stakeholders/collaborators?

rushilw-google commented 7 months ago

The Privacy Sandbox team is looking into this issue and seeks feedback from adtechs on the following:

  1. What is the desired SLA for issue detection (expressed in O mins)? By SLA, we mean the time between (a) an issue starting to occur on the client (browser) and (b) the adtech being able to detect the issue.
  2. What are the critical metrics that require detection within the desired SLA window and can not be detected with event-level win reporting?
  3. What type of dimensions would be required along with the metrics? E.g. creative type
  4. How many buckets would be monitored? A bucket here is defined as the adtech-denoted combination of dimensions that gets reported. e.g. bucket X = crash in generateBid + banner.
  5. What is the desired change in metric (sensitivity) that should be detected in the given bucket?
  6. How many auctions and bids occur in the desired SLA window for your adtech? A directional number would be sufficient.
  7. What would be the tolerance for missed/delayed/false alerts?
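
To make the notion of a bucket in question 4 concrete, here is one possible way an adtech could map a combination of dimensions to a bucket index. Everything in it is a hypothetical choice for illustration: the dimension values, the `bucketFor` helper, and the row-major numbering.

```javascript
// Hypothetical bucket layout: one bucket per (event, creative type) pair.
const EVENTS = ['generateBidCrash', 'generateBidTimeout', 'scriptFetchFailure'];
const CREATIVE_TYPES = ['banner', 'video', 'native'];

function bucketFor(event, creativeType) {
  const e = EVENTS.indexOf(event);
  const c = CREATIVE_TYPES.indexOf(creativeType);
  if (e < 0 || c < 0) {
    throw new RangeError(`unknown dimension: ${event}/${creativeType}`);
  }
  // Row-major index: all creative types for event 0, then event 1, and so on.
  return e * CREATIVE_TYPES.length + c;
}

const bucketX = bucketFor('generateBidCrash', 'banner');   // 0
const bucketCount = EVENTS.length * CREATIVE_TYPES.length; // 9
```

The bucket count is the product of the dimension sizes, so every added dimension multiplies the number of buckets to monitor.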

We look forward to feedback from buyers and sellers. Please let us know if you have any questions.

rdgordon-index commented 5 months ago

It's mentioned indirectly above, but can we simply borrow the logic of the forDebuggingOnly-style endpoints -- i.e. ReportAdAuctionTimeout and ReportAdAuctionError -- and that way there is less friction to adoption of these key events, while you work out a more robust, long-term solution? IMHO we need access to this information today, as event level reporting, without having to look towards the aggregated/noised APIs in the interim.

JensenPaul commented 5 months ago

@rdgordon-index, forDebuggingOnly is available and can be used to report many of the events that the API being discussed here can. forDebuggingOnly however addresses substantially different reporting needs: forDebuggingOnly is meant for root-cause-analysis which requires sending much more information (e.g. a stack trace or internal state) and permits no noising, versus this API which is meant for real-time monitoring where the goal is to quickly detect a problem which requires only sending which bucket the failure is in but no other information and permits high levels of noising. These significant differences in use cases and information sent dictate significantly different privacy restrictions: forDebuggingOnly downsampling has long cooldown and lockout periods while the API being discussed here might potentially be used in most auctions.

forDebuggingOnly is useful in cases where you want to report back highly valuable information that can be used to debug critical issues (e.g. a bidding script producing aberrant bids). The Real Time Monitoring API can be used for monitoring the health of on-device activities where some background level of error reports is expected (e.g. auction timeouts caused by devices in bad/overloaded states).
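
To illustrate why a heavily noised, bucket-only signal can still support monitoring, here is a generic randomized-response sketch. It is not a description of Chrome's mechanism: the flip probability, the function names, and the single-bit report are all assumptions made for the example.

```javascript
// Generic randomized-response sketch; NOT a description of Chrome's mechanism.
// Each browser reports one bit for a bucket and flips it with probability
// flipProb before sending.
function noisyBit(trueBit, flipProb, random = Math.random) {
  return random() < flipProb ? 1 - trueBit : trueBit;
}

// Given totalReports reports of which observedOnes had the bit set, estimate
// how many browsers truly hit the bucket, using
//   E[observedOnes] = trueCount * (1 - 2 * flipProb) + totalReports * flipProb.
function estimateTrueCount(observedOnes, totalReports, flipProb) {
  if (flipProb >= 0.5) throw new RangeError('flipProb must be below 0.5');
  return (observedOnes - totalReports * flipProb) / (1 - 2 * flipProb);
}

const estimate = estimateTrueCount(300, 1000, 0.25); // 100
```

No individual report is trustworthy, but the aggregate estimate sharpens as the number of reports grows, which fits a use case that only watches for sudden changes across many auctions.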

jonasz commented 5 months ago

@rushilw-google An additional API for real time monitoring sounds useful to us at RTB House. We are evaluating the questions you posted, but in the meantime I wanted to ask: what would be an ETA for the new api to be available? Whether this API is available before the 3pc phase-out is a crucial factor in prioritizing on our end.

rdgordon-index commented 5 months ago

versus this API which is meant for real-time monitoring where the goal is to quickly detect a problem which requires only sending which bucket the failure is in but no other information and permits high levels of noising.

Understood @JensenPaul -- but to @jonasz's point, my concern is regarding the timeline; a net-new API seems further out than potentially re-purposing the existing debugging framework, which is not being downsampled pre-3PCD.

rdgordon-index commented 5 months ago

What is the desired SLA for issue detection (expressed in O mins)? By SLA, we mean the time between (a) an issue starting to occur on the client (browser) and (b) the adtech being able to detect the issue.

Low single-digit minutes -- we're extremely sensitive to the impact of any potential interruption.

What are the critical metrics that require detection within the desired SLA window and can not be detected with event-level win reporting?

Anything that prevents scoreAd and reportResult (and sendReportTo) from completing correctly -- exceptions and timeouts, for example -- that are invisible to the existing event-level reporting.

rushilw-google commented 5 months ago

Thanks @jonasz and @rdgordon-index for your comments. We recognize the tight timeline to 3PC phase-out and we are working on publishing an explainer as soon as possible, where we’ll publish the timeline after due consideration. To reduce the effort at the time of adoption, this API will likely share similar ergonomics to those of Private Aggregation API, with some differences that allow the SLA to be near real time. We encourage adtechs to continue sharing requirements at this stage as inputs to the design.

raz-adroll commented 4 months ago

I don't think it should be limited to crashes, after all, anyone can have logic errors and those could be even more expensive than crashes.

ankwok commented 4 months ago

Hello @rushilw-google ,

  1. SLA should be single digit minutes
  2. Here are some metrics we are interested in (non-exhaustive)
    • Timeout and error metrics, even on processes that happen outside of generateBid. The current forDebuggingOnly API is limited in that hooks are registered within generateBid. If downloading of the bidding script or trusted bidding signals times out, we never get to this point and are blind to such errors. Also, if the seller aborts the auction, no debugging signals are sent. Ideally, we can identify key internal steps between the top-level seller calling runAdAuction() to a buyer’s generateBid() and get counters/errors sent to a buyer’s reporting endpoint. This would allow a buyer to construct an auction funnel in order to identify potential bottlenecks to returning a bid.
    • Eligibility metrics: How many times our IG Owner was present in an AuctionConfig while IGs from our Owner existed on the browser.
    • Bidding script error rate. We should be able to get a relative error rate, i.e. 10/10 errors is bad, but 10/1e6 is probably OK. This can be derived from the various counters above.
    • Percentile/histograms for generateBid() execution time.
  3. As a buyer, we would like to know the component/top-level seller where our owner was included as an interested buyer. Once native and video formats get support, we would like to include the creative type as well.
  4. We envision having O(10^3) buckets of interest
  5. We would like to be able to detect a difference of 1% per bucket
  6. We have O(10^8) events world-wide that fall into the SLA period.
  7. Not very much tolerance on missed/delayed alerts. We’d rather have false-positives than false-negatives.
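
The relative error rate from item 2 and the 1% sensitivity from item 5 could be combined into an alert rule along these lines. This is a sketch only: the function names and the counter shape are hypothetical, and it reads "1%" as an absolute change of one percentage point, which is an assumption.

```javascript
// Hypothetical alert rule built on counters like those listed above.
function errorRate(errorCount, invocationCount) {
  return invocationCount === 0 ? 0 : errorCount / invocationCount;
}

// Flag a bucket when its error rate rises by more than `threshold`
// (absolute, e.g. 0.01 for one percentage point) over the baseline window.
function shouldAlert(baseline, current, threshold = 0.01) {
  const delta = errorRate(current.errors, current.invocations) -
                errorRate(baseline.errors, baseline.invocations);
  return delta > threshold;
}

const quietBaseline = {errors: 0, invocations: 1e6};
const allFailing = shouldAlert(quietBaseline, {errors: 10, invocations: 10});   // true
const backgroundNoise = shouldAlert(quietBaseline, {errors: 10, invocations: 1e6}); // false
```

Comparing rates rather than raw counts is what separates 10/10 from 10/1e6, and it requires an invocation counter alongside every error counter.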
michal-kalisz commented 4 months ago

Thank you for drafting the explainer. It looks very promising!

We wanted to raise a certain doubt regarding opt-in: In our understanding, opt-in should be independent of auction configuration (provided by the seller). A much more convenient solution would be if the buyer could indicate that they want to receive reports - this could be done at the time of attestation or through a separate http request.

Thanks, Michal

rushilw-google commented 4 months ago

Hi @michal-kalisz, thanks for your note! Regarding opt-in, we found it helpful to consider four aspects of the options for the opt-in mechanism - privacy implications, performance implications, build complexity/time and adtech preference beyond these reasons.

With that lens, sharing our thoughts on the options below:

1) Opt-in through buyer-seller coordination in auctionConfig: This is the proposed option and has privacy characteristics similar to trusted signals fetches, is quickest to implement and has minimal performance implications. We recognize that buyer-seller coordination would be required for this option but consider this feasible as some existing features already necessitate coordination, as in the case of perBuyerSignals.

2) Opt-in through attestation: This option is under evaluation for feasibility as attestations were not originally designed for this purpose. We are also considering the performance implications and build complexity. We will post an update once this evaluation is completed.

3) Opt-in through HTTP request: On the performance side, added cost and latency are the primary concerns with creating a new HTTP request for opt-in declaration. To mitigate these concerns, adtechs could include opt-in declaration in the HTTP response header of the bidding or scoring script fetch requests, but if there are errors in fetching the script or downstream from that, Real Time Monitoring would not be able to send any reports for those errors. This does not seem preferable for adtechs.

We plan to launch Real Time Monitoring with the buyer-seller coordination mechanism while continuing the discussion on other mechanisms, if you have any further thoughts on those.

JensenPaul commented 3 months ago

Much of the proposal has been implemented in Chrome and can be tested in Chrome Canary today using the --enable-features=FledgeRealTimeReporting command line flag.

ccharnay67 commented 1 month ago

Hello,

If much of the proposal has been implemented, does that mean that, when real-time reports are emitted, we should already be able to receive them on the .well-known/interest-group/real-time-report endpoint? If so, how are they serialized? The explainer is not explicit yet:

The serialization format of the report is TBD

Has the format been decided yet? Is it a raw vector of 1024 bytes with value 0 or 1, or is it more elaborate?

JensenPaul commented 1 month ago

The Real Time Monitoring API should now be available for testing in 50% of Chrome Canary and Dev channel traffic.

JensenPaul commented 1 month ago

we should be able to receive them on the .well-known/interest-group/real-time-report already?

Yes, if the API is enabled via the command line flag or as part of testing on pre-stable Chrome traffic, and your origin is opted in via the auction config.

If so, how are they serialized?

1226 specifies the serialization format.

dmdabbs commented 1 month ago

Running Canary with the cmdline flag, I can confirm that the POST requests are sent.

t=136077 [st=170] HTTP_TRANSACTION_SEND_REQUEST_HEADERS --> POST /.well-known/interest-group/real-time-report HTTP/1.1
   ....
   User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36
   ....
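
For anyone wanting to reproduce this, a receiving endpoint could be sketched roughly as follows. Only the method and path are taken from the log above; the counter, the 204 response, and the `handleRequest` name are choices made for illustration, and decoding the report body is left out.

```javascript
// Hypothetical receiver sketch for the POSTs shown in the log above.
const REPORT_PATH = '/.well-known/interest-group/real-time-report';
let receivedReports = 0;

function handleRequest(req, res) {
  // req.url is a path in Node's http server, so resolve it against a dummy base.
  const path = new URL(req.url, 'https://placeholder.invalid').pathname;
  if (req.method === 'POST' && path === REPORT_PATH) {
    receivedReports += 1; // body decoding is omitted in this sketch
    res.statusCode = 204;
  } else {
    res.statusCode = 404;
  }
  res.end();
}

// Usage with Node's built-in server (not started here):
// require('node:http').createServer(handleRequest).listen(8080);
```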

If you are also interested in receiving low entropy UACH so that you have the option to differentiate Mobile, &c. real-time events, note your interest here.

qingxinwu commented 1 month ago

The Real Time Monitoring API should now be available for testing in 50% of Beta channel traffic.

dmdabbs commented 2 days ago

Hello @qingxinwu. The FledgeRealTimeReporting cmdline flag is (for me) no longer enabling the feature in Canary.