WICG / turtledove

TURTLEDOVE
https://wicg.github.io/turtledove/

K-Anon Server and Data Operations #728

Open thegreatfatzby opened 1 year ago

thegreatfatzby commented 1 year ago

I'm curious if there will be any public discussion of the k-anon server data operations. Given it'll be a critical dependency of many businesses, ours included, I'd be interested in things like:

JensenPaul commented 1 year ago

@kgraney might have some answers

thegreatfatzby commented 12 months ago

@kgraney @JensenPaul any thoughts?

kgraney commented 12 months ago

Sorry, I missed the first mention in July. We're running the k-anon server on Google's internal infrastructure, so it has rather robust availability/reliability and is staffed with 24/7 SRE support. The high-level architecture is that reads are highly available, with low latency and lots of replication; basically we're pushing the entire list of things that are k-anonymous to many different data centers around the world on a periodic basis. A failure of any given data center or region will not impact read requests to the others, and read requests are routed with preference for nearby replicas.

This data push is subject to differential privacy constraints, so rate limiting is important (pushing too often harms privacy) and noise is added (more noise obfuscates the behavior of individual users). It's a global push, so all objects update at the same cadence. You can read more about differential privacy considerations in our explainer.
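
To sketch what that noisy thresholding looks like in practice (illustrative only; the threshold and noise scale here are made up, and the real mechanism is the one described in the explainer):

```python
import random

# Illustrative parameters only; the real values and noise mechanism are the
# ones described in the k-anonymity server explainer.
K_THRESHOLD = 50
NOISE_SCALE = 5.0

def noisy_k_anonymous(distinct_user_count: int) -> bool:
    """Decide, with differential-privacy noise, whether one object's hash is
    published as k-anonymous in this period's push."""
    # Difference of two exponentials gives Laplace-distributed noise, so the
    # published list doesn't reveal exact counts near the threshold.
    noise = random.expovariate(1 / NOISE_SCALE) - random.expovariate(1 / NOISE_SCALE)
    return distinct_user_count + noise >= K_THRESHOLD

# The periodic push then replicates only the hashes that pass the noisy check.
period_counts = {"sha1-of-object-a": 120, "sha1-of-object-b": 7}
published = {h for h, n in period_counts.items() if noisy_k_anonymous(n)}
print(published)  # very likely {"sha1-of-object-a"}
```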

Writes to the server, i.e. Chrome reporting that it's "part of an IG" or something similar, are subject to higher latency, but writes do have robustness against regional outages. Writes block on persisting the request to Spanner, which we depend on to provide regional failover in case the leader replica has an outage. The Spanner database is read and processed as part of periodic pushes to the servers that handle read requests, i.e. reads don't directly depend on Spanner availability other than for data freshness.

As you point out, for privacy reasons we can't really offer additional insight into specific objects beyond what the public API offers. Your best bet for debugging this type of data is to monitor the presence of the SHA-1 hashes for your IG URLs or other objects using the existing public APIs (the same ones Chrome queries). It's also worth pointing out that we don't actually track any of these objects on a per-adtech basis. Objects in the system are just SHA-1 values in a de-normalized list, and we're not trying to map them to any ownership graph, so there's really limited adtech-specific insight we can offer. Our SRE team does monitor system health metrics like freshness, availability, latency, etc.
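
So, roughly, checking one of your own objects could look like this (a sketch only; the fact that objects are keyed by SHA-1 hashes is from above, but the query client, endpoint, and the exact string that gets hashed are placeholders, not a spec):

```python
import hashlib

def object_hash(object_string: str) -> str:
    """SHA-1 of the object's canonical string; the system only ever sees this hash."""
    return hashlib.sha1(object_string.encode("utf-8")).hexdigest()

def is_k_anonymous(object_string: str, published_hashes: set[str]) -> bool:
    """True if the object's hash appears in the set returned by the public query API
    (the same API Chrome queries); fetching that set is left as a placeholder here."""
    return object_hash(object_string) in published_hashes

# Made-up example values.
published_hashes = {object_hash("https://dsp.example/ig-join-url")}
print(is_k_anonymous("https://dsp.example/ig-join-url", published_hashes))  # True
```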

Tagging @gwhitehawk for anything I missed.

thegreatfatzby commented 12 months ago

This is a fantastic response, though I feel I should say it's not necessary to reply at ~midnight EST, although I guess I don't know where in the world you are.

A couple of follow-ups:

24/7 SRE support sounds good, and I believe Google has a lot of skill at maintaining mission-critical systems. On the remote chance that the k-anon server is down, slow, spewing 400s, etc., and the even more remote chance that that is happening and the SREs are unaware, how will we inform them?

I don't think I have any questions on the global and inter-data-center replication piece, but I'm a little curious about more detail on the intra-data-center replication. I assume a failure of a single node in a given DC's k-server won't result in failover, either due to k-safety within the cluster (an overloaded use of k; here I mean data still being available within the DC given k = 1/2/etc. node failures) or some other data awareness.

On the writes, a question I hadn't thought of: is there any possibility of the write to the k-server being delayed until after the browser is closed (and thus dropped)? I know this is an issue on the reporting side (for ARA in particular), and I guess depending on the implementation it could be here too.

> Objects in the system are just SHA-1 values in a de-normalized list, and we're not trying to map them to any ownership graph, so there's really limited adtech-specific insight we can offer.

Why not shard by some function of IG owner, to share less across all ad tech?

kgraney commented 12 months ago

I'll ask someone else to follow up on support channels for adtech. I know they are being set up, but I haven't been following the details.

Writes from the browser are queued locally to avoid slowing down browsing, and we obviously don't control the user device, so we can't guarantee that it will remain powered on and connected to the internet until the queue is flushed. You can look into exactly how IG k-anonymity write requests are handled here: https://source.chromium.org/chromium/chromium/src/+/main:content/browser/interest_group/interest_group_k_anonymity_manager.cc;l=102-120;drc=6f4c64436342c818aa41e6a5c55034e74ec9c6b6
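
The pattern is roughly this (an illustrative sketch, not the Chromium code; see the link above for the real implementation):

```python
import queue

class KAnonWriteQueue:
    """Illustrative sketch only (not the Chromium implementation): k-anonymity
    writes are appended to a local queue so they never block browsing, and are
    flushed opportunistically. Anything still queued when the device shuts down
    or loses connectivity is simply lost."""

    def __init__(self, send_fn):
        self._pending = queue.Queue()
        self._send = send_fn  # e.g. an HTTPS POST of the object hash to the write endpoint

    def record_join(self, object_hash: str) -> None:
        # Called on the browsing path; returns immediately.
        self._pending.put(object_hash)

    def flush(self) -> None:
        # Called off the critical path; best-effort only.
        while True:
            try:
                item = self._pending.get_nowait()
            except queue.Empty:
                return
            try:
                self._send(item)
            except Exception:
                pass  # a failed or never-attempted send may just be dropped
```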

Within the same data center there is redundancy with other local machines. Read requests can be retried on a different machine and writes are eventually persisted to Spanner in a leader replica, which might be in a different data center from the one processing the HTTP request. Spanner leader replicas have internal redundancy & failover.

There is a small period of time when writes are queued (and batched) prior to persistence in Spanner. We might drop these in-flight writes if individual machines fail, dropping only the portion queued on the failing machines. This trade-off was made for efficiency, acknowledging that the system is slightly noisy (in its output) and that the ultimate source of writes (the queue within Chrome) isn't itself highly reliable. For example, a user could delete their Chrome profile while requests are still queued on their device, or their device might not respect a request from the server to retry a failed write.
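
Roughly, the write side of each frontend does something like this (a sketch of the trade-off only, not the production code; the names and batching interval are invented):

```python
import threading
import time

class WriteBatcher:
    """Sketch of the trade-off above: each frontend machine buffers incoming
    writes in memory and commits them to the durable store (Spanner in the real
    system) in periodic batches. If the machine dies between commits, only its
    local, not-yet-committed buffer is lost."""

    def __init__(self, commit_fn, interval_s: float = 1.0):
        self._buffer = []
        self._lock = threading.Lock()
        self._commit = commit_fn  # e.g. a transactional insert into the database
        self._interval = interval_s

    def add(self, write) -> None:
        with self._lock:
            self._buffer.append(write)  # held in memory until the next periodic commit

    def run_forever(self) -> None:
        while True:
            time.sleep(self._interval)
            with self._lock:
                batch, self._buffer = self._buffer, []
            if batch:
                self._commit(batch)  # durable (and regionally replicated) once this returns
```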

As I mentioned we don't maintain a notion of "IG owner", or really anything IG specific, for the k-anonymity server.

thegreatfatzby commented 12 months ago

Fantastic again, thanks so much.

My only follow-up is on the IG owner and k-anon partitioning piece... understood, but then maybe I'd ask "why not do that?" I guess with the design as currently planned, are you assuming that resilience will come from the other pieces and that the odds of one client taking down the others are low? I'm thinking about a case where, for some reason, one ad tech uses IGs in a creative and fun way that challenges the k-servers (latency or availability) by making loads of extra calls to check k-anonymity for IG names and/or creatives.

michaelkleber commented 12 months ago

This design has a lot of data minimization work behind it. With your suggestion, Isaac, it seems like the folks administering the k-anonymity servers would be able to learn some approximation of "how many IGs does each ad tech have?" or "how many ads is each ad tech showing?", and that's not information that we want any access to.

kgraney commented 12 months ago

As Michael said, we don't want to know which objects are created by which adtech.

To address the concern about resilience: normally Chrome decides when to make requests to the servers, so user traffic is throttled/controlled by the browser. In theory, though, the servers can be called from outside the browser by anyone (they're more or less public), so sharding the servers by some dimension wouldn't prevent anyone from making those external calls to any given shard.

In general read requests are very cheap for us, so we can handle a lot of load. For write requests, though they're also relatively cheap, we have some anti-abuse restrictions in place that put a cost on making many requests.

thegreatfatzby commented 12 months ago

@michaelkleber I think that makes sense; can you elaborate a bit on the concern? Is it just not wanting one business (Google) to know another business's data, or is this part of the deeper privacy-vector thinking around things like re-identification?

michaelkleber commented 12 months ago

Certainly any data that leaks in plaintext carries some amount of re-identification risk. But mostly I was talking about avoiding Chrome's servers, or Google, learning anything unnecessary.

p-j-l commented 12 months ago

And just to reply to the "how to escalate if the K-Anonymity Service is down" question: right now I don't think we have an escalation path set up for that, but it's an entirely fair request, so we'll see if we can do something similar to the support structures for other bits of the Privacy Sandbox (e.g. the Aggregation Service).

thegreatfatzby commented 12 months ago

@pjl-google that's great! While we're all still fluttering around this, just to clarify, do I need to do anything to put that request in?

p-j-l commented 12 months ago

Nope, this issue should work fine for that. (We actually need this sort of escalation/support structure for the K-Anonymity Service, Bidding/Auction Service, Aggregation Service, etc., so we're trying to group things together and get a coherent setup across all of them.)