TURTLEDOVE
https://wicg.github.io/turtledove/

Trusted Server Flexibility: Allow TEEs To Call Each Other More Flexibly, All TEEs can be KVs #1140

Open thegreatfatzby opened 7 months ago

thegreatfatzby commented 7 months ago

Overview

Thinking towards the genuinely private future, both on-device and in Bidding & Auction (B&A), I've been pondering how we're constraining architecture and domain models, which will ultimately constrain operations, cost structures, and utility. One of the constraints I think we could loosen without sacrificing privacy is how data can a) flow into different TEEs and b) travel once it's inside a Trusted Execution Environment.

Right now, if you tilt your head and squint real good, the KV Server almost looks like a real server that you can deploy code and data to: you can install WASM'ized C/C++/etc code, you can have data synced in batch or incrementally, you can shard data and do some load balancing...debugging is still a problem, but that's kind of a separate issue.
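For concreteness, here's a minimal sketch of what "deploying code" to the KV server can look like, following the HandleRequest entry point and getValues() lookup binding shape from the KV service's UDF explainer; wasmBytes and the exported adjustBid() are hypothetical stand-ins for WASM'ized C/C++ logic, and the return shape is simplified.

```javascript
// A minimal sketch, not a drop-in UDF: the entry point and getValues() follow
// the shape of the KV service UDF explainer; wasmBytes and adjustBid() are
// hypothetical, and the return shape is simplified.
async function HandleRequest(executionMetadata, ...udfArguments) {
  const { instance } = await WebAssembly.instantiate(wasmBytes);
  const keys = udfArguments.flatMap((arg) => arg.data ?? []);
  const kvResult = getValues(keys); // in-TEE lookup against the synced data
  return {
    keyGroupOutputs: keys.map((key) => ({
      key,
      // Post-process the stored signal with the compiled native code.
      value: instance.exports.adjustBid(Number(kvResult[key] ?? 0)),
    })),
  };
}
```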

The "almost" is because the inability to make network calls prevents typical service composition where you'd have different bits of logic owned by different teams, that can interact with each other, scale independently, have separate SLAs and QoS guarantees, and different cost structures. Additionally, currently not all TEEs are backed by the KV server and it's data storage.

So what I'd like to propose is that we allow:

  1. TEEs to talk to each other, still within limits.
  2. All TEEs (including buy/sell services) to be backed by the same KV data storage as the KV TEE.
  3. Ideally, very limited calls could be made out of the TEEs as well.

I understand (3) has privacy risks, so I'll ignore that here since I think (1) and (2) are more clear wins.

Example

To illustrate via example, I'd love to be able to propose the following dream world to our DSP folks, where we try to make "private replicas" of existing systems and topologies.

  1. Multiple KV Server Equivalents: We currently have, in effect, 4 real-time bidding signal stores, 1 each for a) 3PC, b) EIDs and CHIPS, c) frequency/recency data, and d) contextual segments. Those 4 systems are owned by 2 different teams. Each has a data store and some API logic around it. For each codebase we'd add a build step to WASMize it, and on deployment we'd automate deploying to both the normal on-prem targets and the private TEEs. The on-prem data stores would replicate to the KV store. The APIs here are simple, the data is very large, and different data has different value to the auction, so we'd likely give very powerful boxes to (a) and (c), but possibly less so (for now) to (b) and (d). Those boxes would also be controlled by different teams.
  2. Buyer Bidding Service, 1: Our main bidding engine, owned by a different team than the above and with different deployment/development needs and timelines, gets close-to-real-time updates not from user data but from businesses, and that is used as part of the bidding process. Let's make that the Bidding Service TEE, where the same WASMization build/deploy applies as in (1): we replicate business-object updates into the data plane (since it's now backed by a TEE), and the JS handles taking the generateBid call and mapping it to the WASMized logic (see the sketch after this list). These boxes wouldn't need as much storage but would need a lot of memory and compute.
  3. Buyer Bidding Service, 2: Since the KV-TEE Bidding Service now has access to the business objects it needs to bid, and we can determine campaign eligibility more flexibly, we need to get budget data more dynamically...fortunately, a different team owns its own TEE-based KV that the Bidding Service can hit directly. The owning team updates spend data on-prem in a budget service in NRT (near real time), and that is replicated to the "private replica", with fairly high SLAs and QoS on it.
  4. Buyer Bidding Service, 3: We have a number of subsystems that help with various lookups, logic, inferencing, predictions, etc. Some of them we can live without if they're down, so we might make them relatively weak private replicas.
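To make item (2) concrete, here's a rough sketch of that JS shim. The generateBid signature is the standard Protected Audience one, but biddingModule, computeBid(), and encodeSignals() are hypothetical stand-ins for the WASMized bidding engine.

```javascript
// A rough sketch, not a working integration: biddingModule, computeBid(),
// and encodeSignals() are hypothetical stand-ins for the WASMized engine.
function generateBid(interestGroup, auctionSignals, perBuyerSignals,
                     trustedBiddingSignals, browserSignals) {
  // Map the Protected Audience inputs onto the compiled on-prem logic.
  const bid = biddingModule.exports.computeBid(
      encodeSignals(interestGroup, perBuyerSignals, trustedBiddingSignals));
  return { bid, render: interestGroup.ads[0].renderURL };
}
```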

Specifics

Inter TEE Communication

I'd propose the following:

  1. TEE owners can send any request, with any data available to them in the private process, to other TEEs they own.
  2. A TEE owner can send a request to a non-owned TEE, but is restricted to forwarding the original request with the IG data still encrypted, to ensure that only the data that should be accessible to the recipient is present, and that the recipient needs its key to decrypt its data. (To be clear, I say "TEE owner to non-owned TEEs" rather than "Seller can send to Buyer" because it could open the door to interesting vendor service structures.) A sketch of both rules follows the list.
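Purely to illustrate those two rules (sameOwner() and the request fields are invented names, not an existing API):

```javascript
// Invented illustration of the two proposed rules, not an existing API.
function buildOutboundRequest(target, originalRequest, privateProcessData) {
  if (sameOwner(target)) {
    // Rule 1: TEEs under the same owner may exchange any data available
    // to them in the private process.
    return { ...originalRequest, ...privateProcessData };
  }
  // Rule 2: a non-owned TEE only gets the original request with the IG data
  // still encrypted; the recipient needs its own key to read its slice.
  return {
    encryptedIgData: originalRequest.encryptedIgData,
    contextualSignals: originalRequest.contextualSignals,
  };
}
```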

All TEEs can be KVs

Currently we've made it so the Buyer Logic gets all its real-time data from the KV server but isn't co-located with that data...I'm not sure I see a privacy reason for this if the server is trusted...so just allow the BFE/SFE/etc. to be backed by a KV.
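In other words (a hypothetical sketch; localStore, its get(), and kvTeeClient are invented names, not an existing API):

```javascript
// Hypothetical sketch of a bidding service "backed by a KV": the KV data
// plane is replicated into the same TEE, so signal reads become local.
async function getBiddingSignals(keys) {
  // Today: a network hop to a separate KV TEE, e.g.
  //   const signals = await kvTeeClient.lookup(keys);
  // Proposed: the same keys served from co-located storage.
  return Promise.all(keys.map((k) => localStore.get(k)));
}
```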

Conclusion

Allowing this would get us much closer to an ad tech being able to replicate its current operations, team/system ownership structures, logic, cost structures, etc. in the private environment, rather than having to re-implement both logic and topology and merge different teams' systems and operations, which will inevitably lead to issues.

peiwenhu commented 6 months ago
> 1. TEEs to talk to each other, still within limits.
> 2. All TEEs (including buy/sell services) to be backed by the same KV data storage as the KV TEE.
> 3. Ideally, very limited calls could be made out of the TEEs as well.
>
> I understand (3) has privacy risks, so I'll ignore that here since I think (1) and (2) are more clear wins.

I've had server composition support in mind for KV for a while, which I hinted at in privacysandbox/protected-auction-key-value-service/issues/10.

But even for (1) there's the privacy risk of traffic analysis. The KV server sharding functionality is a more confined case of the vision here, and preventing traffic analysis there is still a pain. So we have not invested more in server composition.

My hope was to see how the sharding usage pattern pans out in the real world and whether that gives us more data points on supporting even more advanced topologies. But if you think this is a promising direction for real uses, maybe we can prioritize researching it sooner.

thegreatfatzby commented 6 months ago

Hey @peiwenhu I personally see it as an incredibly useful direction, but of course I'd really love to hear from folks like @rdgordon-index @jonasz @fhoering @lbdvt @davideanastasia and others.

That said, to elaborate on my thinking, let's say we could snap our fingers and say that ad techs could focus on two things:

  1. WASM'izing their code bases for usage in generateBid, scoreAd, etc.
  2. Replicating their data sets into TEEs that they could shard/route to as needed.

And that we'd rely on the BA/ASAPI framework to coordinate the inputs into bidding and auction functions, enforce output privacy, and, if it makes sense, input privacy as well. Then I really think we'd be in a different world than today, where we have to completely change topology, domain models, logic, cost structures, etc. I think we'd still have a lot of problems to solve (debugging in enclaves, performance overhead), but we'd be solving them within the kinds of architectural and modeling constraints we've spent man-centuries developing expertise in.

palenica commented 6 months ago

/cc @yarongmu-google

peiwenhu commented 6 months ago

Tentatively we plan to look into this in H2 this year. We expect there will be constraints for privacy reasons, and we can only focus on the lower-hanging fruit for now. For example, there could be a fixed-size requirement on TEE-TEE requests/responses, there could be a fixed number of endpoints each TEE server knows that cannot be changed at runtime, and requests could need to be sent to all said endpoints, etc. So it would still require some careful design from the server operator side, albeit closer to the more classic architectural model.
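Purely to illustrate the shape of those constraints (not a proposed API; ENDPOINTS, loadEndpointsAtStartup(), padToLength(), and serialize() are invented names):

```javascript
// Invented illustration of the constraints above, not an actual API.
const ENDPOINTS = Object.freeze(loadEndpointsAtStartup()); // fixed at runtime
const FIXED_REQUEST_BYTES = 4096; // every request padded to the same size

async function fanOut(payload) {
  const body = padToLength(serialize(payload), FIXED_REQUEST_BYTES);
  // Identical fixed-size requests go to every known endpoint on every call,
  // so an observer learns nothing from request size or destination.
  return Promise.all(
      ENDPOINTS.map((ep) => fetch(ep, { method: 'POST', body })));
}
```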