TEE Inference service - Githubissues

pstorm commented 4 months ago

As part of last week's call, I'm raising this to request for more details about TEE inference service. Will ONNX runtime be supported in this inference service?

akshaypundle commented 4 months ago

Hi!

For more details regarding our inference proposal, please see our explainer here: https://github.com/privacysandbox/protected-auction-services-docs/blob/main/inference_overview.md

We are planning a presentation in the WICG discussion, tentatively for 10th Apr. We will add this to the agenda (https://docs.google.com/document/d/1Hk6uW-i4KPUb-u20E-EWbu8_EbPnzcptnhV9mxBP5Mo/edit#heading=h.5xmqy05su18g) once this is finalized.

Initially we will support Tensorflow (C++) and PyTorch libtorch runtimes, but not ONNX runtime. We can investigate support for ONNX runtime in the future based on feedback we receive from the community.

Akshay

akshaypundle commented 4 months ago

We recently open sourced the code for inference. Feel free to check it out at: https://github.com/privacysandbox/bidding-auction-servers/tree/release-3.4/services/inference_sidecar

thegreatfatzby commented 4 months ago

Hey @akshaypundle is the following correct, that the inference service:

Must live on the same host that runs generateBid().
Cannot be on a host that has storage attached?

If that's right, what is achieved by those constraints? I'm thinking through this more broadly for server side private auctions in general (including ASAPI), but specific to inference why not do something like:

The Bidding Service itself can be backed by the KV Server.
Better yet, the Bidding Service can make any* call out to any other TEEs owned by the same tech in the same network running Roma with the appropriate Chromium attested code, including ones backed by the KV Server using whatever UDF the ad tech defines; those boxes would themselves have the same requirement.

The second option would open up a few really valuable options:

Decoupling functionality is one of the main ways to achieve cost savings with both engineering and domain/organizational complexity scale. Engineering scale is helped by being able to tune and assign different workloads to different types of machines with different data, SLA, and scaling needs (not all requests may need Function/Model X and core bidding can happen if it fails, whereas Function/Model Y needs very high reliability and breaks bidding if it goes down...X and Y being coupled physically prevents independent scaling and operations, which is key to running a distributed system); Domain/organizational complexity can be solved by splitting systems between teams and evolving over time, and it causes tons of friction when different teams have to share the same codebases and/or production hardware.
We're (and I include ASAPI in this) forcing a lot of structure on the auction, which will limit the Domain Models that businesses can create, which will ultimately limit not just their operations (see previous point) but the features they can build, and not to get philosophical but constrain the way we think about auctions. For instance, I'm not clear why the Ad Retrieval Service wouldn't be able to run an inference sidecar.
This would also allow for topologies that more closely resemble the non-TEE distributed systems that we own, and make it easier for us to try to manage one set of codebases that run WASMized in TEEs and non WASMized in normal servers.

I understand and support having carefully crafted output gates from the TEEs, but I'm wondering, both in general but here for inference, why we wouldn't open things up within the TEEs, and avoid constraining operations within them.

thegreatfatzby commented 4 months ago

Also @akshaypundle re the demo, Wed April 10th a bunch of folks will be at the PATCG F2F.

akshaypundle commented 4 months ago

Hi Isaac,

Thanks very much for the feedback.

Regarding constraints of where inference can be run:

Our initial proposal is to implement inference on the generateBid and prepareDataForAdRetrieval UDFs (on the bidding servers). If there is ecosystem interest, we could expand the sidecars to Auction servers or K/V servers. This means that the same inference API can be made available to UDFs running on Auction, K/V or other TEE servers in the future.

Regarding separate scaling of servers, such considerations do exist in the current PA/PAS systems, and that is one of the reasons that the K/V servers are separate from the B&A servers. Extracting inference into its own service comes with privacy challenges though. For example, just the fact that an inference call was made from a UDF is a 1 bit leak. Observing the traffic to the servicing TEE will provide an observer this information. Since the number of runInference calls is not limited, we can probably come up with a scheme to leak any number of bits from the UDF just by observing whether inference calls were made or not. Such cases use techniques like chaffing (see here) to reduce privacy risk, but these add to cost. We would need a detailed analysis of the threat model and mitigations before we can have a separate inference service..

Pulling out inference as its own service is something that we will consider more in the future, as we look at more use cases and independently scaling the service, especially once machines with accelerators are available in the TEE. Our current architecture runs a GRPC service inside the sandbox (on the same host). This could be extracted into a distinct service in the future - the GRPC design helps keep our options open. The current design helps us sidestep some privacy concerns and deliver a valuable feature quicker, so we can get feedback and iterate on the design. As we get more data, we are open to making changes to the architecture to provide maximum flexibility / utility while preserving user privacy.

Regarding your proposal:

I am not sure I fully understand your proposal. Are you saying that you would like to run ad tech provided javascript code inside Roma on a different machine, which can be accessed from UDFs (all involved machines being TEEs)? Where will the inference code run? Will it run inside Roma (javascript / wasm)? What backends will run inference?

In our proposal, we run the C++ inference library backend (for TF and PyTorch), and make this available to the javascript UDFs through an API. This means the predictions are run on mature battle tested, performant systems (the core TF and PyTorch libraries).

The next WICG-servers call is scheduled for the 24th April and we are planning to do the inference presentation in that call. I’m happy to discuss this more.

Thanks! Akshay

fhoering commented 3 months ago

Our initial proposal is to implement inference on the generateBid and prepareDataForAdRetrieval UDFs (on the bidding servers). If there is ecosystem interest, we could expand the sidecars to Auction servers or K/V servers. This means that the same inference API can be made available to UDFs running on Auction, K/V or other TEE servers in the future.

@akshaypundle Can you confirm that adding inference to prepareDataForAdRetrieval means that this API will be available from inside inside the key-value-service code base ?

I found some doc about ad retrieval overview but not this dedicated API https://github.com/privacysandbox/protected-auction-key-value-service/blob/release-0.16/docs/ad_retrieval_overview.md

For simplicity it seems important to be able to run inference from inside one single container which seems to be the key value service TEE (as described in the ad retrieval doc). Spawning a cluster of TEEs seems premature and can still be done a later optimization stages.

galarragas commented 3 months ago

@fhoering , the prepareDataForAdRetrieval is not run in the K/V but in the Bidding server, You can find some extra details in the Android PAS documentation. As of today there has been no recognized need in the PAS architecture to have inference capabilities on the K/V server, although technically feasible as @akshaypundle says.

fhoering commented 3 months ago

My understanding of this document is that it is not specific to Android but that the workflow could be applied even on Chrome web on-device workflows.

Today the key/value service already supports UDFs. So for on-device auctions which is what is currently being tested it seems natural to also support inference without having to deploy and maintain additional bidding servers.

galarragas commented 3 months ago

My comment was just related to the prepareDataForAdRetrieval that is a function part of the PAS architecture, currently only supportedon Android.

akshaypundle commented 3 months ago

Thanks @fhoering and @galarragas.

@fhoering , to answer your question, inference will not initially be available from the K/V codebase. It will be available only on the bidding server initially. The bidding server runs both the prepareDataForAdRetrieval and the generateBid functions, and inference will be available from these 2 functions.

The prepareDataForAdRetrieval runs on the bidding server. It's outputs are sent to the K/V (ad retrieval) server. These outputs can be used to ad filtering, etc. on the K/V server. For details, see Android PAS documentation.

In the future, it may be possible to extend inference to K/V or other TEE servers (e.g. Auction server). But as of now, the additional inference capabilities will only be available on the Bidding server initially.

fhoering commented 3 months ago

@fhoering , to answer your question, inference will not initially be available from the K/V codebase. It will be available only on the bidding server initially. The bidding server runs both the prepareDataForAdRetrieval and the generateBid functions, and inference will be available from these 2 functions.

@akshaypundle It should me made available. How to move forward on this ? Should I create a new github to formally ask for support and why this is needed or should we put it on the agenda of the WICG call for discussion ?

akshaypundle commented 3 months ago

Hi @fhoering , Yes, I think creating a new github and adding it to the agenda for discussion sounds good!

Akshay

thegreatfatzby commented 3 months ago

Hey @akshaypundle and @galarragas want to confirm in which of the 3 flows Inference Service (runInference) will be available.

Android Protected App Signals
Android Protected Audience Signals
Chrome Protected Audience Signals using Bidding and Auction Services

I think the documentation plus overlapping name of "Protected Audience Signals", despite the APIs having differences, is causing me some confusion. For instance:

The comment above seems to indicate it's only available on Android.
The main fledge explainer, fledge browser BA don't mention "inference" or "runInference".
The Inference Overview indicates "Protected Audience" but I don't think I see it definitively say whether that includes Chrome initiated flows.
EDIT: also not clear from 3/27 or 4/24 notes, although that might be a notes review issue rather than reflective of the conversation.

My best guess ATM is that it's supported in both flows on Android and not the Chrome one?

thegreatfatzby commented 2 months ago

@akshaypundle @TrentonStarkey while there's some energy on this, ping on the specific comment right above, RE in which flows inference is available.

TrentonStarkey commented 2 months ago

@thegreatfatzby Currently, the inference service only works with Android Protected App Signals. We plan to expand this to Protected Audience auctions in B&A for both Chrome and Android, but we don't have specific timelines yet. We'll share more details in public explainers as they become available.

WICG / protected-auction-services-discussion

TEE Inference service #55

Regarding constraints of where inference can be run:

Regarding your proposal: