WICG / turtledove

TURTLEDOVE
https://wicg.github.io/turtledove/

Private model training: Improving the efficacy of `modelingSignals` #1017

Open csharrison opened 7 months ago

csharrison commented 7 months ago

This issue aims to help improve support for bid optimization in Protected Audiences without impacting the privacy stance of the API. This use-case typically involves predicting outcomes such as clicks or conversions in order to compute a bid.

Models that learn these predictions are typically trained via supervised learning, i.e. with examples labeled with an outcome (click, conversion, etc.).
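For concreteness, a labeled example for such a model might look like the sketch below; the field names are purely illustrative and not part of any API.

```ts
// Illustrative only: field names are hypothetical, not part of the API.
// A supervised training example for bid optimization pairs features known
// at bid time with the observed outcome.
interface TrainingExample {
  modelingSignals: number;                      // in-browser / user features (a 12-bit value today)
  contextualFeatures: Record<string, number>;   // publisher page, ad slot, etc.
  label: 0 | 1;                                 // outcome: click or conversion observed
}
```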

There are two techniques we are exploring to improve the status quo here:

  1. A mechanism where modelingSignals can be encrypted and processed in a trusted server environment, where we can offer private model training algorithms.
  2. An improved privacy mechanism to release modelingSignals directly to reportWin. This could look like changes to the existing randomized response mechanism.

Of these two techniques, we think (1) will provide the most utility for this use-case, although it introduces the most complexity to the system.
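For reference, the existing modelingSignals mechanism mentioned in (2) can be thought of as a k-ary randomized response over a small integer. The sketch below assumes a 12-bit signal and a roughly 1% substitution probability; both numbers are stated here as assumptions, not as a citation of the current spec.

```ts
// Sketch of a k-ary randomized response over a 12-bit signal, in the spirit
// of the current modelingSignals noising. The noise probability below is an
// assumption for illustration, not a spec citation.
const SIGNAL_BITS = 12;
const NOISE_PROBABILITY = 0.01; // assumed; check the spec for the real value

function noiseModelingSignals(trueSignal: number): number {
  const k = 1 << SIGNAL_BITS; // 4096 possible values
  if (Math.random() < NOISE_PROBABILITY) {
    // Replace the real value with a uniformly random one.
    return Math.floor(Math.random() * k);
  }
  return trueSignal & (k - 1);
}
```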

I am filing this issue to collect feedback about the model training use-case. I think we have a pretty good understanding of the shortcomings of the existing modelingSignals approach (mainly from a low-dimensionality standpoint). However, there are many auxiliary use-cases and developer journeys involved in training models.

We’re interested in better understanding these kinds of use-cases. What are we missing? Please let us know, through this issue, if there are other use-cases we should consider when thinking through improvements here.

cc @nikunj101 @michaelkleber

mvono commented 6 months ago

Thanks a lot Charlie for raising this issue, which is of high interest for Criteo, as the performance of the ad campaigns we serve is tightly linked to how efficient and precise our machine learning (ML) algorithms are.

First, I want to provide you with some background and details regarding the Criteo ML use-cases where more support from Chrome would be needed. I will restrict these use-cases to bid optimisation, which is an important ML use-case on our side and also the focus of this issue.

We are operating a whole AI system for bidding which involves the kinds of auxiliary use-cases you mentioned at the end of your post.

We appreciate the fact that you are considering several alternatives to improve the status quo around model training, and we are committed to helping you define the best one for Criteo and the whole industry. As of today, since no precise technical specification is available, it is quite difficult to decide which approach to push for. In order to help you in that direction, could you provide more details/insights on the following points:

Approach 1 (modelingSignals processed in a trusted server)

Approach 2 (modelingSignals released to reportWin but with local DP)

  • If full local DP is considered, we are wondering if:

  1. releasing continuous/numerical information with local DP noise (Gaussian, Laplace) will be considered;

  2. we could choose how to split the privacy budget between the different dimensions?

Thanks, Maxime Vono (Criteo).

nikunj101 commented 6 months ago

Thank you Maxime for sharing detailed thoughts on the use case. We are currently in the early stages of exploration, and some of these details may change as we finalize the mentioned API design. Sharing some early thoughts on your specific questions below:

  • In a previous GitHub issue (Protected Audience Opt-In TEE K/V Mode · Issue #892 · WICG/turtledove), Michael hinted that if computations are performed within a trusted server, the browser could send more information to that server. Are you envisioning making more features available within the trusted server for training on modelingSignals? If so, what type of features could be available?

We are exploring mechanisms that allow you to generate a custom value of modelingSignals exactly the same way as today, but with a relaxed size constraint and noising mechanisms different from randomized response. This comes with the caveat that the payload would be encrypted and only accessible in TEEs.
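To make the current flow concrete, here is a minimal sketch of how a buyer packs features into modelingSignals inside generateBid() today. The feature encoding is illustrative, and the idea that the same value could be larger and encrypted under Approach 1 is an assumption about the design being explored, not a shipped API.

```ts
// Rough sketch of today's flow: generateBid() packs buyer-defined features
// into the 12-bit modelingSignals value. Under Approach 1 (an assumption,
// not a shipped API), the same hook could produce a larger payload that
// reaches the buyer only in encrypted form, decryptable inside a TEE.
function generateBid(interestGroup: any, auctionSignals: any,
                     perBuyerSignals: any, trustedBiddingSignals: any,
                     browserSignals: any) {
  // Illustrative feature encoding; which signals get packed is up to the buyer.
  const features = interestGroup.userBiddingSignals ?? {};
  const segment       = (features.segment ?? 0) & 0xF;       // 4 bits
  const recencyBucket = (features.recencyBucket ?? 0) & 0xF;  // 4 bits
  const valueBucket   = (features.valueBucket ?? 0) & 0xF;    // 4 bits
  const modelingSignals = (segment << 8) | (recencyBucket << 4) | valueBucket;

  return {
    ad: interestGroup.ads[0].metadata,
    bid: 1.0,
    render: interestGroup.ads[0].renderURL,
    modelingSignals, // 12 bits today; larger and encrypted under Approach 1
  };
}
```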

  • For the sake of end-to-end training, we need to train supervised ML approaches not only on user/in-browser features encoded in modelingSignals but also on other features, such as contextual ones. Would the latter features be available within the trusted server?

We are considering passing encrypted modelingSignals via reportWin the same way as today, without restricting access to other signals in the reportWin functionality. The complete reports (which can contain contextual signals) collected via reportWin should be processable in TEEs for model training.
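A rough sketch of that reporting path, assuming the current reportWin() surface: the report endpoint and the perBuyerSignals field used here are illustrative, and the "encrypted payload" variant is only an assumption about Approach 1.

```ts
// Sketch of the reporting side: reportWin() already receives contextual
// signals (auctionSignals, perBuyerSignals) alongside the (noised)
// modelingSignals. Under Approach 1 (an assumption, not a shipped API),
// modelingSignals would instead arrive as an encrypted blob that only a
// TEE can open, while the rest of the report stays in the clear.
declare function sendReportTo(url: string): void;

function reportWin(auctionSignals: any, perBuyerSignals: any,
                   sellerSignals: any, browserSignals: any) {
  const report = {
    // Contextual features, visible to the reporting server as today.
    publisher: browserSignals.topWindowHostname,
    campaign: perBuyerSignals?.campaignId,        // illustrative field
    // Today: a noised 12-bit integer. Approach 1: an encrypted payload.
    modelingSignals: browserSignals.modelingSignals,
  };
  // Endpoint is illustrative; the collected reports would later feed a
  // TEE-based training pipeline.
  sendReportTo('https://adtech.example/report?data=' +
               encodeURIComponent(JSON.stringify(report)));
}
```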

  • The ML training computation can be seen as a sophisticated aggregation strategy over raw data. Would the envisioned API feature allowing for such training be similar to ARA Aggregated Reporting?

Yes, at a very high level, the API feature should look very similar to processing reports in ARA aggregate reporting.

  • Would not applying global differential privacy (DP) be considered if the report collector (e.g. Criteo) does not query the model but instead sends it to the K/V store for inference? Or are you considering global DP as a safeguard against side-channel attacks?

Could you confirm whether the question is about training models without DP when serving happens in TEEs, or about not applying DP at inference time? In general, we are only exploring model training techniques that guarantee differential privacy, irrespective of where we serve the models. Allowing inference on a model trained without differential privacy risks leaking sensitive user information. In the above settings, where the model is trained with DP, inference can potentially be shared in non-noised form.
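As a point of reference for "model training techniques that guarantee differential privacy", the sketch below shows a generic DP-SGD-style update (per-example gradient clipping plus Gaussian noise) for a logistic model. It illustrates the class of technique being discussed, not the specific algorithm that would run inside a TEE; all hyperparameters are placeholders.

```ts
// Minimal DP-SGD-style update for a logistic model: clip each example's
// gradient, sum, add Gaussian noise calibrated to the clipping norm, then
// average. A generic illustration, not a committed training algorithm.
function gaussianSample(): number {
  // Box-Muller transform.
  const u = Math.random() || 1e-12;
  const v = Math.random();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

function dpSgdStep(
  weights: number[],
  batch: { x: number[]; y: number }[],  // y in {0, 1}
  clipNorm: number,
  noiseMultiplier: number,
  learningRate: number,
): number[] {
  const d = weights.length;
  const gradSum: number[] = new Array(d).fill(0);

  for (const { x, y } of batch) {
    // Per-example gradient of the logistic loss.
    const z = x.reduce((s, xi, i) => s + xi * weights[i], 0);
    const p = 1 / (1 + Math.exp(-z));
    const grad = x.map((xi) => (p - y) * xi);

    // Clip the per-example gradient to bound each example's influence.
    const norm = Math.sqrt(grad.reduce((s, g) => s + g * g, 0));
    const scale = Math.min(1, clipNorm / (norm || 1e-12));
    grad.forEach((g, i) => (gradSum[i] += g * scale));
  }

  // Add Gaussian noise scaled by the clipping norm, then take the average step.
  return weights.map((w, i) => {
    const noisy = gradSum[i] + noiseMultiplier * clipNorm * gaussianSample();
    return w - learningRate * (noisy / batch.length);
  });
}
```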

  • Regarding points 2-4 above, we would need such metadata without noise. Otherwise, it would be quite complicated to assess the relevancy of the models we are using. Would that be something that could be envisioned?

Could you confirm which metadata you are referring to? In general, depending on its nature, metadata can be considered sensitive (example: evaluation loss) and will need to be constrained by differential privacy and privacy budgets.

  • As pointed out in 1 above, we are not training a single model but dozens on the same batch of data, in addition to other offline experiments to improve existing models. It seems that splitting the privacy budget across all these tasks will significantly impact the performance of our models if global DP is considered. What are you envisioning to guarantee performance in this challenging setting?

This remains an open area, where we expect adtech companies to explore different ways of provisioning their privacy budget across production and experimental model needs.
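To illustrate the tension: under basic sequential composition the epsilons spent on the same data simply add up, so every additional experiment trained on a batch reduces what remains for the production model. The numbers below are placeholders; a tighter accountant (e.g. RDP) changes the arithmetic but not the tradeoff.

```ts
// Illustration of budget splitting under basic sequential composition.
// All numbers are placeholders, not values the API commits to.
const totalEpsilon = 10;              // assumed overall budget for a period
const productionShare = 0.7;          // the buyer's own policy choice
const numExperiments = 6;

const productionEpsilon = totalEpsilon * productionShare;          // 7.0
const perExperimentEpsilon =
  (totalEpsilon * (1 - productionShare)) / numExperiments;         // ~0.5 each
// Lower epsilon => more noise per training run => noisier offline experiments.
```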

  • Given the size of our current training log, we need to train our ML models in a distributed fashion. How will this constraint be tackled under Approach 1, given that distributed training using TEEs is currently not very mature? Would you consider training inside a single trusted server on a batch of reduced size?

Yes, we are thinking early solutions might need to train on single TEE machines to ensure data security and user privacy. Adtech companies might have to balance training data size against training speed.

  • When could you share a technical specification on which we could iterate?

We are actively investigating the above-mentioned settings and their impact on privacy and utility. We will try to share more details soon.

Approach 2 (modelingSignals released to reportWin but with local DP)

  • What type of local DP are you envisioning: full (label + feature) local DP or only label DP?

We are exploring label DP as well as hybrid-DP (sensitive features would be noised, non-sensitive features would not) for local DP.
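For the label-DP option, a standard construction is randomized response on the binary label: flipping it with probability 1/(1+e^ε) satisfies ε-DP for the label while leaving the features untouched. A minimal sketch, with ε as a placeholder rather than a value the API commits to:

```ts
// Binary label DP via randomized response: flip the label with probability
// 1 / (1 + e^epsilon). Features are left untouched (the "label DP" option).
// Epsilon is a placeholder, not a committed value.
function noisyLabel(trueLabel: 0 | 1, epsilon: number): 0 | 1 {
  const flipProbability = 1 / (1 + Math.exp(epsilon));
  return Math.random() < flipProbability ? ((1 - trueLabel) as 0 | 1) : trueLabel;
}
```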

  • If full local DP is considered, we are wondering if
  1. releasing continuous/numerical information with local DP noise (Gaussian, Laplace) will be considered;

Yes, we are exploring ways to release noisy counts, value estimates with local DP noise and understand privacy utility tradeoffs for different approaches.
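As an illustration of a noisy value release, the classic Laplace mechanism clamps the value to known bounds (so its sensitivity is bounded) and adds noise scaled to sensitivity/ε. This is a generic local-DP sketch, not a committed API surface.

```ts
// Laplace mechanism for a bounded numerical value: clamp to [lo, hi] so the
// sensitivity is (hi - lo), then add Laplace noise of scale sensitivity/epsilon.
function laplaceSample(scale: number): number {
  const u = Math.random() - 0.5;
  return -scale * Math.sign(u) * Math.log(1 - 2 * Math.abs(u));
}

function releaseValue(value: number, lo: number, hi: number, epsilon: number): number {
  const clamped = Math.min(hi, Math.max(lo, value));
  const sensitivity = hi - lo;
  return clamped + laplaceSample(sensitivity / epsilon);
}
```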

  2. we could choose how to split the privacy budget between the different dimensions?

I think this can be considered as long as the budget split does not regress the privacy of the current modelingSignals API. We could also model this after the Flexible Event API, which similarly allows users to modify the reports they receive in order to minimize noise while ensuring DP.
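A small sketch of what such a budget split could look like under basic composition, where the per-dimension epsilons just need to sum to the overall budget; the weights and total budget below are placeholders.

```ts
// Splitting a fixed local-DP budget across dimensions under basic composition:
// per-dimension epsilons sum to the overall budget, so a buyer could spend
// more of it on the dimensions that matter most for the model.
function splitBudget(totalEpsilon: number, weights: Record<string, number>) {
  const sum = Object.values(weights).reduce((a, b) => a + b, 0);
  return Object.fromEntries(
    Object.entries(weights).map(([dim, w]) => [dim, totalEpsilon * (w / sum)] as [string, number]),
  );
}

// e.g. put most of the budget on the label, less on coarse user features:
const perDimensionEpsilon = splitBudget(4, { label: 2, recencyBucket: 1, segment: 1 });
```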

  • On top of that constraint, would additional obfuscation strategies be considered (e.g., x-bit encoding for user features)?

With full local DP (or hybrid DP) on features, additional obfuscation strategies might not be necessary.

  • When could you share a technical specification on which we could iterate?

We are actively exploring the mechanisms and trying to understand the impact on privacy and utility. We will try to share more details soon.