barteklos / turtledove

Bidding worklet performance limitations #1

Status: Closed (closed by barteklos 2 years ago)

barteklos commented 3 years ago

Hi,

We have started experimenting with the current FLEDGE implementation in Chromium. As part of this, we have provided end-to-end functional and performance tests.

In this issue we would like to discuss the bidding worklet's performance limitations in the context of potential bidding logic. To give an example, our production generateBid() implementation could evaluate a feed-forward neural network with 3-4 layers (repeated for 5 different ML models), in which case it would look like this:

function generateBid(interestGroup, auctionSignals, perBuyerSignals, trustedBiddingSignals, browserSignals) {

   // Hard-coded weights for the 1st model (e.g. CTR, CR, CV), one matrix per layer.
   const nn_model_1_weights = [
       [[1.23, 3.14, 2.7...], [100.1, 100.2,...], ...], // 200x200 matrix
       [...], // 200x100 matrix
       [...], // 100x50 matrix
       [...], // 50x1 matrix
   ];

   const nn_model_2_weights = [...]; // hard-coded weights for the 2nd model

   const nn_model_3_weights = [...]; // hard-coded weights for the 3rd model

   const nn_model_4_weights = [...]; // hard-coded weights for the 4th model

   const nn_model_5_weights = [...]; // hard-coded weights for the 5th model

   // Build the model input from the signals and the interest group's data.
   let input = extractFeatures(interestGroup, auctionSignals, perBuyerSignals,
                               trustedBiddingSignals, browserSignals); // vector of 200 floats

   // The bid is the product of the five model outputs.
   let bid = nn_forward(input, nn_model_1_weights) * nn_forward(input, nn_model_2_weights)
                * nn_forward(input, nn_model_3_weights) * nn_forward(input, nn_model_4_weights)
                * nn_forward(input, nn_model_5_weights);

   let ad = ...

   let renderUrl = ...

   return {'ad': ad, 'bid': bid, 'render': renderUrl};
}

where extractFeatures() extracts a vector of 200 features (from the signals and the interest group’s data) and nn_forward() is:

function nn_forward(input, nn_model_weights) {
    // Each layer is a matrix-vector product followed by a ReLU activation.
    let X = input; // vector of 200 floats
    X = relu(multiply(nn_model_weights[0], X)); // nn_model_weights[0] - 200x200 matrix
    X = relu(multiply(nn_model_weights[1], X)); // nn_model_weights[1] - 200x100 matrix
    X = relu(multiply(nn_model_weights[2], X)); // nn_model_weights[2] - 100x50 matrix
    X = relu(multiply(nn_model_weights[3], X)); // nn_model_weights[3] - 50x1 matrix
    return X[0]; // the last layer has a single output
}

This is an extremely simplified version of generateBid() and focuses on multiplying the input values by the hard-coded model weights. We can expect a lot of additional boilerplate code (choosing the best ad, model feature extraction, capping & targeting logic, brand safety etc.) around this but even such a simple example is enough to illustrate performance limitations for the current implementation.
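The post does not show relu() or multiply(). As a rough sketch only, they might look like the following. The storage layout (each weight matrix as an array of rows, one row per output unit) is an assumption, as are the names nnForward and toyWeights; the toy network is far smaller than the 200-input models discussed here.

```javascript
// Hypothetical helpers; the original post does not include them.

// relu: element-wise max(0, x).
function relu(v) {
  return v.map((x) => Math.max(0, x));
}

// multiply: matrix-vector product.
// ASSUMPTION: W is an array of rows, one row per output unit,
// and every row has the same length as the input vector x.
function multiply(W, x) {
  return W.map((row) => {
    let sum = 0;
    for (let i = 0; i < x.length; i++) {
      sum += row[i] * x[i];
    }
    return sum;
  });
}

// Forward pass over any number of layers, ReLU after each one,
// returning the single output of the last layer.
function nnForward(input, layers) {
  let x = input;
  for (const W of layers) {
    x = relu(multiply(W, x));
  }
  return x[0];
}

// Toy 2 -> 2 -> 1 network, used only to show the call shape.
const toyWeights = [
  [[1, -1], [0.5, 0.5]], // layer 1: two outputs from two inputs
  [[2, 1]],              // layer 2: one output from two inputs
];
const toyOutput = nnForward([3, 1], toyWeights);
console.log(toyOutput);
```

Plain nested arrays keep the sketch close to the hard-coded weights above; a flat Float64Array per matrix would be another option, but the post does not say which layout was benchmarked.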

We have results from benchmarks for two different environments running the same generateBid() function:

| no. | test environment | code run as | time spent on generateBid() |
| --- | --- | --- | --- |
| 1 | V8 engine with jit | tight loop with a warm-up | 1.12 ms |
| 2 | bidding worklet (with its limitations: jitless etc.) | buyer’s js | 55.68 ms |

In conclusion, we can see a significant performance drop (almost 50x) for a bidding worklet compared to an optimal environment. What is more, we can easily exceed the worklet’s timeout (which is 50 ms) for the mentioned use case.
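For readers unfamiliar with the "tight loop with a warm-up" setup in benchmark 1, here is a minimal sketch of such a harness. It is not the harness used for the numbers above: the benchmark() function, the run counts, and the stand-in workload (one 200x200 matrix-vector product) are all assumptions chosen for illustration.

```javascript
// Use performance.now() where available, otherwise fall back to Date.now().
const now =
  typeof performance !== "undefined" && typeof performance.now === "function"
    ? () => performance.now()
    : () => Date.now();

// Hypothetical harness: run fn a few times untimed so a JIT can optimise it,
// then time a tight loop and report the mean time per call in milliseconds.
function benchmark(fn, warmupRuns, timedRuns) {
  for (let i = 0; i < warmupRuns; i++) fn();
  const start = now();
  for (let i = 0; i < timedRuns; i++) fn();
  return (now() - start) / timedRuns;
}

// Stand-in workload (an assumption, not the real generateBid()):
// a single 200x200 matrix-vector product with ReLU.
const n = 200;
const W = Array.from({ length: n }, (_, r) =>
  Array.from({ length: n }, (_, c) => ((r + c) % 7) / 7)
);
const x = Array.from({ length: n }, (_, i) => i / n);

function workload() {
  const out = new Array(n);
  for (let r = 0; r < n; r++) {
    let sum = 0;
    for (let c = 0; c < n; c++) sum += W[r][c] * x[c];
    out[r] = Math.max(0, sum);
  }
  return out;
}

const meanMs = benchmark(workload, 100, 1000);
console.log("mean time per call (ms):", meanMs);
```

A worklet invocation gets no such warm-up and, per the table, runs jitless, which is why the two rows are not directly comparable environments.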

Do you have any thoughts on how to optimize generateBid() code in such an execution environment? Are there any plans to provide a more effective bidding worklet?

Best regards, Bartosz

barteklos commented 3 years ago

We would like to follow up on this issue and on the discussion we had on the same topic.

As suggested:

We have added some additional functional and performance tests to our framework. In particular, we have benchmarked generateBid() in V8 with and without WebAssembly:

| no. | test environment | code run as | time spent on generateBid() |
| --- | --- | --- | --- |
| 3 | V8 engine without wasm | buyer’s js | 54.12 ms |
| 4 | V8 engine with wasm | buyer’s js with wasm binary hardcoded | 4.93 ms |

The performance seems to be good enough in the mentioned scenario and we believe that similar results could be achieved in a bidding worklet.

In the WebAssembly benchmark we hardcoded the wasm binary and instantiated it in generateBid(), which means that it could be improved by:

which would save an additional 1.35 ms in that case.

All in all, is it an option to provide a bidding worklet implementation with support for WebAssembly? If so, is it an option to provide some API extensions to achieve (A) and (B)?

We can see some additional benefits of such support (better performance on slow hardware, less time spent on script initialization and on parsing model weights, additional obfuscation, potentially easier migration of the current code, availability of SIMD operations etc.).
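To make "wasm binary hardcoded and instantiated in generateBid()" concrete, here is a minimal sketch using the standard WebAssembly JavaScript API. The module is a hand-assembled toy that exports a single add(i32, i32) function, not the benchmarked model, and generateBidSketch is a hypothetical name. Whether a bidding worklet exposes the WebAssembly global at all is exactly the open question in this thread, so this only shows the shape of the approach in a plain V8 environment.

```javascript
// Hard-coded wasm binary (toy module, an assumption for illustration):
// exports add(a: i32, b: i32) -> i32.
const wasmBytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d,                               // magic "\0asm"
  0x01, 0x00, 0x00, 0x00,                               // version 1
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f, // type: (i32, i32) -> i32
  0x03, 0x02, 0x01, 0x00,                               // one function of type 0
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00, // export "add" = func 0
  0x0a, 0x09, 0x01, 0x07, 0x00,                         // code section, one body
  0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b,                   // local.get 0, local.get 1, i32.add, end
]);

// Hypothetical stand-in for generateBid(): compiles and instantiates the
// hard-coded binary synchronously on every call, then calls into it.
function generateBidSketch() {
  const module = new WebAssembly.Module(wasmBytes);
  const instance = new WebAssembly.Instance(module, {});
  return instance.exports.add(2, 3);
}

const sketchResult = generateBidSketch();
console.log(sketchResult);
```

In this shape the compile and instantiate steps are paid inside every call, so they count towards the time measured for generateBid().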

barteklos commented 2 years ago

For the record, we have provided another patch which enables WebAssembly in Chromium, so we were able to run a similar benchmark in a bidding worklet:

| no. | test environment | code run as | time spent on generateBid() |
| --- | --- | --- | --- |
| 5 | bidding worklet (with wasm support) | buyer’s js with wasm binary hardcoded | 6.07 ms |

In this benchmark, the bidding worklet spends time on:

Just in case you are not aware of this, we would like to share our findings:

barteklos commented 2 years ago

Right, AuctionV8Helper::RunScript runs:

I have edited a previous comment to avoid confusion.

AuctionV8Helper::Compile takes 292.59 ms in this scenario. I did not take this into account mainly because the bidding worklet’s timeout does not include the time spent loading and compiling the JS. However, I must admit that for a huge JS script (with model weights or a wasm binary hardcoded) it could have some impact on overall performance, especially since AuctionV8Helper::Compile is called twice for every auction, once in the context of generateBid and once in the context of reportWin. The script itself could potentially be cached by the network layer, but the compiled JS would not be cached in the current implementation.

The table below shows adjusted results for benchmark 2 and benchmark 5 (run with a new Chromium build without debug asserts):

| no. | time spent on AuctionV8Helper::Compile (not included in timeout) | time spent on AuctionV8Helper::RunScript (included in timeout) |
| --- | --- | --- |
| 2 | 87.70 ms | 22.41 ms |
| 5 | 292.59 ms | 6.07 ms |
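As a quick sanity check on the trade-off in this table, the sketch below adds up the two columns for each benchmark and compares them. It uses only the figures reported above; the variable names are illustrative, and summing one Compile with one RunScript is a simplification, since Compile is said to run twice per auction.

```javascript
// Figures copied from the table above (milliseconds).
const results = [
  { benchmark: 2, compileMs: 87.70, runMs: 22.41 },  // plain JS in the worklet
  { benchmark: 5, compileMs: 292.59, runMs: 6.07 },  // JS with hard-coded wasm
];

// Compile + RunScript per benchmark, rounded to two decimals.
const summary = results.map((r) => ({
  benchmark: r.benchmark,
  totalMs: Number((r.compileMs + r.runMs).toFixed(2)),
}));

// How much faster RunScript gets with wasm, and how much slower Compile gets.
const runSpeedup = results[0].runMs / results[1].runMs;
const compileSlowdown = results[1].compileMs / results[0].compileMs;

console.log(summary); // totals: 110.11 ms (benchmark 2), 298.66 ms (benchmark 5)
console.log(runSpeedup.toFixed(2), compileSlowdown.toFixed(2));
```

On these numbers the wasm variant wins clearly on the part covered by the timeout but loses on the part that is not, which is why caching of the compiled script matters.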