Bidding worklet performance limitations

Hi,

We have started experimentation with the current FLEDGE implementation in Chromium. As part of this, we have provided end-to-end functional and performance tests.

In this issue we would like to discuss the bidding worklet's performance limitations in the context of potential bidding logic. To give an example, our production generateBid() implementation could evaluate a feed-forward neural network with 3-4 layers, repeated for 5 different ML models. It would look like this:

function generateBid(interestGroup, auctionSignals, perBuyerSignals, trustedBiddingSignals, browserSignals) {

   const nn_model_1_weights = [
       [[1.23, 3.14, 2.7...], [100.1, 100.2,...], ...], // 200x200 matrix
       [...], // 200x100 matrix
       [...], // 100x50 matrix
       [...], // 50x1 matrix
   ]; // hard-coded weights for the 1st model (eg. CTR, CR, CV)

   const nn_model_2_weights = [...]; // hard-coded weights for the 2nd model

   const nn_model_3_weights = [...]; // hard-coded weights for the 3rd model

   const nn_model_4_weights = [...]; // hard-coded weights for the 4th model

   const nn_model_5_weights = [...]; // hard-coded weights for the 5th model

   let input = extractFeatures(interestGroup, auctionSignals, perBuyerSignals, trustedBiddingSignals,
                               browserSignals); // vector of 200 floats

   let bid = nn_forward(input, nn_model_1_weights) * nn_forward(input, nn_model_2_weights)
                * nn_forward(input, nn_model_3_weights) * nn_forward(input, nn_model_4_weights)
                * nn_forward(input, nn_model_5_weights);

   let ad = ... 

   let renderUrl = ...

   return {'ad': ad, 'bid': bid, 'render': renderUrl};
}

where extractFeatures() extracts a vector of 200 features (from the signals and the interest group’s data) and nn_forward() is:

function nn_forward(input, nn_model_weights) {
    let X = input; // vector of 200 floats
    X = relu(multiply(nn_model_weights[0], X)); // nn_model_weights[0] - 200x200 matrix
    X = relu(multiply(nn_model_weights[1], X)); // nn_model_weights[1] - 200x100 matrix
    X = relu(multiply(nn_model_weights[2], X)); // nn_model_weights[2] - 100x50 matrix
    X = relu(multiply(nn_model_weights[3], X)); // nn_model_weights[3] - 50x1 matrix
    return X[0];
}

This is an extremely simplified version of generateBid() that focuses on multiplying the input values by the hard-coded model weights. We can expect a lot of additional boilerplate code around it (choosing the best ad, model feature extraction, capping and targeting logic, brand safety etc.), but even such a simple example is enough to illustrate the performance limitations of the current implementation.
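The snippets above call relu() and multiply() without showing them. The following is a minimal sketch of how they might look, not our production code: the helper bodies, the camel-case names and the toy weights are assumptions made for illustration. Weights are assumed to be stored as (inputs x outputs) arrays so that they match the dimensions quoted in the comments. The last lines count the multiply-add operations implied by the quoted layer shapes, which gives a rough idea of the work done per bid.

```javascript
// Hypothetical helpers: the snippets above use relu() and multiply() without defining them.
function relu(vector) {
  return vector.map((value) => Math.max(0, value));
}

// Assumes W is an array of rows with shape (inputs x outputs), so that a
// "200x100" matrix maps 200 inputs to 100 outputs. Computes x^T * W.
function multiply(W, x) {
  const out = new Array(W[0].length).fill(0);
  for (let i = 0; i < W.length; i++) {
    for (let j = 0; j < W[i].length; j++) {
      out[j] += x[i] * W[i][j];
    }
  }
  return out;
}

// Same structure as nn_forward() above, written as a loop over the layers.
function nnForward(input, layers) {
  let x = input;
  for (const W of layers) {
    x = relu(multiply(W, x));
  }
  return x[0];
}

// Toy 2 -> 2 -> 1 network, for illustration only.
const toyLayers = [
  [[1, -1], [0.5, 2]],
  [[2], [1]],
];
const toyOutput = nnForward([1, 2], toyLayers); // 7

// Multiply-add count for the layer shapes quoted in the post.
const layerShapes = [[200, 200], [200, 100], [100, 50], [50, 1]];
const macsPerModel = layerShapes.reduce((sum, [rows, cols]) => sum + rows * cols, 0); // 65050
const macsPerBid = 5 * macsPerModel; // 325250 for the 5 models
```

So one generateBid() call in this example performs roughly 325 thousand multiply-adds, before any feature extraction or ad selection logic.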

We have results from benchmarks for two different environments running the same generateBid() function:

no.	test environment	code run as	time spent on `generateBid()`
1	V8 engine with jit	tight loop with a warm-up	1.12 ms
2	bidding worklet (with its limitations: jitless etc.)	buyer’s js	55.68 ms

In conclusion, the bidding worklet is almost 50x slower than the optimal environment. What is more, at 55.68 ms this use case already exceeds the worklet’s timeout (which is 50 ms).
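For readers who want to reproduce a measurement of the first kind (a tight loop with a warm-up), a timing harness could look roughly like the sketch below. It is an illustration only, not the benchmark code behind the table: the function name, the run counts and the injectable clock are all assumptions.

```javascript
// Hypothetical timing harness; not the benchmark code used for the table above.
// `now` defaults to performance.now(), available in browsers and recent Node.js.
function benchmark(fn, { warmupRuns = 100, timedRuns = 1000, now = () => performance.now() } = {}) {
  // Warm-up iterations give a JIT-enabled engine a chance to optimize fn.
  for (let i = 0; i < warmupRuns; i++) fn();

  const start = now();
  for (let i = 0; i < timedRuns; i++) fn();
  const elapsed = now() - start;

  return elapsed / timedRuns; // mean milliseconds per call
}

// Slowdown implied by the two rows of the table above.
const slowdown = 55.68 / 1.12; // about 49.7
```

A call such as benchmark(() => generateBid(...)) would then return the mean time per call in milliseconds. Note that a warm-up only helps when the engine can JIT-compile, which is exactly what the jitless worklet cannot do.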

Do you have any thoughts on how to optimize generateBid() code in such an execution environment? Are there any plans to provide a more efficient bidding worklet?

Best regards, Bartosz

We would like to follow up on this issue and the discussion we had on the same topic.

As suggested:

we put our patch into the Chromium repository,
we replaced Node.js with V8 in our benchmark 1 to make it suitable for comparison.

We have added some additional functional and performance tests to our framework. In particular, we have run V8 benchmarks that execute generateBid() with and without WebAssembly:

no.	test environment	code run as	time spent on `generateBid()`
3	V8 engine without wasm	buyer’s js	54.12 ms
4	V8 engine with wasm	buyer’s js with wasm binary hardcoded	4.93 ms

The performance seems to be good enough in the mentioned scenario and we believe that similar results could be achieved in a bidding worklet.

In the WebAssembly benchmark we hardcoded the wasm binary and instantiated it in generateBid(), which means that it could be improved by:

preloading binary (wasm) resources (A),
caching compiled webassembly modules (B)

which would save an additional 1.35 ms in that case.
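To make the setup concrete, here is a minimal sketch of a hardcoded wasm binary being compiled and instantiated synchronously. The binary is a stand-in: a tiny module that exports add(i32, i32) rather than a neural network. The names addWasmBytes, addModule and callWasmAdd are hypothetical, and compiling at script scope is only an assumption about where such code could live; whether a worklet would allow WebAssembly or keep a compiled module around between calls is exactly the open question here.

```javascript
// Stand-in binary: a minimal module exporting add(i32, i32) -> i32.
// A real bidder would embed its own compiled model code instead.
const addWasmBytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,             // magic, version
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f,       // type: (i32, i32) -> i32
  0x03, 0x02, 0x01, 0x00,                                     // one function of type 0
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00,       // export "add" = func 0
  0x0a, 0x09, 0x01, 0x07, 0x00, 0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b, // body: local0 + local1
]);

// Compiled once when the script is evaluated (hypothetical placement).
const addModule = new WebAssembly.Module(addWasmBytes);

// Instantiates the already compiled module and calls its export.
function callWasmAdd(a, b) {
  const instance = new WebAssembly.Instance(addModule, {});
  return instance.exports.add(a, b);
}
```

Keeping the compile step outside the per-call function is one way to think about (B): only instantiation and the call itself would remain on the timed path.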

All in all, would it be possible to provide a bidding worklet implementation with support for WebAssembly? If so, could some API extensions be provided to achieve (A) and (B)?

We can see some additional benefits of such support (better performance on slower hardware, faster script initialization and model weight parsing, additional obfuscation, potentially easier migration of the current code, availability of SIMD operations etc.).

For the record, we have provided another patch which turns on WebAssembly in Chromium, so we were able to run a similar benchmark in a bidding worklet:

no.	test environment	code run as	time spent on `generateBid()`
5	bidding worklet (with wasm support)	buyer’s js with wasm binary hardcoded	6.07 ms

In this benchmark, the bidding worklet spends time on:

parsing JS: 3.81 ms,
calling JS: 2.25 ms (which includes compiling wasm: 2.04 ms and calling it: 0.21 ms).

Just in case you are not aware of this, we would like to share our findings:

Our first attempt to run a benchmark with WebAssembly in Chromium was not successful. There was a significant difference between V8 and the bidding worklet, mainly in the case of wasm, and our test case took over 26 ms. The cause was that we were compiling Chromium with default flags, which enable debug asserts. The solution was to build Chromium with dcheck_always_on=false. The official raw build of Chromium seems to have the same overhead, but Chrome releases use is_official_build=true, which also turns off these debug asserts.
Chrome supports caching compiled WebAssembly modules, and we were wondering whether a similar mechanism could be used for the bidding worklet (reference: this blog post). It requires using the WebAssembly.compile and WebAssembly.instantiate APIs (which are async) and storing compiled wasm modules in a DB. Do you have a similar approach in mind?
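As a rough illustration of the async APIs mentioned in the second point, the sketch below compiles a module with WebAssembly.compile and reuses the result for later instantiations. It is only an approximation of the idea: a plain in-memory Map stands in for persistent storage, the cache key is a caller-chosen string, and the names moduleCache, compileCached and callAdd are hypothetical. The binary is a stand-in module exporting add(i32, i32).

```javascript
// Stand-in binary: a minimal module exporting add(i32, i32) -> i32.
const cachedAddBytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f,
  0x03, 0x02, 0x01, 0x00,
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00,
  0x0a, 0x09, 0x01, 0x07, 0x00, 0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b,
]);

// Hypothetical in-memory cache; persistent storage is not modeled here.
const moduleCache = new Map();

// `key` is an assumed caller-chosen identifier, e.g. a script URL plus version.
function compileCached(key, bytes) {
  if (!moduleCache.has(key)) {
    // Store the promise so that concurrent callers share one compilation.
    moduleCache.set(key, WebAssembly.compile(bytes));
  }
  return moduleCache.get(key);
}

// Instantiating from an already compiled Module skips the compile step.
async function callAdd(key, bytes, a, b) {
  const module = await compileCached(key, bytes);
  const instance = await WebAssembly.instantiate(module, {});
  return instance.exports.add(a, b);
}
```

With this shape, only the first call for a given key pays the compile cost; later calls pay for instantiation and the call itself.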

Right, AuctionV8Helper::RunScript runs:

v8::Script::Run (which is js initialization, not js parsing!) and it takes 3.81 ms,
v8::Function::Call (which is js call) and it takes 2.25 ms.

I have edited a previous comment to avoid confusion.

AuctionV8Helper::Compile takes 292.59 ms in this scenario. I did not take this into account mainly because the bidding worklet’s timeout does not include the time spent loading and compiling JS. However, I must admit that for a huge JS script (with model weights or a wasm binary hardcoded) it could have some impact on overall performance, especially since AuctionV8Helper::Compile is called twice, in the context of generateBid and reportWin, for every auction. The script itself could potentially be cached by the network layer, but compiled JS would not be cached in the current implementation.

The table below shows adjusted results for benchmark 2 and benchmark 5 (run with a new Chromium build without debug asserts):

no.	time spent on `AuctionV8Helper::Compile` (not included in timeout)	time spent on `AuctionV8Helper::RunScript` (included in timeout)
2	87.70 ms	22.41 ms
5	292.59 ms	6.07 ms
