Closed barteklos closed 2 years ago
We would like to follow up on this issue and the discussion we had on the same topic.
As suggested:
We have added some additional functional and performance tests to our framework. In particular, we have run benchmarks with V8 which execute generateBid() and compare performance with and without webassembly:
no. | test environment | code run as | time spent on generateBid() |
---|---|---|---|
3 | V8 engine without wasm | buyer’s js | 54.12 ms |
4 | V8 engine with wasm | buyer’s js with wasm binary hardcoded | 4.93 ms |
The performance seems to be good enough in the mentioned scenario and we believe that similar results could be achieved in a bidding worklet.
In the benchmark with webassembly we hardcoded the wasm binary and instantiated it in generateBid(), which means that it could be improved by:
which would save an additional 1.35 ms in that case.
All in all, is it an option to provide a bidding worklet implementation with support for webassembly? If so, is it an option to provide some API extensions to achieve (A) and (B)?
We can see some additional benefits of such support (better performance on inefficient hardware, reduced time spent on script initialization and model-weight parsing, additional obfuscation, potentially easier migration of our current code, availability of SIMD operations, etc.).
For the record, we have provided another patch which turns on webassembly in Chromium, so we were able to run a similar benchmark in a bidding worklet:
no. | test environment | code run as | time spent on generateBid() |
---|---|---|---|
5 | bidding worklet (with wasm support) | buyer’s js with wasm binary hardcoded | 6.07 ms |
In this benchmark, the bidding worklet spends time on:
Just in case you are not aware of this, we would like to share our findings:
Our first attempt to run a benchmark with webassembly in Chromium was not successful. There was a significant difference between V8 and the bidding worklet, mainly in the wasm case, and our test case took over 26 ms. This was because we were compiling Chromium with default flags, which add debug asserts. The solution was to build Chromium with dcheck_always_on=false. The official raw build of Chromium seems to have the same overhead, but the Chrome release uses is_official_build=true, which also turns off these debug asserts.
Chrome supports caching compiled webassembly modules and we were wondering if a similar mechanism could be used for the bidding worklet (reference: this blog post). It requires using the WebAssembly.compile and WebAssembly.instantiate APIs (which are async) and storing compiled wasm modules in a DB. Do you have a similar approach in mind?
Right, AuctionV8Helper::RunScript runs:
I have edited a previous comment to avoid confusion.
AuctionV8Helper::Compile takes 292.59 ms in this scenario. I did not take this into account mainly because the bidding worklet’s timeout does not include the time spent loading and compiling js. However, I must admit that in the case of a huge js script (with model weights or a wasm binary hardcoded) it could have some impact on overall performance, especially since AuctionV8Helper::Compile is called twice per auction, in the contexts of generateBid and reportWin. The script itself could potentially be cached by the network layer, but the compiled js would not be cached in the current implementation.
A table below shows adjusted results for benchmark 2 and benchmark 5 (run with a new Chromium build without debug asserts):
no. | time spent on AuctionV8Helper::Compile (not included in timeout) | time spent on AuctionV8Helper::RunScript (included in timeout)
---|---|---
2 | 87.70 ms | 22.41 ms |
5 | 292.59 ms | 6.07 ms |
Hi,
We have started experimentation with the current FLEDGE implementation in Chromium. As part of this, we have provided end-to-end functional and performance tests.
For this issue we would like to discuss the bidding worklet's performance limitations in the context of potential bidding logic. To give an example, our production `generateBid()` implementation could evaluate a feed-forward neural network with 3-4 layers (repeated for 5 different ML models), where `extractFeatures()` extracts a vector of 200 features (from signals and the interest group’s data) and `nn_forward()` performs the forward pass. This is an extremely simplified version of `generateBid()` which focuses on multiplying the input values by the hard-coded model weights. We can expect a lot of additional boilerplate code (choosing the best ad, model feature extraction, capping & targeting logic, brand safety etc.) around this, but even such a simple example is enough to illustrate the performance limitations of the current implementation.

We have results from benchmarks for two different environments running the same `generateBid()` function:
In conclusion, we can see a significant performance drop (almost 50x) for a bidding worklet compared to an optimal environment. What is more, we can easily exceed the worklet’s timeout (which is 50 ms) for the mentioned use case.
Do you have any thoughts on how to optimize `generateBid()` code in such an execution environment? Are there any plans to provide a more efficient bidding worklet?

Best regards,
Bartosz