kentosugama commented 1 year ago

I think it would be good to merge this so that we can measure performance improvements beyond wasm-opt and not reimplement optimizations already included in the optimizer.

Note that these benchmarks use ic-wasm directly instead of the optimize: "cycles" feature in dfx, in order to preserve the Wasm name sections for the flame graphs. For any users reading this: in the general case we recommend running the optimizer through dfx instead, as the binary size reductions are better when the name sections are dropped.

For future reference: https://github.com/dfinity/sdk/pull/3090. See also #50 for previous discussions.

github-actions[bot] commented 1 year ago

Note: Diffing the performance results against the published results from the main branch. Unchanged benchmarks are omitted.
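To read the percentages in the tables of this comment, the following minimal sketch assumes they are relative changes against the main-branch value, i.e. `(current - baseline) / baseline`. That formula is an assumption, not something the bot states, and both function names are hypothetical. Under that assumption, the main-branch value can be estimated from a reported value and its percentage.

```python
def percent_change(baseline: float, current: float) -> float:
    """Relative change of `current` against `baseline`, in percent (assumed formula)."""
    return (current - baseline) / baseline * 100.0


def implied_baseline(current: float, change_pct: float) -> float:
    """Invert percent_change: estimate the baseline from a value and its change."""
    return current / (1.0 + change_pct / 100.0)


# hashmap binary_size from the Map table: 169_982 at -13.37%
estimate = implied_baseline(169_982, -13.37)
print(f"estimated main-branch binary_size: {estimate:,.0f} bytes")
```

Because the reported percentages are rounded to two decimals, the estimate is only approximate.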

Map

	binary_size	generate 50k	max mem	batch_get 50	batch_put 50	batch_remove 50
hashmap	169_982 ($\textcolor{green}{-13.37\%}$)	2_097_113_506 ($\textcolor{green}{-12.15\%}$)	9_102_052	1_115_399 ($\textcolor{green}{-13.74\%}$)	609_254_124 ($\textcolor{green}{-11.61\%}$)	1_056_869 ($\textcolor{green}{-13.70\%}$)
triemap	174_030 ($\textcolor{green}{-13.76\%}$)	2_020_134_416 ($\textcolor{green}{-11.65\%}$)	9_715_900	773_637 ($\textcolor{green}{-13.26\%}$)	1_853_794 ($\textcolor{green}{-12.21\%}$)	1_033_460 ($\textcolor{green}{-12.98\%}$)
rbtree	171_127 ($\textcolor{green}{-13.99\%}$)	1_797_995_532 ($\textcolor{green}{-11.20\%}$)	8_902_160	670_401 ($\textcolor{green}{-14.90\%}$)	1_623_975 ($\textcolor{green}{-11.70\%}$)	859_340 ($\textcolor{green}{-13.34\%}$)
splay	170_477 ($\textcolor{green}{-13.84\%}$)	2_040_395_523 ($\textcolor{green}{-11.50\%}$)	8_702_096	1_102_393 ($\textcolor{green}{-12.39\%}$)	1_915_542 ($\textcolor{green}{-11.93\%}$)	1_103_332 ($\textcolor{green}{-12.42\%}$)
btree	198_636 ($\textcolor{green}{-15.60\%}$)	1_875_401_612 ($\textcolor{green}{-11.63\%}$)	7_556_172	813_525 ($\textcolor{green}{-13.14\%}$)	1_718_273 ($\textcolor{green}{-12.11\%}$)	862_047 ($\textcolor{green}{-13.07\%}$)
zhenya_hashmap	165_325 ($\textcolor{green}{-13.20\%}$)	1_642_423_605 ($\textcolor{green}{-11.77\%}$)	9_301_800	647_832 ($\textcolor{green}{-13.50\%}$)	1_447_024 ($\textcolor{green}{-12.52\%}$)	652_030 ($\textcolor{green}{-13.63\%}$)
btreemap_rs	438_979 ($\textcolor{green}{-14.72\%}$)	112_676_543 ($\textcolor{green}{-2.86\%}$)	1_638_400	59_465 ($\textcolor{red}{0.05\%}$)	133_080 ($\textcolor{green}{-3.46\%}$)	60_509 ($\textcolor{green}{-2.08\%}$)
hashmap_rs	428_466 ($\textcolor{green}{-14.78\%}$)	49_363_168 ($\textcolor{green}{-7.45\%}$)	1_835_008	19_572 ($\textcolor{green}{-7.11\%}$)	58_237 ($\textcolor{green}{-8.43\%}$)	20_805 ($\textcolor{green}{-7.47\%}$)

Priority queue

	binary_size	heapify 50k	mem	pop_min 50	put 50
heap	156_998 ($\textcolor{green}{-13.61\%}$)	688_335_838 ($\textcolor{green}{-13.23\%}$)	1_400_024	338_619 ($\textcolor{green}{-12.12\%}$)	711_943 ($\textcolor{green}{-13.47\%}$)
heap_rs	406_219 ($\textcolor{green}{-14.20\%}$)	4_975_528 ($\textcolor{green}{-1.31\%}$)	819_200	48_902 ($\textcolor{green}{-8.15\%}$)	20_578 ($\textcolor{green}{-6.85\%}$)

MoVM

	binary_size	generate 10k	max mem	batch_get 50	batch_put 50	batch_remove 50
hashmap	169_982 ($\textcolor{green}{-13.37\%}$)	419_486_900 ($\textcolor{green}{-12.14\%}$)	1_820_844	1_113_679 ($\textcolor{green}{-13.74\%}$)	122_781_037 ($\textcolor{green}{-11.60\%}$)	1_054_639 ($\textcolor{green}{-13.70\%}$)
hashmap_rs	428_466 ($\textcolor{green}{-14.78\%}$)	10_178_230 ($\textcolor{green}{-7.34\%}$)	950_272	18_903 ($\textcolor{green}{-7.27\%}$)	57_565 ($\textcolor{green}{-8.49\%}$)	19_747 ($\textcolor{green}{-7.61\%}$)
imrc_hashmap_rs	435_292 ($\textcolor{green}{-15.31\%}$)	19_062_328 ($\textcolor{green}{-4.30\%}$)	1_572_864	29_764 ($\textcolor{green}{-5.57\%}$)	113_802 ($\textcolor{green}{-5.33\%}$)	36_791 ($\textcolor{green}{-2.20\%}$)
movm_rs	1_760_914 ($\textcolor{green}{-15.84\%}$)	999_676_261 ($\textcolor{green}{-1.73\%}$)	2_654_208	2_424_874 ($\textcolor{green}{-2.80\%}$)	6_357_705 ($\textcolor{green}{-1.84\%}$)	5_013_896 ($\textcolor{green}{-1.81\%}$)
movm_dynamic_rs	1_943_858 ($\textcolor{green}{-15.31\%}$)	485_763_587 ($\textcolor{green}{-2.12\%}$)	2_129_920	1_909_424 ($\textcolor{green}{-2.18\%}$)	2_642_175 ($\textcolor{green}{-2.49\%}$)	1_907_002 ($\textcolor{green}{-2.21\%}$)

Basic DAO

	binary_size	init	transfer_token	submit_proposal	vote_proposal
Motoko	242_539 ($\textcolor{green}{-16.79\%}$)	41_042 ($\textcolor{green}{-7.78\%}$)	18_026 ($\textcolor{green}{-9.51\%}$)	12_678 ($\textcolor{green}{-10.71\%}$)	14_924 ($\textcolor{green}{-11.16\%}$)
Rust	751_374 ($\textcolor{green}{-20.11\%}$)	500_487 ($\textcolor{green}{-7.56\%}$)	93_345 ($\textcolor{green}{-8.90\%}$)	114_984 ($\textcolor{green}{-8.37\%}$)	124_724 ($\textcolor{green}{-8.98\%}$)

DIP721 NFT

	binary_size	init	mint_token	transfer_token
Motoko	200_814 ($\textcolor{green}{-17.91\%}$)	12_164 ($\textcolor{green}{-9.08\%}$)	22_455 ($\textcolor{green}{-9.01\%}$)	4_747 ($\textcolor{green}{-11.40\%}$)
Rust	801_533 ($\textcolor{green}{-20.30\%}$)	134_675 ($\textcolor{green}{-6.58\%}$)	348_766 ($\textcolor{green}{-7.22\%}$)	86_803 ($\textcolor{green}{-8.39\%}$)

Heartbeat

	binary_size	heartbeat
Motoko	135_630 ($\textcolor{green}{-13.51\%}$)	8_461 ($\textcolor{green}{-5.76\%}$)
Rust	28_624 ($\textcolor{green}{-19.61\%}$)	830 ($\textcolor{green}{-26.35\%}$)

Timer

	binary_size	setTimer	cancelTimer
Motoko	142_158 ($\textcolor{green}{-13.50\%}$)	17_762 ($\textcolor{green}{-8.80\%}$)	1_706 ($\textcolor{green}{-10.54\%}$)
Rust	447_452 ($\textcolor{green}{-14.67\%}$)	49_589 ($\textcolor{green}{-10.09\%}$)	9_514 ($\textcolor{green}{-8.67\%}$)

Garbage Collection

Note: Same as main branch, skipping.

Actor class

	binary size	put new bucket	put existing bucket	get
Map	289_202 ($\textcolor{green}{-12.66\%}$)	748_768 ($\textcolor{green}{-10.18\%}$)	5_609 ($\textcolor{green}{-9.36\%}$)	5_988 ($\textcolor{green}{-8.33\%}$)

Publisher & Subscriber

	pub_binary_size	sub_binary_size	subscribe_caller	subscribe_callee	publish_caller	publish_callee
Motoko	156_672 ($\textcolor{green}{-13.66\%}$)	143_547 ($\textcolor{green}{-13.84\%}$)	15_760 ($\textcolor{green}{-5.31\%}$)	8_489 ($\textcolor{green}{-7.17\%}$)	11_737 ($\textcolor{green}{-6.39\%}$)	3_665 ($\textcolor{green}{-8.40\%}$)
Rust	478_372 ($\textcolor{green}{-14.79\%}$)	527_123 ($\textcolor{green}{-24.33\%}$)	57_647 ($\textcolor{green}{-8.18\%}$)	38_523 ($\textcolor{green}{-9.27\%}$)	81_062 ($\textcolor{green}{-7.86\%}$)	45_691 ($\textcolor{green}{-7.98\%}$)

github-actions[bot] commented 1 year ago

Note: The flamegraph links only work after you merge. Unchanged benchmarks are omitted.

Collection libraries

Measure different collection libraries written in both Motoko and Rust. The library names with _rs suffix are written in Rust; the rest are written in Motoko.

We use the same random number generator with fixed seed to ensure that all collections contain the same elements, and the queries are exactly the same. Below we explain the measurements of each column in the table:

generate 50k. Insert 50k Nat32 integers into the collection. For Motoko collections, this usually triggers the GC; the remaining columns are unlikely to trigger GC.
max mem. For Motoko, reports rts_max_live_size after the generate call; for Rust, reports the Wasm memory page count * 32 KiB.
batch_get 50. Look up 50 elements in the collection.
batch_put 50. Insert 50 elements into the collection.
batch_remove 50. Remove 50 elements from the collection.
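As a sanity check on the max mem description, the sketch below reads the stated 32Kb unit as 32 * 1024 = 32_768 bytes (an assumption) and confirms that the Rust memory figures reported in this comment are whole multiples of it. The helper name `units` is hypothetical; the heap_rs figure comes from the mem column of the Priority queue table.

```python
UNIT_BYTES = 32 * 1024  # assumption: the stated "32Kb" unit means 32 * 1024 bytes

# Rust memory figures copied from the tables in this comment.
rust_mem = {
    "btreemap_rs": 1_638_400,
    "hashmap_rs": 1_835_008,
    "heap_rs": 819_200,
}


def units(mem_bytes: int) -> int:
    """Return how many 32 KiB units make up mem_bytes; fail if it is not a whole number."""
    quotient, remainder = divmod(mem_bytes, UNIT_BYTES)
    if remainder != 0:
        raise ValueError(f"{mem_bytes} is not a multiple of {UNIT_BYTES}")
    return quotient


for name, mem in rust_mem.items():
    print(name, units(mem))
```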

💎 Takeaways

The platform only charges for instruction count. Data structures that exploit caching and locality therefore gain no cost advantage.
There is a limit on the maximum number of cycles per round, so asymptotic behavior doesn't matter much; what matters is performance up to a fixed N. In extreme cases, an O(10000 n log n) algorithm may hit the limit while an O(n^2) algorithm runs just fine.
Amortized algorithms/GC may need to be more eager to avoid hitting the cycle limit in a particular round.
Rust costs more cycles to process complicated Candid data, but it is more efficient at core computations.
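The point about constant factors and a per-round limit can be illustrated with a small sketch. The budget `ROUND_LIMIT` below is a hypothetical number chosen for illustration, not the platform's actual limit, and the two cost functions are idealized models rather than measurements.

```python
import math

ROUND_LIMIT = 5_000_000_000  # hypothetical per-round instruction budget


def cost_nlogn(n: int, constant: int = 10_000) -> float:
    """Idealized cost of an algorithm doing `constant * n * log2(n)` instructions."""
    return constant * n * math.log2(n)


def cost_quadratic(n: int) -> float:
    """Idealized cost of an algorithm doing n^2 instructions."""
    return float(n * n)


def fits(cost: float) -> bool:
    """True if the cost stays within the assumed per-round budget."""
    return cost <= ROUND_LIMIT


for n in (1_000, 10_000, 50_000):
    print(n, "n log n fits:", fits(cost_nlogn(n)), "n^2 fits:", fits(cost_quadratic(n)))
```

Under these assumptions, at n = 50_000 the quadratic model stays within the budget while the model with the large constant does not, even though the latter is asymptotically better.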

Note

The Candid interface of the benchmark is minimal, therefore the serialization cost is negligible in this measurement.

Due to the instrumentation overhead and the cycle limit, we cannot profile computations with large collections. Hopefully, when deterministic time slicing is ready, we can measure performance with a larger memory footprint.

hashmap uses an amortized data structure. When the initial capacity is reached, it has to copy the whole array, so the cost of batch_put 50 is much higher than for the other data structures.

hashmap_rs uses the fxhash crate, which provides the same map as std::collections::HashMap but with a deterministic hasher. This ensures reproducible results.

btree comes from Byron Becker's stable BTreeMap library.

zhenya_hashmap comes from Zhenya Usenko's stable HashMap library.

The MoVM table measures the performance of an experimental implementation of a Motoko interpreter. External developers can ignore this table for now.
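The hashmap note above describes a resize that copies the whole array once capacity is reached. The following is a minimal sketch of that general pattern using a capacity-doubling array; it is not the Motoko HashMap implementation, and the class and attribute names are hypothetical. It records how many elements each put has to copy, which shows why a single insert can be far more expensive than its neighbors.

```python
class GrowableArray:
    """Capacity-doubling array that records the copy work done by the last put."""

    def __init__(self, capacity: int = 4) -> None:
        self._slots = [None] * capacity
        self._size = 0
        self.last_put_copies = 0

    def put(self, value) -> None:
        self.last_put_copies = 0
        if self._size == len(self._slots):
            # Capacity reached: allocate a larger array and copy every element.
            bigger = [None] * (2 * len(self._slots))
            for i in range(self._size):
                bigger[i] = self._slots[i]
                self.last_put_copies += 1
            self._slots = bigger
        self._slots[self._size] = value
        self._size += 1


arr = GrowableArray(capacity=4)
costs = []
for i in range(9):
    arr.put(i)
    costs.append(arr.last_put_copies)
print(costs)  # prints [0, 0, 0, 0, 4, 0, 0, 0, 8]
```

Most puts copy nothing, but the put that triggers a resize pays for every existing element at once, which is the kind of spike that can matter under a per-round limit.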

Map

	binary_size	generate 50k	max mem	batch_get 50	batch_put 50	batch_remove 50
hashmap	169_982	2_097_113_506	9_102_052	1_115_399	609_254_124	1_056_869
triemap	174_030	2_020_134_416	9_715_900	773_637	1_853_794	1_033_460
rbtree	171_127	1_797_995_532	8_902_160	670_401	1_623_975	859_340
splay	170_477	2_040_395_523	8_702_096	1_102_393	1_915_542	1_103_332
btree	198_636	1_875_401_612	7_556_172	813_525	1_718_273	862_047
zhenya_hashmap	165_325	1_642_423_605	9_301_800	647_832	1_447_024	652_030
btreemap_rs	438_979	112_676_543	1_638_400	59_465	133_080	60_509
hashmap_rs	428_466	49_363_168	1_835_008	19_572	58_237	20_805

Priority queue

	binary_size	heapify 50k	mem	pop_min 50	put 50
heap	156_998	688_335_838	1_400_024	338_619	711_943	340_032
heap_rs	406_219	4_975_528	819_200	48_902	20_578	49_090

MoVM

	binary_size	generate 10k	max mem	batch_get 50	batch_put 50	batch_remove 50
hashmap	169_982	419_486_900	1_820_844	1_113_679	122_781_037	1_054_639
hashmap_rs	428_466	10_178_230	950_272	18_903	57_565	19_747
imrc_hashmap_rs	435_292	19_062_328	1_572_864	29_764	113_802	36_791
movm_rs	1_760_914	999_676_261	2_654_208	2_424_874	6_357_705	5_013_896
movm_dynamic_rs	1_943_858	485_763_587	2_129_920	1_909_424	2_642_175	1_907_002

Sample Dapps

Measure the performance of some typical dapps:

Basic DAO, with heartbeat disabled to make profiling easier. We have a separate benchmark to measure heartbeat performance.
DIP721 NFT

Note

The cost difference is mainly due to the Candid serialization cost.

Motoko statically compiles/specializes the serialization code for each method, whereas in Rust, we use serde to dynamically deserialize data based on data on the wire.

We could improve the performance on the Rust side by using parser combinators. But it is a challenge to maintain the ergonomics provided by serde.

For real-world applications, we tend to send small data for each endpoint, which makes the Candid overhead in Rust tolerable.
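To make the cost difference concrete, the sketch below computes the Rust-to-Motoko cost ratio per method from the Basic DAO figures reported in this comment. The function name is hypothetical, and the ratios only describe these particular measurements.

```python
# (Motoko, Rust) costs per method, copied from the Basic DAO table in this comment.
basic_dao = {
    "init": (41_042, 500_487),
    "transfer_token": (18_026, 93_345),
    "submit_proposal": (12_678, 114_984),
    "vote_proposal": (14_924, 124_724),
}


def rust_to_motoko_ratio(motoko_cost: int, rust_cost: int) -> float:
    """How many times more the Rust canister costs than the Motoko one."""
    return rust_cost / motoko_cost


for method, (motoko_cost, rust_cost) in basic_dao.items():
    ratio = rust_to_motoko_ratio(motoko_cost, rust_cost)
    print(f"{method}: {ratio:.1f}x")
```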

Basic DAO

	binary_size	init	transfer_token	submit_proposal	vote_proposal
Motoko	242_539	41_042	18_026	12_678	14_924
Rust	751_374	500_487	93_345	114_984	124_724

DIP721 NFT

	binary_size	init	mint_token	transfer_token
Motoko	200_814	12_164	22_455	4_747
Rust	801_533	134_675	348_766	86_803

Heartbeat / Timer

Measure the cost of an empty heartbeat and an empty timer job.

setTimer measures both the setTimer(0) method and the execution of the empty job.
It is not easy to reliably capture both events in one flamegraph, as implementation details of the replica can affect how we measure this. Typically, a correct flamegraph contains both the setTimer and canister_global_timer functions. If either is missing, we may need to adjust the script.

Heartbeat

	binary_size	heartbeat
Motoko	135_630	8_461
Rust	28_624	830

Timer

	binary_size	setTimer	cancelTimer
Motoko	142_158	17_762	1_706
Rust	447_452	49_589	9_514

Motoko Specific Benchmarks

Measure various features only available in Motoko.

Garbage Collection. Measure Motoko garbage collection cost using the Triemap benchmark. The max mem column reports rts_max_live_size after the generate call. The cycle cost numbers reported here are garbage collection cost only. Some flamegraphs are truncated due to the 2M log size limit. The dfx/ic-wasm optimizer is disabled for the garbage collection test cases because of how the optimizer affects function names, which makes profiling trickier.
- default. Compile with the default GC option. With the current GC scheduler, generate will trigger the copying GC. The rest of the methods will not trigger GC.
- copying. Compile with --force-gc --copying-gc.
- compacting. Compile with --force-gc --compacting-gc.
- generational. Compile with --force-gc --generational-gc.
Actor class. Measure the cost of spawning an actor class, using the Actor classes example.

Garbage Collection

	generate 80k	max mem	batch_get 50	batch_put 50	batch_remove 50
default	247_113_104	15_539_816	50	50	50
copying	247_113_054	15_539_816	247_107_545	247_259_605	247_259_929
compacting	409_743_010	15_539_816	308_335_419	367_295_137	351_658_670
generational	625_110_580	15_540_080	56_690	1_100_091	622_657
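As one way to compare the GC options, the sketch below sums the GC cost of the four measured calls (generate plus the three batch methods) for each row of the table above. The workload of exactly one call to each method is an assumption made for illustration, and the variable names are hypothetical.

```python
# GC cost per call, copied from the Garbage Collection table:
# (generate 80k, batch_get 50, batch_put 50, batch_remove 50)
gc_costs = {
    "default": (247_113_104, 50, 50, 50),
    "copying": (247_113_054, 247_107_545, 247_259_605, 247_259_929),
    "compacting": (409_743_010, 308_335_419, 367_295_137, 351_658_670),
    "generational": (625_110_580, 56_690, 1_100_091, 622_657),
}

# Total GC cost if each of the four methods is called exactly once (assumed workload).
totals = {name: sum(costs) for name, costs in gc_costs.items()}

for name, total in sorted(totals.items(), key=lambda item: item[1]):
    print(f"{name}: {total:,}")
```

With this workload, the forced generational GC has the most expensive generate call yet a lower total than the forced copying or compacting GC, because its later collections are much cheaper.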

Actor class

	binary size	put new bucket	put existing bucket	get
Map	289_202	748_768	5_609	5_988

Publisher & Subscriber

Measure the cost of inter-canister calls from the Publisher & Subscriber example.

	pub_binary_size	sub_binary_size	subscribe_caller	subscribe_callee	publish_caller	publish_callee
Motoko	156_672	143_547	15_760	8_489	11_737	3_665
Rust	478_372	527_123	57_647	38_523	81_062	45_691

kentosugama commented 1 year ago

Just updated the README.md

dfinity / canister-profiling

Enable wasm optimizer from `dfx 0.14.0` #55
