I performed a detailed profiling analysis of said deploy, and it led to interesting findings; together with @goral09 we also fixed a few low-hanging fruits we identified along the way (#2393 #2394 #2401 #2414 #2399 #todo_pointerblock #todo_inmem_hang #trie_overhead, with further plans in preparation). Tests were conducted on top of the 1.4.1 branch, both without and with said fixes backported. Execution time went down from 12.5s to 9s, with further optimizations planned.
Contract Wasm was extracted from testnet and analyzed. A minimized contract was developed that produces identical effects.

The entry point from the testnet deploy page is create_domains and it results in 49500 dictionary writes. The writes use the same seed address, so when the contract is executed multiple times it writes to the same Key::Dictionary addresses.
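The minimized contract is not attached to this issue, so here is a minimal sketch of what its create_domains entry point could look like, assuming the casper-contract crate's dictionary API; the dictionary name "domains" and the item-key scheme are illustrative guesses, and the installation boilerplate is omitted:

#![no_std]
#![no_main]

extern crate alloc;
use alloc::format;

use casper_contract::{
    contract_api::{runtime, storage},
    unwrap_or_revert::UnwrapOrRevert,
};

#[no_mangle]
pub extern "C" fn create_domains() {
    let number: u64 = runtime::get_named_arg("number");
    // Reuse one fixed seed URef: every run then derives the same set of
    // Key::Dictionary addresses from (seed, item key).
    let seed = match runtime::get_key("domains") {
        Some(key) => key.into_uref().unwrap_or_revert(),
        None => storage::new_dictionary("domains").unwrap_or_revert(),
    };
    for i in 0..number {
        // Each iteration is one dictionary write; number=49500 reproduces
        // the workload analyzed below.
        storage::dictionary_put(seed, &format!("domain-{}", i), i);
    }
}

Looking the seed URef up from named keys on subsequent runs is what makes repeated executions target the same Key::Dictionary addresses.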
I used perf and valgrind. The profiled binary was built with:

RUSTC_LINKER=$(which clang) RUSTFLAGS="-Clink-arg=-fuse-ld=lld -Clink-arg=-Wl,--no-rosegment -Cforce-frame-pointers=y" cargo build --release --bin simple-transfer

and run under callgrind with:

valgrind --tool=callgrind ./target/release/simple-transfer
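For completeness, perf was driven in the usual way against the same binary; the frame-pointer flag above, in particular, is what keeps the -g call graphs usable:

perf record -g ./target/release/simple-transfer
perf report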
Trie::to_bytes

There is potential for a Trie-level cache, as LMDB storage always gets and puts tries through ToBytes/FromBytes (Trie::to_bytes is called about 212568 times during execution; see below). We can consider keeping tries in memory and, once the write phase is done, flushing them from the cache to the trie store (see the sketch below).

This contract wrote exactly 49500 dictionary entries, and LMDB grows to 848MB after genesis, installation of the contract, and stored contract execution of the create_domains entry point.
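A minimal sketch of that cache idea, under stated assumptions: Digest, Trie and LmdbStore below are stand-ins, not the engine's actual storage API.

use std::collections::HashMap;

// Illustrative stand-ins for the real casper types and the LMDB trie store.
type Digest = [u8; 32];
type Trie = Vec<u8>;

#[derive(Default)]
struct LmdbStore {
    backing: HashMap<Digest, Trie>, // pretend this is LMDB
}

impl LmdbStore {
    fn get(&self, digest: &Digest) -> Option<Trie> {
        self.backing.get(digest).cloned() // the real store deserializes via FromBytes here
    }
    fn put(&mut self, digest: Digest, trie: Trie) {
        self.backing.insert(digest, trie); // the real store serializes via ToBytes here
    }
}

#[derive(Default)]
struct CachedTrieStore {
    cache: HashMap<Digest, Trie>,
    store: LmdbStore,
}

impl CachedTrieStore {
    /// Serve reads from memory first, avoiding a FromBytes round-trip.
    fn get(&mut self, digest: &Digest) -> Option<Trie> {
        if let Some(trie) = self.cache.get(digest) {
            return Some(trie.clone());
        }
        let trie = self.store.get(digest)?;
        self.cache.insert(*digest, trie.clone());
        Some(trie)
    }

    /// Buffer writes in memory; nothing hits LMDB (or ToBytes) yet.
    fn put(&mut self, digest: Digest, trie: Trie) {
        self.cache.insert(digest, trie);
    }

    /// Once the write phase is done, flush each trie to LMDB exactly once.
    fn flush(&mut self) {
        for (digest, trie) in self.cache.drain() {
            self.store.put(digest, trie);
        }
    }
}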
Key and DictionaryValue (the raw values) occupied 5MB. create_domains storage size as measured on disk through du -sh is 848MB; it does not grow after the first run. Keep in mind that, measured in bytes, it does change, as we need to create a new post-state hash etc. Loading the same data into InMemoryGlobalState and measuring raw bytes gave 608MB, which puts the LMDB overhead at 240MB.

After analysis, I identified a PointerBlock serialization inefficiency: we always serialize the None variants of the PointerBlock, so it weighs at least 256 bytes due to Option's tag overhead. As an experiment, I changed the format to serialize the indices of the Some variants in the PointerBlock, which saved 40MB (ticket #todo_pointerblock).
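To illustrate the experiment, here is a sketch of the dense versus sparse encodings; Pointer is a stand-in type, and the real format goes through the ToBytes machinery rather than hand-rolled functions like these.

const RADIX: usize = 256;

// Stand-in for the real Pointer type (a tagged Digest).
type Pointer = [u8; 32];

// Current format: one tag byte per slot, so even an empty block costs
// RADIX = 256 bytes of Option tags.
fn serialize_dense(block: &[Option<Pointer>; RADIX]) -> Vec<u8> {
    let mut out = Vec::new();
    for slot in block.iter() {
        match slot {
            None => out.push(0),
            Some(ptr) => {
                out.push(1);
                out.extend_from_slice(ptr);
            }
        }
    }
    out
}

// Experimental format: a count followed by (index, pointer) pairs for the
// Some slots only; smaller whenever fewer than ~254 slots are occupied,
// and the measured fill ratio is only ~40%.
fn serialize_sparse(block: &[Option<Pointer>; RADIX]) -> Vec<u8> {
    let mut out = Vec::new();
    let count = block.iter().filter(|slot| slot.is_some()).count() as u16;
    out.extend_from_slice(&count.to_le_bytes()); // u16: a full block has 256 entries
    for (index, slot) in block.iter().enumerate() {
        if let Some(ptr) = slot {
            out.push(index as u8);
            out.extend_from_slice(ptr);
        }
    }
    out
}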
After executing create_domains to create 49500 entries, I performed a global state analysis down to the trie level:
// total tries: 212568
// leaf_count: 49545
# All Trie::Node variants
// pointer_block_count: 163097
# Total number of Some variants in PointerBlock
// pointer_block_pointers: 16807268
# Total number of all Option<Pointer> values in PointerBlock (a multiple of RADIX)
// pointer_block_pointers_count: 41752832
# pointer_block_pointers/pointer_block_pointers_count
// pointer_block_fill_ratio: 0.4025419880500561
# Trie::Extension variants
// extension_count: 48
# All possible CLTypes stored in the global state. Any is the DictionaryValue wrapper
// cl_types: {U32: 1, U64: 5, U512: 9, Unit: 6, Map { key: U64, value: Map { key: PublicKey, value: Any } }: 1, Map { key: String, value: ByteArray(32) }: 1, Tuple2([U512, U512]): 1, Any: 49500}
# All possible variants of StoredValue
// stored_values: {CLValue: 49524, Account: 4, ContractWasm: 5, Contract: 5, ContractPackage: 5, DeployInfo: 2}
# All possible variants of a Key
// keys: {Account: 4, Hash: 15, URef: 15, DeployInfo: 2, Balance: 8, Dictionary: 49500, SystemContractRegistry: 1}
# All observed affix lengths
// affix lengths: {1}
# All leaf pointers (non-unique) from both Trie::Node and Trie::Extension
// pointer_leaf_count: 3128828
# All node pointers (non-unique) from both Trie::Node and Trie::Extension
// pointer_node_count: 13678488
// unique_pointer_leafs: 49545
// unique_pointer_nodes: 113599
# unreachable pointers - 0 means global state is consistent
// unreachable_node_pointers: 0
// unreachable_leaf_pointers: 0
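For reference, counters like these can be gathered with a single pass over every trie in the store. The Trie shape below is illustrative, not the engine's actual definition, which is generic over key and value:

use std::collections::HashSet;

// Illustrative Trie with Leaf / Node / Extension variants; the RADIX-sized
// pointer block is modeled as a Vec of optional digests.
enum Trie {
    Leaf { key: Vec<u8>, value: Vec<u8> },
    Node { pointer_block: Vec<Option<[u8; 32]>> },
    Extension { affix: Vec<u8>, pointer: [u8; 32] },
}

#[derive(Default)]
struct Stats {
    total_tries: usize,
    leaf_count: usize,
    pointer_block_count: usize,
    pointer_block_pointers: usize,
    pointer_block_pointers_count: usize,
    extension_count: usize,
    affix_lengths: HashSet<usize>,
}

// Tally the counters reported above; the fill ratio is
// pointer_block_pointers / pointer_block_pointers_count.
fn scan(tries: impl Iterator<Item = Trie>) -> Stats {
    let mut stats = Stats::default();
    for trie in tries {
        stats.total_tries += 1;
        match trie {
            Trie::Leaf { .. } => stats.leaf_count += 1,
            Trie::Node { pointer_block } => {
                stats.pointer_block_count += 1;
                stats.pointer_block_pointers +=
                    pointer_block.iter().filter(|p| p.is_some()).count();
                stats.pointer_block_pointers_count += pointer_block.len();
            }
            Trie::Extension { affix, .. } => {
                stats.extension_count += 1;
                stats.affix_lengths.insert(affix.len());
            }
        }
    }
    stats
}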
These 49500 writes produce exactly 212568 tries in the global state, which is a huge overhead. We should perform further analysis to see whether we can produce fewer intermediate trie entries in the global state during smart contract execution. (#trie_overhead)
https://testnet.cspr.live/deploy/9a47e26df624a237ac2d76da6707ff403423d343feb4de2f8886c4f4a17b3b29

Stored 50000+ dictionary entries that look to be dynamically generated from a stored contract call.

{
  "StoredContractByHash": {
    "hash": "a2cfd09d37adea7b26ffdfdaa1191fe2f597aafb5b4277c41c18db012d911d53",
    "entry_point": "create_domains",
    "args": [
      [
        "number",
        { "cl_type": "U64", "bytes": "50c3000000000000", "parsed": 50000 }
      ]
    ]
  }
}

gas cost: 2,874.50723 CSPR
fiat cost approx: $330

Monitoring from a validator: https://files.slack.com/files-pri/TDVFB45LG-F02M099SJ6Q/image.png

Read pegged at 150 MB/s for 4.5 mins on our non-validating nodes: https://grafana.casperlabs.io/d/24t3s9Dnz/testnet-casper?orgId=1&from=1636555157167&to=1636555823356

This caused a 4 min pause on TestNet. Currently syncing TestNet up through this, and we will see if data.lmdb growth matches the 3 GB just shown in validator metrics.