I performed a detailed profiling analysis of said deploy, and it led to interesting findings; together with @goral09 we also fixed a few low-hanging fruits we identified along the way (#2393 #2394 #2401 #2414 #2399 #todo_pointerblock #todo_inmem_hang #trie_overhead, with further plans in preparation). Tests were conducted on top of the 1.4.1 branch, both without and with said fixes backported. Execution time went down from 12.5s to 9s, with further optimizations planned.
Contract Wasm was extracted from testnet and analyzed. A minimized contract was developed that produces identical effects.

The entry point from the testnet deploy page is create_domains and it results in 49500 dictionary writes. The writes use the same seed address, so when the contract is executed multiple times it writes to the same Key::Dictionary addresses.
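The minimized contract is not attached to this issue, so here is a minimal sketch of what its create_domains entry point could look like, assuming the casper-contract crate's dictionary API; the dictionary name "domains" and the item-key scheme are illustrative guesses, and the installation boilerplate is omitted:

#![no_std]
#![no_main]

extern crate alloc;
use alloc::format;

use casper_contract::{
    contract_api::{runtime, storage},
    unwrap_or_revert::UnwrapOrRevert,
};

#[no_mangle]
pub extern "C" fn create_domains() {
    let number: u64 = runtime::get_named_arg("number");
    // Reuse one fixed seed URef: every run then derives the same set of
    // Key::Dictionary addresses from (seed, item key).
    let seed = match runtime::get_key("domains") {
        Some(key) => key.into_uref().unwrap_or_revert(),
        None => storage::new_dictionary("domains").unwrap_or_revert(),
    };
    for i in 0..number {
        // Each iteration is one dictionary write; number=49500 reproduces
        // the workload analyzed below.
        storage::dictionary_put(seed, &format!("domain-{}", i), i);
    }
}

Looking the seed URef up from named keys on subsequent runs is what makes repeated executions target the same Key::Dictionary addresses.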
I used perf and valgrind. The profiled binary was built with:

RUSTC_LINKER=$(which clang) RUSTFLAGS="-Clink-arg=-fuse-ld=lld -Clink-arg=-Wl,--no-rosegment -Cforce-frame-pointers=y" cargo build --release --bin simple-transfer

and run under callgrind with:

valgrind --tool=callgrind ./target/release/simple-transfer
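For completeness, perf was driven in the usual way against the same binary; the frame-pointer flag above, in particular, is what keeps the -g call graphs usable:

perf record -g ./target/release/simple-transfer
perf report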
Trie::to_bytes

There is potential for a Trie-level cache, as LMDB storage always gets and puts tries through ToBytes/FromBytes (Trie::to_bytes is called about 212568 times during execution; see below). We can consider keeping tries in memory and, once the write phase is done, flushing them from the cache to the trie store (see the sketch below).

This contract wrote exactly 49500 dictionary entries, and LMDB grows to 848MB after genesis, installation of the contract, and stored contract execution of the create_domains entry point.
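A minimal sketch of that cache idea, under stated assumptions: Digest, Trie and LmdbStore below are stand-ins, not the engine's actual storage API.

use std::collections::HashMap;

// Illustrative stand-ins for the real casper types and the LMDB trie store.
type Digest = [u8; 32];
type Trie = Vec<u8>;

#[derive(Default)]
struct LmdbStore {
    backing: HashMap<Digest, Trie>, // pretend this is LMDB
}

impl LmdbStore {
    fn get(&self, digest: &Digest) -> Option<Trie> {
        self.backing.get(digest).cloned() // the real store deserializes via FromBytes here
    }
    fn put(&mut self, digest: Digest, trie: Trie) {
        self.backing.insert(digest, trie); // the real store serializes via ToBytes here
    }
}

#[derive(Default)]
struct CachedTrieStore {
    cache: HashMap<Digest, Trie>,
    store: LmdbStore,
}

impl CachedTrieStore {
    /// Serve reads from memory first, avoiding a FromBytes round-trip.
    fn get(&mut self, digest: &Digest) -> Option<Trie> {
        if let Some(trie) = self.cache.get(digest) {
            return Some(trie.clone());
        }
        let trie = self.store.get(digest)?;
        self.cache.insert(*digest, trie.clone());
        Some(trie)
    }

    /// Buffer writes in memory; nothing hits LMDB (or ToBytes) yet.
    fn put(&mut self, digest: Digest, trie: Trie) {
        self.cache.insert(digest, trie);
    }

    /// Once the write phase is done, flush each trie to LMDB exactly once.
    fn flush(&mut self) {
        for (digest, trie) in self.cache.drain() {
            self.store.put(digest, trie);
        }
    }
}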
Key and DictionaryValue (the raw values) occupied 5MB. create_domains storage size as measured on disk through du -sh is 848MB; it does not grow after the first run. Keep in mind that, measured in bytes, it does change, as we need to create a new post-state hash etc. Loading the same data into InMemoryGlobalState and measuring raw bytes gave 608MB, which puts the LMDB overhead at 240MB.

After analysis, I identified a PointerBlock serialization inefficiency: we always serialize the None variants of the PointerBlock, so it weighs at least 256 bytes due to Option's tag overhead. As an experiment, I changed the format to serialize the indices of the Some variants in the PointerBlock, which saved 40MB (ticket #todo_pointerblock).
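To illustrate the experiment, here is a sketch of the dense versus sparse encodings; Pointer is a stand-in type, and the real format goes through the ToBytes machinery rather than hand-rolled functions like these.

const RADIX: usize = 256;

// Stand-in for the real Pointer type (a tagged Digest).
type Pointer = [u8; 32];

// Current format: one tag byte per slot, so even an empty block costs
// RADIX = 256 bytes of Option tags.
fn serialize_dense(block: &[Option<Pointer>; RADIX]) -> Vec<u8> {
    let mut out = Vec::new();
    for slot in block.iter() {
        match slot {
            None => out.push(0),
            Some(ptr) => {
                out.push(1);
                out.extend_from_slice(ptr);
            }
        }
    }
    out
}

// Experimental format: a count followed by (index, pointer) pairs for the
// Some slots only; smaller whenever fewer than ~254 slots are occupied,
// and the measured fill ratio is only ~40%.
fn serialize_sparse(block: &[Option<Pointer>; RADIX]) -> Vec<u8> {
    let mut out = Vec::new();
    let count = block.iter().filter(|slot| slot.is_some()).count() as u16;
    out.extend_from_slice(&count.to_le_bytes()); // u16: a full block has 256 entries
    for (index, slot) in block.iter().enumerate() {
        if let Some(ptr) = slot {
            out.push(index as u8);
            out.extend_from_slice(ptr);
        }
    }
    out
}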
After executing create_domains to create 49500 entries, I performed a global state analysis down to the trie level:
// total tries: 212568
// leaf_count: 49545
# All Trie::Node variants
// pointer_block_count: 163097
# Total number of Some variants in PointerBlock
// pointer_block_pointers: 16807268
# Total number of all Option<Pointer> values in PointerBlock (a multiple of RADIX)
// pointer_block_pointers_count: 41752832
# pointer_block_pointers/pointer_block_pointers_count
// pointer_block_fill_ratio: 0.4025419880500561
# Trie::Extension variants
// extension_count: 48
# All possible CLTypes stored in the global state. Any is the DictionaryValue wrapper
// cl_types: {U32: 1, U64: 5, U512: 9, Unit: 6, Map { key: U64, value: Map { key: PublicKey, value: Any } }: 1, Map { key: String, value: ByteArray(32) }: 1, Tuple2([U512, U512]): 1, Any: 49500}
# All possible variants of StoredValue
// stored_values: {CLValue: 49524, Account: 4, ContractWasm: 5, Contract: 5, ContractPackage: 5, DeployInfo: 2}
# All possible variants of a Key
// keys: {Account: 4, Hash: 15, URef: 15, DeployInfo: 2, Balance: 8, Dictionary: 49500, SystemContractRegistry: 1}
# All observed affix lengths
// affix lengths: {1}
# All leaf pointers (non-unique) from both Trie::Node and Trie::Extension
// pointer_leaf_count: 3128828
# All node pointers (non-unique) from both Trie::Node and Trie::Extension
// pointer_node_count: 13678488
// unique_pointer_leafs: 49545
// unique_pointer_nodes: 113599
# unreachable pointers - 0 means global state is consistent
// unreachable_node_pointers: 0
// unreachable_leaf_pointers: 0
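For reference, counters like these can be gathered with a single pass over every trie in the store. The Trie shape below is illustrative, not the engine's actual definition, which is generic over key and value:

use std::collections::HashSet;

// Illustrative Trie with Leaf / Node / Extension variants; the RADIX-sized
// pointer block is modeled as a Vec of optional digests.
enum Trie {
    Leaf { key: Vec<u8>, value: Vec<u8> },
    Node { pointer_block: Vec<Option<[u8; 32]>> },
    Extension { affix: Vec<u8>, pointer: [u8; 32] },
}

#[derive(Default)]
struct Stats {
    total_tries: usize,
    leaf_count: usize,
    pointer_block_count: usize,
    pointer_block_pointers: usize,
    pointer_block_pointers_count: usize,
    extension_count: usize,
    affix_lengths: HashSet<usize>,
}

// Tally the counters reported above; the fill ratio is
// pointer_block_pointers / pointer_block_pointers_count.
fn scan(tries: impl Iterator<Item = Trie>) -> Stats {
    let mut stats = Stats::default();
    for trie in tries {
        stats.total_tries += 1;
        match trie {
            Trie::Leaf { .. } => stats.leaf_count += 1,
            Trie::Node { pointer_block } => {
                stats.pointer_block_count += 1;
                stats.pointer_block_pointers +=
                    pointer_block.iter().filter(|p| p.is_some()).count();
                stats.pointer_block_pointers_count += pointer_block.len();
            }
            Trie::Extension { affix, .. } => {
                stats.extension_count += 1;
                stats.affix_lengths.insert(affix.len());
            }
        }
    }
    stats
}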
These 49500 writes produce exactly 212568 tries in the global state, which is a huge overhead. We should perform further analysis to see whether we can produce fewer intermediate trie entries in the global state during smart contract execution. (#trie_overhead)
https://testnet.cspr.live/deploy/9a47e26df624a237ac2d76da6707ff403423d343feb4de2f8886c4f4a17b3b29

Stored 50000+ dictionary entries that look to be dynamically generated from a stored contract call.

{
  "StoredContractByHash": {
    "hash": "a2cfd09d37adea7b26ffdfdaa1191fe2f597aafb5b4277c41c18db012d911d53",
    "entry_point": "create_domains",
    "args": [
      [
        "number",
        { "cl_type": "U64", "bytes": "50c3000000000000", "parsed": 50000 }
      ]
    ]
  }
}

gas cost: 2,874.50723 CSPR
fiat cost approx: $330

Monitoring from a validator: https://files.slack.com/files-pri/TDVFB45LG-F02M099SJ6Q/image.png

Read pegged at 150 MB/s for 4.5 mins on our non-validating nodes: https://grafana.casperlabs.io/d/24t3s9Dnz/testnet-casper?orgId=1&from=1636555157167&to=1636555823356

This caused a 4 min pause on TestNet. Currently syncing TestNet up through this, and we will see if data.lmdb growth matches the 3 GB just shown in validator metrics.