casper-network / casper-node

Reference client for CASPER protocol
https://casper.network
Apache License 2.0

Large Dictionary Deploy. #2346

Closed piotr-dziubecki closed 2 years ago

piotr-dziubecki commented 3 years ago

https://testnet.cspr.live/deploy/9a47e26df624a237ac2d76da6707ff403423d343feb4de2f8886c4f4a17b3b29

Stored 50000+ dictionary entries that look to be dynamically generated from a stored contract call.

{
  "StoredContractByHash": {
    "hash": "a2cfd09d37adea7b26ffdfdaa1191fe2f597aafb5b4277c41c18db012d911d53",
    "entry_point": "create_domains",
    "args": [
      [
        "number",
        { "cl_type": "U64", "bytes": "50c3000000000000", "parsed": 50000 }
      ]
    ]
  }
}

gas cost: 2,874.50723 CSPR
fiat cost approx: $330

Monitoring from a validator: https://files.slack.com/files-pri/TDVFB45LG-F02M099SJ6Q/image.png

Read pegged at 150 MB/s for 4.5 mins on our non-validating nodes:
https://grafana.casperlabs.io/d/24t3s9Dnz/testnet-casper?orgId=1&from=1636555157167&to=1636555823356

This caused a 4 min pause on TestNet. Currently syncing TestNet up through this and we will see if data.lmdb growth matches the 3 GB just shown in validator metrics.
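As a quick cross-check of the deploy argument, the "bytes" field of the number arg can be decoded by hand. The sketch below assumes a U64 is encoded as 8 little-endian bytes shown as hex, which is consistent with the "parsed" value in the deploy JSON; the helper name decode_u64_le_hex is made up for this illustration and is not part of any Casper crate.

```rust
/// Decode a 16-character hex string as a little-endian u64.
/// Hypothetical helper for illustration; not a casper-types API.
fn decode_u64_le_hex(hex: &str) -> Option<u64> {
    // Exactly 8 bytes, written as 16 ASCII hex digits.
    if hex.len() != 16 || !hex.bytes().all(|c| c.is_ascii_hexdigit()) {
        return None;
    }
    let mut bytes = [0u8; 8];
    for (i, b) in bytes.iter_mut().enumerate() {
        *b = u8::from_str_radix(&hex[2 * i..2 * i + 2], 16).ok()?;
    }
    // Least significant byte comes first.
    Some(u64::from_le_bytes(bytes))
}

fn main() {
    // "bytes" value copied from the deploy arguments.
    let number = decode_u64_le_hex("50c3000000000000");
    println!("{:?}", number);
}
```

The first two bytes, 0x50 and 0xc3, read least-significant-first give 0xc350, which is 50000 and agrees with the "parsed" field.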

mpapierski commented 2 years ago

I performed a detailed profiling analysis of this deploy, which led to some interesting findings. Together with @goral09 we also fixed some low-hanging fruit we identified ( #2393 #2394 #2401 #2414 #2399 #todo_pointerblock #todo_inmem_hang #trie_overhead with further plans in preparation). Tests were conducted on top of the 1.4.1 branch, both without and with these fixes backported. We were able to go down from 12.5s to 9s, and further optimizations are planned.

Contract details

The contract Wasm was extracted from testnet and analyzed. An equivalent minimized contract was then developed that produces identical effects.

The entry point shown on the testnet deploy page is create_domains, and it results in 49500 dictionary writes. The writes use the same seed address, so when the entry point is executed multiple times it will try to write to the same Key::Dictionary addresses.
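The overwrite behaviour can be illustrated with a toy model. This is only a sketch under stated assumptions: it assumes a dictionary address is a deterministic function of the seed and the item key, it uses the standard library's DefaultHasher as a stand-in for the node's real hashing, and the item key naming (the loop index as a string) is hypothetical since the actual contract's keys are not shown here.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Stand-in for deriving a dictionary address from (seed, item key).
/// The real node uses its own hashing; DefaultHasher is illustration only.
fn dictionary_address(seed: &[u8; 32], item_key: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    seed.hash(&mut hasher);
    item_key.hash(&mut hasher);
    hasher.finish()
}

/// Toy version of the entry point: one write per index, same seed each run.
/// Item keys are hypothetical ("0", "1", ...).
fn run_create_domains(state: &mut HashMap<u64, u64>, seed: &[u8; 32], number: u64) {
    for i in 0..number {
        let addr = dictionary_address(seed, &i.to_string());
        state.insert(addr, i); // an existing address is overwritten
    }
}

fn main() {
    let seed = [7u8; 32];
    let mut state = HashMap::new();
    run_create_domains(&mut state, &seed, 1_000);
    let after_first = state.len();
    run_create_domains(&mut state, &seed, 1_000);
    // The second run hits the same addresses, so no new entries appear.
    println!("first run: {}, second run: {}", after_first, state.len());
}
```

In this model a repeated execution rewrites existing entries rather than growing the key set, which is the property that makes repeated runs of the same deploy comparable.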

CPU analysis

I used perf and valgrind. The profiled binary was built with frame pointers forced on:

RUSTC_LINKER=$(which clang) RUSTFLAGS="-Clink-arg=-fuse-ld=lld -Clink-arg=-Wl,--no-rosegment -Cforce-frame-pointers=y" cargo build --release --bin simple-transfer

and

valgrind --tool=callgrind ./target/release/simple-transfer 

Storage analysis

This contract wrote exactly 49500 dictionary entries, and LMDB grew to 848 MB after genesis, installation of the contract, and execution of the stored contract's create_domains entry point.

Data size:

After analysis, I identified a PointerBlock serialization inefficiency: we always serialize the None variants of the PointerBlock, so every block weighs at least 256 bytes due to the Option tag overhead. As an experiment, I changed the format to serialize only the Some variants in the PointerBlock together with their index, which saved 40 MB (ticket #todo_pointerblock ).
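The two encodings can be compared with a small sketch. It is not the node's actual serialization code: the Pointer layout (a 1-byte variant tag plus a 32-byte digest), the 2-byte entry count, and the function names serialize_dense and serialize_sparse are assumptions made for illustration.

```rust
const RADIX: usize = 256;
type Digest = [u8; 32];

/// Assumed pointer layout: 1-byte variant tag followed by a 32-byte digest.
#[derive(Clone, Copy)]
enum Pointer {
    Leaf(Digest),
    Node(Digest),
}

fn write_pointer(out: &mut Vec<u8>, pointer: &Pointer) {
    match pointer {
        Pointer::Leaf(digest) => { out.push(0); out.extend_from_slice(digest); }
        Pointer::Node(digest) => { out.push(1); out.extend_from_slice(digest); }
    }
}

/// Dense format: every slot writes an Option tag, even when it is None.
fn serialize_dense(block: &[Option<Pointer>; RADIX]) -> Vec<u8> {
    let mut out = Vec::new();
    for slot in block.iter() {
        match slot {
            None => out.push(0),
            Some(pointer) => { out.push(1); write_pointer(&mut out, pointer); }
        }
    }
    out
}

/// Sparse format: a count, then (index, pointer) pairs for Some slots only.
fn serialize_sparse(block: &[Option<Pointer>; RADIX]) -> Vec<u8> {
    let mut out = Vec::new();
    let count = block.iter().filter(|slot| slot.is_some()).count();
    // Up to 256 entries are possible, so the count needs two bytes.
    out.extend_from_slice(&(count as u16).to_le_bytes());
    for (index, slot) in block.iter().enumerate() {
        if let Some(pointer) = slot {
            out.push(index as u8); // index is always < RADIX
            write_pointer(&mut out, pointer);
        }
    }
    out
}

fn main() {
    let mut block: [Option<Pointer>; RADIX] = [None; RADIX];
    block[3] = Some(Pointer::Leaf([0xAA; 32]));
    block[200] = Some(Pointer::Node([0xBB; 32]));
    println!("dense:  {} bytes", serialize_dense(&block).len());
    println!("sparse: {} bytes", serialize_sparse(&block).len());
}
```

Under these assumptions a block with n occupied slots costs 256 + 33n bytes in the dense form and 2 + 34n bytes in the sparse form, so the sparse form is smaller for every block that is not almost completely full. The actual saving depends on the real pointer layout and on how many blocks are stored.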

Trie-level analysis

After executing create_domains to create 49500 entries, I performed a global state analysis down to the trie level:

// total tries: 212568
// leaf_count: 49545
# All Trie::Node variants
// pointer_block_count: 163097
# Total number of Some variants in PointerBlock
// pointer_block_pointers: 16807268
# Total number of all Option<Pointer> values in PointerBlock (a multiple of RADIX)
// pointer_block_pointers_count: 41752832
# pointer_block_pointers/pointer_block_pointers_count
// pointer_block_fill_ratio: 0.4025419880500561
# Trie::Extension variants
// extension_count: 48
# All possible CLTypes stored in the global state. Any is the DictionaryValue wrapper
// cl_types: {U32: 1, U64: 5, U512: 9, Unit: 6, Map { key: U64, value: Map { key: PublicKey, value: Any } }: 1, Map { key: String, value: ByteArray(32) }: 1, Tuple2([U512, U512]): 1, Any: 49500}
# All possible variants of StoredValue
// stored_values: {CLValue: 49524, Account: 4, ContractWasm: 5, Contract: 5, ContractPackage: 5, DeployInfo: 2}
# All possible variants of a Key
// keys: {Account: 4, Hash: 15, URef: 15, DeployInfo: 2, Balance: 8, Dictionary: 49500, SystemContractRegistry: 1}
# All affix lengths
// affix lengths: {1}
# All possible leaf node leafs from both Trie::Node and Trie::Extension 
// pointer_leaf_count: 3128828
# All possible pointer node leafs from both Trie::Node and Trie::Extension 
// pointer_node_count: 13678488
// unique_pointer_leafs: 49545
// unique_pointer_nodes: 113599
# unreachable pointers - 0 means global state is consistent
// unreachable_node_pointers: 0
// unreachable_leaf_pointers: 0

The 49500 writes produce exactly 212568 tries in the global state, roughly 4.3 tries per write, which is a huge overhead. We should perform further analysis to see if we can produce fewer intermediate trie entries in the global state during smart contract execution. (#trie_overhead)
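The derived figures in the dump above can be re-computed from the raw counters. The sketch below only does arithmetic on numbers copied from the dump; the TrieStats struct and its method names are made up for this illustration, and RADIX = 256 is inferred from pointer_block_pointers_count / pointer_block_count.

```rust
const RADIX: u64 = 256;

/// Raw counters copied from the trie-level dump (struct is hypothetical).
struct TrieStats {
    total_tries: u64,
    pointer_block_count: u64,
    pointer_block_pointers: u64,
    dictionary_writes: u64,
}

impl TrieStats {
    /// All Option<Pointer> slots across every pointer block.
    fn pointer_slots(&self) -> u64 {
        self.pointer_block_count * RADIX
    }

    /// Fraction of slots that hold a Some pointer.
    fn fill_ratio(&self) -> f64 {
        self.pointer_block_pointers as f64 / self.pointer_slots() as f64
    }

    /// Average number of Some pointers per pointer block.
    fn avg_pointers_per_block(&self) -> f64 {
        self.pointer_block_pointers as f64 / self.pointer_block_count as f64
    }

    /// Tries stored per dictionary write.
    fn tries_per_write(&self) -> f64 {
        self.total_tries as f64 / self.dictionary_writes as f64
    }
}

fn reported_stats() -> TrieStats {
    TrieStats {
        total_tries: 212_568,
        pointer_block_count: 163_097,
        pointer_block_pointers: 16_807_268,
        dictionary_writes: 49_500,
    }
}

fn main() {
    let stats = reported_stats();
    println!("pointer slots:      {}", stats.pointer_slots());
    println!("fill ratio:         {:.4}", stats.fill_ratio());
    println!("pointers per block: {:.1}", stats.avg_pointers_per_block());
    println!("tries per write:    {:.2}", stats.tries_per_write());
}
```

The slot count comes out to 41752832, matching pointer_block_pointers_count, and the fill ratio agrees with the reported value of about 0.4025. On average about 103 of the 256 slots in a block are occupied.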