DataHighway Node. A blockchain being built with Substrate to become a parachain on the Polkadot network. Planned features include a decentralized LPWAN roaming hub for LoRaWAN IoT devices and network operator roaming agreements, participative mining, an inter-chain data market, and DAO governance. http://www.datahighway.com
Based on my review of a previous discussion between Alan S, Basti, and Sergei in Element's "Parachain Technical" room, Alan S shared how he profiled their parachain's block authoring execution time for benchmarking and stack analysis with trace debugging, as follows:
- run your node with the flags --dev, -lsync=trace, -lsub-libp2p=trace
- run perf record -F 999 -p <pid_of_your_node> --call-graph dwarf
- wait for a block to be produced by your node, then press Ctrl+C to stop perf (you can keep the node running to repeat this later)
- generate the perf script with perf script --no-inline > perf.script.data
- open the output at https://www.speedscope.app to view the execution profile (e.g. perf.basti-cache-runtime-fix.data from PR #9611, shared in Element's "Parachain Technical" room)
They were using the default cumulus authorship deadline of 500ms (i.e. 12000 × (1/24) = SLOT_DURATION × block_proposal_slot_portion), where SLOT_DURATION equals their MILLISECS_PER_BLOCK.
But for DataHighway's Westlake we're currently using 4320 for MILLISECS_PER_BLOCK, so the default authorship deadline would be much lower, at 180ms (4320 × 1/24). If we also want roughly a 500ms cumulus authorship deadline (and ~750ms maximum), i.e. proportions of about 500/4320 and 750/4320, we may need to change it to the following:
// With SLOT_DURATION = 4320ms this gives ~540ms for proposing
block_proposal_slot_portion: SlotProportion::new(1f32 / 8f32),
// And a maximum of ~720ms if slots are skipped
max_block_proposal_slot_portion: Some(SlotProportion::new(1f32 / 6f32)),
Note that in the polkadot repo (https://github.com/paritytech/polkadot), both millau and rialto use 6000 for MILLISECS_PER_BLOCK, together with block_proposal_slot_portion: SlotProportion::new(2f32 / 3f32) and max_block_proposal_slot_portion: None.
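To make the relationship between these settings concrete, here is a minimal sketch of the arithmetic (my own illustration, not code from any of these repos) showing how the cumulus authorship deadline falls out of SLOT_DURATION × SlotProportion for the configurations mentioned above:

```rust
/// Minimal sketch: authorship deadline = SLOT_DURATION * slot proportion.
/// The (slot_duration, proportion) pairs below are the figures discussed above.
fn deadline_ms(slot_duration_ms: u64, proportion: f32) -> f32 {
    slot_duration_ms as f32 * proportion
}

fn main() {
    // Default cumulus-style settings: 12000ms blocks with 1/24 and 1/16 portions.
    println!("default proposing: {:.0}ms", deadline_ms(12_000, 1.0 / 24.0)); // 500ms
    println!("default max:       {:.0}ms", deadline_ms(12_000, 1.0 / 16.0)); // 750ms

    // DataHighway Westlake: 4320ms blocks with the default 1/24 portion.
    println!("westlake default:  {:.0}ms", deadline_ms(4_320, 1.0 / 24.0)); // 180ms

    // Proposed portions for Westlake to get close to 500ms / 750ms.
    println!("westlake 1/8:      {:.0}ms", deadline_ms(4_320, 1.0 / 8.0)); // 540ms
    println!("westlake 1/6:      {:.0}ms", deadline_ms(4_320, 1.0 / 6.0)); // 720ms

    // Millau/Rialto: 6000ms blocks with a 2/3 portion.
    println!("millau/rialto:     {:.0}ms", deadline_ms(6_000, 2.0 / 3.0)); // 4000ms
}
```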
Alan S discovered that their 500ms was split up as follows:
- 500ms - parachain block authoring
  - 140ms - reserved for initialization/finalization (i.e. sc_basic_authorship::basic_authorship)
    - 65% - block production (including verifying extrinsic signatures for inclusion)
    - 35% - block finalization
  - 360ms - applying extrinsics and overhead (apply_extrinsic)
    - 25% - overhead of retrieving runtime_code() from the storage cache (i.e. sc_client_db::storage_cache) (only if there is no new runtime code, otherwise it is fetched from the TrieBackend)
    - 50% - overhead of blake2-related runtime_code() execution before each extrinsic is applied (apply_extrinsic_call_at...contextual_call/runtime_code with blake2; this overhead is absent when running the node with --dev)
    - 25% - applying the extrinsics themselves, i.e. extrinsic.check (ecdsa signature verification), which takes ~100ms for 100 extrinsics using system::remark
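To make those percentages concrete, here is a small arithmetic sketch (my own, assuming each percentage is a fraction of its 140ms or 360ms parent bucket) converting them into absolute times:

```rust
/// Rough arithmetic on the ~500ms authoring budget reported above.
/// Assumption: each percentage is a fraction of its parent bucket.
fn main() {
    let init_finalize_ms = 140.0_f64;
    let apply_extrinsics_ms = 360.0_f64;

    // 140ms initialization/finalization bucket.
    println!("block production:          {:.0}ms", init_finalize_ms * 0.65); // ~91ms
    println!("block finalization:        {:.0}ms", init_finalize_ms * 0.35); // ~49ms

    // 360ms apply_extrinsic bucket.
    println!("runtime_code() from cache: {:.0}ms", apply_extrinsics_ms * 0.25); // ~90ms
    println!("runtime_code() blake2:     {:.0}ms", apply_extrinsics_ms * 0.50); // ~180ms
    println!("extrinsic.check etc.:      {:.0}ms", apply_extrinsics_ms * 0.25); // ~90ms

    // Total matches the ~500ms parachain block authoring deadline.
    println!("total: {:.0}ms", init_finalize_ms + apply_extrinsics_ms); // 500ms
}
```

The ~90ms attributed to extrinsic.check is consistent with the reported ~100ms for 100 system::remark extrinsics.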
I believe we need to profile our parachain using perf, as described above, with the kinds of extrinsics we'll actually be using, to undertake benchmarking and stack analysis of the block authoring execution time, and use trace debugging to determine whether we need to:
- increase the block proposal cumulus deadline (i.e. block_proposal_slot_portion) to compensate for production overhead (see https://github.com/paritytech/substrate/pull/9611, which Basti created as a result of this discussion and which increased the number of basic extrinsics per block by ~3x, from a maximum of ~180 tx/block to ~450 tx/block)
- re-evaluate the ExtrinsicBaseWeight we are using in the fork of Substrate that we depend on
- check whether we need to change the leniency strategy used with block_proposal_slot_portion in the fork of Substrate we depend on (i.e. change sc_consensus_slots::SlotLenienceType from Exponential to Linear in sc_consensus_slots::proposing_remaining_duration)

Note: one user in the room mentioned that "transactions take progressively longer the later they go into a block in a linear way".

Here are extracts of relevant parts of the codebases that we should consider for possible changes in our 'ilya/parachain-update' branch:
pub const MILLISECS_PER_BLOCK: u64 = 12000;
pub const SLOT_DURATION: u64 = MILLISECS_PER_BLOCK;
// We got around 500ms for proposing
block_proposal_slot_portion: SlotProportion::new(1f32 / 24f32),
// And a maximum of 750ms if slots are skipped
max_block_proposal_slot_portion: Some(SlotProportion::new(1f32 / 16f32)),
...
/// We assume that ~10% of the block weight is consumed by `on_initialize` handlers.
/// This is used to limit the maximal weight of a single extrinsic.
const AVERAGE_ON_INITIALIZE_RATIO: Perbill = Perbill::from_percent(10);
/// We allow `Normal` extrinsics to fill up the block up to 75%, the rest can be used
/// by Operational extrinsics.
const NORMAL_DISPATCH_RATIO: Perbill = Perbill::from_percent(75);
/// We allow for 0.5 of a second of compute with a 12 second average block time.
const MAXIMUM_BLOCK_WEIGHT: Weight = WEIGHT_PER_SECOND / 2;
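For context, here is a rough sketch of the block weight budget these constants imply (my own arithmetic; it assumes WEIGHT_PER_SECOND = 10^12, i.e. 10^12 weight units per second of reference-hardware compute, as in the Substrate version these extracts come from) and how it compares with the authorship deadlines discussed above:

```rust
/// Rough sketch of the block weight budget implied by the constants above.
/// Assumption: WEIGHT_PER_SECOND = 10^12, i.e. ~1s of compute on reference
/// hardware (as in the Substrate version these extracts come from).
const WEIGHT_PER_SECOND: u64 = 1_000_000_000_000;

fn main() {
    // MAXIMUM_BLOCK_WEIGHT = WEIGHT_PER_SECOND / 2, i.e. 0.5s of compute per block.
    let maximum_block_weight = WEIGHT_PER_SECOND / 2;

    // NORMAL_DISPATCH_RATIO: 75% of the block may be filled by `Normal` extrinsics.
    let normal_dispatch_budget = maximum_block_weight * 75 / 100;

    // AVERAGE_ON_INITIALIZE_RATIO: ~10% assumed consumed by `on_initialize`.
    let on_initialize_budget = maximum_block_weight * 10 / 100;

    println!("max block weight:      {} (~500ms)", maximum_block_weight);
    println!("normal dispatch:       {} (~375ms)", normal_dispatch_budget);
    println!("on_initialize reserve: {} (~50ms)", on_initialize_budget);
}
```

Whether that ~0.5s compute budget is actually usable within our cumulus authorship deadline (180ms today, ~540ms with the proposed 1/8 portion) is something the perf profiling above should help us determine.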