eqlabs / pathfinder

A Starknet full node written in Rust
https://eqlabs.github.io/pathfinder/

Metrics and tracing for p2p stages #2050

Open Mirko-von-Leipzig opened 4 months ago

Mirko-von-Leipzig commented 4 months ago

p2p sync is essentially a tree of processing tasks called stages, operating concurrently and connected via SPSC channels of some capacity. Adding metrics and tracing to stages will greatly simplify debugging and identifying bottlenecks.

What will (presumably) occur is that some stages will be the slow point, causing their input channels to fill up and block the system. Knowing which stages are slow will show where to add parallelisation and/or increase the channel capacity.


My take on this

Disclaimer - this is just my opinion without having attempted this, you might come to a different conclusion.

We are interested in at least three pieces of information for each stage

  1. Which stage is this?
  2. Processing time
  3. Channel fullness

(1) - Use a &'static str to identify each stage. We can add this to the Stage trait, but this fails to uniquely identify a stage if there are duplicates involved. One alternative is to add it as an additional input parameter to the pipe function. This works, though it would also be nice if one could include some tree-like ID that would allow a system diagram UI to be drawn. This is completely unnecessary, just nice for explaining visually what's going on. One could manually assign these IDs within stage names, but it should also be possible to do this at compile time if one adds functionality to the channel type (to pass on this type info somehow). But this is overkill.
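As a rough illustration of the "name as a pipe parameter" option, here is a minimal sketch using only the standard library. The `Stage` trait, the `pipe` signature, the `Double` stage and the `events` sink are all hypothetical stand-ins, not the actual sync framework types; the point is only that two instances of the same stage type stay distinguishable because the name is attached at the call site.

```rust
use std::sync::mpsc::{sync_channel, Receiver, Sender};
use std::thread;

/// Hypothetical stand-in for the sync framework's stage trait.
pub trait Stage: Send + 'static {
    type Input: Send + 'static;
    type Output: Send + 'static;
    fn map(&mut self, input: Self::Input) -> Self::Output;
}

/// Example stage; two instances of it are indistinguishable by type alone.
pub struct Double;

impl Stage for Double {
    type Input = u64;
    type Output = u64;
    fn map(&mut self, input: u64) -> u64 {
        input * 2
    }
}

/// Hypothetical `pipe`: runs `stage` on its own thread and reports `name`
/// to the `events` sink for every item it processes.
pub fn pipe<S: Stage>(
    name: &'static str,
    mut stage: S,
    input: Receiver<S::Input>,
    capacity: usize,
    events: Sender<&'static str>,
) -> Receiver<S::Output> {
    let (tx, rx) = sync_channel(capacity);
    thread::spawn(move || {
        for item in input {
            let output = stage.map(item);
            // Report which stage instance did the work; ignore a closed sink.
            let _ = events.send(name);
            if tx.send(output).is_err() {
                break; // downstream hung up
            }
        }
    });
    rx
}
```

Chaining `pipe("double_a", Double, ...)` into `pipe("double_b", Double, ...)` yields events tagged with the instance name rather than the type, which a trait-level name could not provide.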

(2) - This can just be a simple timer inside the pipe function which measures the execution time of the Stage in each iteration. The only issue is that some stages occur after try_buffer calls, which means they execute over a vector of items, making the processing times incomparable. It would be possible to account for this by creating a BufferedReceiver type, but now we're adding more "duplicate" types just so we can log a bit better. I would hesitate to do this until the sync framework has proven mature. We might need to add many such types. Or none at all. Or maybe it's trivial to perform this with a wrapper type and deref.
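A minimal sketch of the timer idea, assuming nothing about the real pipe function: `timed` wraps one stage invocation, and `per_item` is one possible (hypothetical) way to make buffered stages comparable by dividing the batch time by the batch length. Both names are made up for illustration.

```rust
use std::time::{Duration, Instant};

/// Runs `f` once and returns its result together with the elapsed wall-clock time.
pub fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let output = f();
    (output, start.elapsed())
}

/// Normalises the time spent on a batch to a per-item figure, so that a stage
/// fed by a buffer can be compared with one that handles single items.
pub fn per_item(total: Duration, items: usize) -> Duration {
    // Treat an empty batch as one unit of work to avoid dividing by zero,
    // and clamp very large batch sizes to what `Duration` division accepts.
    let divisor = u32::try_from(items.max(1)).unwrap_or(u32::MAX);
    total / divisor
}
```

Inside a pipe loop this would look roughly like `let (output, elapsed) = timed(|| stage.map(input));`, with `per_item(elapsed, batch_len)` applied only for stages that sit behind a buffer. An average per item hides variance within the batch, so this is a coarse comparison at best.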

(3) - A channel's "fullness" can be determined using the capacity and max_capacity methods.
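To make the arithmetic explicit, here is a small sketch that assumes tokio-style semantics, where `capacity` reports the currently free slots and `max_capacity` the total buffer size. The helper names `queued` and `fullness` are hypothetical. Note that under these semantics reserved-but-unsent slots would also count as occupied.

```rust
/// Number of occupied slots, assuming `capacity` is the number of free slots
/// and `max_capacity` is the total size of the channel buffer.
pub fn queued(capacity: usize, max_capacity: usize) -> usize {
    // Saturate rather than underflow if the two reads were taken at
    // slightly different moments and disagree.
    max_capacity.saturating_sub(capacity)
}

/// Fraction of the channel that is in use, in the range 0.0..=1.0.
pub fn fullness(capacity: usize, max_capacity: usize) -> f64 {
    if max_capacity == 0 {
        return 0.0; // a zero-sized buffer has nothing to fill
    }
    queued(capacity, max_capacity) as f64 / max_capacity as f64
}
```

With 7 free slots out of 10, `queued` gives 3, which is the `3/10` shape used in the log line proposed in this issue.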

I'm unsure about the trace level - probably debug? You might also want to select certain stages.

DEBUG stage=block_hash_verification time=10ms in_queue=3/10 Item processed
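For illustration only, the line above could be assembled from the three pieces of information like this. The function name is hypothetical and the sketch uses plain `format!`; a real implementation would presumably emit structured fields through whatever logging or tracing setup the node already uses, rather than building a string by hand.

```rust
use std::time::Duration;

/// Formats one per-item log line from the stage name, the processing time
/// and the input channel's occupancy.
pub fn stage_log_line(
    stage: &str,
    elapsed: Duration,
    queued: usize,
    max_capacity: usize,
) -> String {
    format!(
        "DEBUG stage={} time={}ms in_queue={}/{} Item processed",
        stage,
        elapsed.as_millis(), // whole milliseconds, sub-millisecond part dropped
        queued,
        max_capacity
    )
}
```

Millisecond resolution matches the example line, but fast stages would all report `0ms`, so a finer unit may be needed in practice.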

We should also create a template to display these stats, probably as three line charts (one per piece of information).