ChainSafe / lodestar

🌟 TypeScript Implementation of Ethereum Consensus
https://lodestar.chainsafe.io
Apache License 2.0
1.18k stars 289 forks source link

Organize and Update Metrics #7098

Open matthewkeil opened 1 month ago

matthewkeil commented 1 month ago

Problem description

There are a few goals that this issue is meant to discuss/cover.

Our collection of metrics has grown over time and the seems to be a lot of technical debt that needs to be addressed. The big thing is the organization of the metrics we currently collect. We have two giant files that are not very well organized so it is becoming harder to maintain. I would like to propose cleaning up the organization by shuffling the metrics into a few separate files so that they are easier to maintain.

There are also a number of metrics that are prescribed in ethereum/beacon-metrics that we are not collected as part of our metrics program. Ideally these will also get added.

There is also another creeping issue that I think should be addressed. We are starting to collect metrics is folders other than beacon-node and having the Metrics type there creates a circular dependency.

Solution description

Creating packages/metrics Folder

I would like to suggest that we move the metrics out of the beacon-node package and into its own package. This will make it easier to import the Metrics interface into other packages. I also am thinking we can export a singleton metrics object from the package. The package could export a getter function that obtains the instance of the singleton metrics object for addition to the chain object as it is today. But this would open the possibility of getting the metrics object in any help function or the other packages without needing to pass the chain object around. Not sure if this is strictly necessary but might make our lives easier going forward.

As an example we have metrics in reqresp and in state-transition that could easily be centralized.

Nesting Metrics

I am not sure if added a lot of nested domains will make using the metrics object easier or more complicated. This is the first thing that should be decided. Is it better to nest similar metrics within domains, under the metrics object or should we keep a flat structure so its easier to find what one is looking for using intellisense

metrics.forkChoice.findHeadSeconds vs metrics.findHeadSeconds

My gut tells me that some minimal nesting will be nice due to the large number of metrics that we collect but using a deeply nested structure may become a pain to deal with as some metrics tend to cross domain boundaries and it will be challenging to decide "where" to put things.

metrics.chain.bls.multithread.aggregatedPublicKeysCount() vs metrics.bls.aggregatedPublicKeys

I actually kinda flip-flopped on this while writing this up so its worth discussing further as I'm not really sure which will be best myself.

File Structure

I think we really need to split up the two giant files into a number of smaller files so its easier to maintain. Seems like an easy win.

lodestar Metric Prefix

This seems a bit overkill to add to basically every single metric. We only collect metrics on "lodestar" which is a "beacon" node so separating the two seems silly. The only reason would be to differentiate "ethereum standard" metrics from our lodestar specifics but I am not sure that is necessary. I do think however that we should be prefixing with the "domain" a metric represents and having that domain match the filename seems like a nice to have so its clear where it is without having to do a global file search.

Additional context

No response

nflaig commented 1 month ago

There are also a number of metrics that are prescribed in ethereum/beacon-metrics that we are not collected as part of our metrics program. Ideally these will also get added.

We should be implementing at least the interop metrics which seems like we do. Other metrics I would probably not add if nobody is asking for it and we don't deem them useful.

metrics.forkChoice.findHeadSeconds vs metrics.findHeadSeconds

nesting seems useful to differentiate metrics and add more context imo

This seems a bit overkill to add to basically every single metric. We only collect metrics on "lodestar" which is a "beacon" node so separating the two seems silly.

I think we should prefix non-spec'd metrics with lodestar. I don't think the prefix is overkill, it's a good practice to use the software name as prefix to avoid clashes if users collect metrics from a lot of different services.

Creating packages/metrics Folder

was thinking about that too at one point but there is really not that much code which is why I decided to add the metric types required by a lot of packages to packages/utils and the code to run the metrics server, create registry, etc. seemed fine to keep in beacon-node and wire it up in packages/cli.

We are starting to collect metrics is folders other than beacon-node and having the Metrics type there creates a circular dependency.

Where is this the case? Last time I looked at this https://github.com/ChainSafe/lodestar/pull/6201 and refactored / moved the code around, I removed all duplications and types should only be imported from utils package.

File Structure

Seems reasonable to split it up into multiple files although I don't think the reason the metrics grow over time is because of using a single file. I think the problem is rather that we don't remove metrics unless the code is removed, but we could re-evaluate some metrics added during initial implementation and remove them if the data does not seem necessary, especially if those metrics are not even part of any panel.

matthewkeil commented 1 month ago

There are also a number of metrics that are prescribed in ethereum/beacon-metrics that we are not collected as part of our metrics program. Ideally these will also get added.

We should be implementing at least the interop metrics which seems like we do. Other metrics I would probably not add if nobody is asking for it and we don't deem them useful.

I tend to disagree. I think we should meet the ethereum/beacon-metrics spec just like the other specs. We cannot be sure who or why is expecting those so it is best to include them. The cost to do so is minimal and its the right thing to do as good stewards of the protocol

metrics.forkChoice.findHeadSeconds vs metrics.findHeadSeconds

nesting seems useful to differentiate metrics and add more context imo

👍

This seems a bit overkill to add to basically every single metric. We only collect metrics on "lodestar" which is a "beacon" node so separating the two seems silly.

I think we should prefix non-spec'd metrics with lodestar. I don't think the prefix is overkill, it's a good practice to use the software name as prefix to avoid clashes if users collect metrics from a lot of different services.

I think we should use "beacon" similar to the spec'd metrics. There is no reason to differentiate ones that are "lodestar specific beacon metrics" vs "beacon metrics." Those feel like one and the same. But this is not a hill i would die on and am open to suggestions.

Creating packages/metrics Folder

was thinking about that too at one point but there is really not that much code which is why I decided to add the metric types required by a lot of packages to packages/utils and the code to run the metrics server, create registry, etc. seemed fine to keep in beacon-node and wire it up in packages/cli.

I think creating a separation of concerns is a nice to have here. In particular with regards to the next response about circular dependency.

We are starting to collect metrics is folders other than beacon-node and having the Metrics type there creates a circular dependency.

Where is this the case? Last time I looked at this #6201 and refactored / moved the code around, I removed all duplications and types should only be imported from utils package.

At a minimum I know there are metrics in reqresp and in state-transition. I found it frustrating to add metrics for async shuffling because that happens in state-transition and can imagine as the codebase grows there will be other instances where it is nice to add metrics places, other than in beacon-node.. As a note, to accommodate the metrics for async shuffling I needed to make an interface that meets the metrics object that gets passed in so that they were available. This is not a great idea as there are now two places that those definitions exist (interface in state-transition and implementation in beacon-node). It would have been much better to be able to just pass in the Metrics type in state-transition instead of duplicating it.

File Structure

Seems reasonable to split it up into multiple files although I don't think the reason the metrics grow over time is because of using a single file. I think the problem is rather that we don't remove metrics unless the code is removed, but we could re-evaluate some metrics added during initial implementation and remove them if the data does not seem necessary, especially if those metrics are not even part of any panel.

Agreed. Seems like a nice to have so its more maintainable than 1000+ line files... Will also make it more meaningful to maintain (ie remove unused metrics) if its easier to grok what is cooking.

nflaig commented 1 month ago

The cost to do so is minimal and its the right thing to do as good stewards of the protocol

The problem with the beacon-metrics seems that it is not really well maintained, we might wanna clean up there first, see which metrics are actually implemented by other clients at this point.

There is no reason to differentiate ones that are "lodestar specific beacon metrics" vs "beacon metrics."

It's good to know as a consumer, assuming all clients implement the standard beacon metrics you know which ones are interchangeable and which ones are client specific.

It would have been much better to be able to just pass in the Metrics type in state-transition instead of duplicating it.

you mean similar to what we do in reqresp?

https://github.com/ChainSafe/lodestar/blob/bb40ef7eb77f9441ec66ab8f1fdf6664240bb1a3/packages/reqresp/src/metrics.ts#L9

matthewkeil commented 1 month ago

The cost to do so is minimal and its the right thing to do as good stewards of the protocol

The problem with the beacon-metrics seems that it is not really well maintained, we might wanna clean up there first, see which metrics are actually implemented by other clients at this point.

Correct. That is the heart of this issue/effort.

There is no reason to differentiate ones that are "lodestar specific beacon metrics" vs "beacon metrics."

It's good to know as a consumer, assuming all clients implement the standard beacon metrics you know which ones are interchangeable and which ones are client specific.

I suppose I see both sides of this.

The only consumers of these metrics should be dashboards and in an idea world the dashboards should look the similar across clients. It would be great to be able to standardize metrics methodology across all clients and then for implementation specific dashboards the data would just not show up.

IE GC and event loop would not propagate on lighthouse but all the fork choice ones should be similar and plug-and-play

Prefixing with lodestar only seem relevant for NodeJs like stuff that would never be relevant on Lighthouse but for all others that "could" show up on any beacon node i think we should stick to prefixing with the domain that is represented.

This will help to push the standardization effort that should be a thing for all Ethereum. We could be thought leaders in this space if-so... I have started to add context and am interested in working more on the ethereum/beacon-metrics repo toward this end.

It would have been much better to be able to just pass in the Metrics type in state-transition instead of duplicating it.

you mean similar to what we do in reqresp?

https://github.com/ChainSafe/lodestar/blob/bb40ef7eb77f9441ec66ab8f1fdf6664240bb1a3/packages/reqresp/src/metrics.ts#L9

This is exactly the issue that I think should be solved. All metrics should be in one place to avoid confusion and duplication. IE we have metrics definitions in several places:

In particular the most egregious is the state-transition interface that requires duplication for type/interface and beacon-node based implementation.

The other thing that stands out to me is that we have gossipsub metrics in both. This should not be true, they should be together for both organizational and maintenance reasons.

The stuff I am also not sure how to handle, but think should also be dealt with is the worker metrics. Those will need a separate instance of the RegistryMetricCreator but having the actual function that does the individual register.*({}) should all be together with the other metrics that are collected. We do not separate the different dashboards and in similar fashion I do not think that we should be segregating the metrics either.

nflaig commented 1 month ago

All metrics should be in one place to avoid confusion and duplication. IE we have metrics definitions in several places:

Not sure I agree here, e.g. in the monitoring service it makes much more sense to define the metrics in the constructor instead of somewhere central. This makes it much easier to see where they are used and only adds them if the monitoring service is enabled, I like this pattern.

matthewkeil commented 1 month ago

All metrics should be in one place to avoid confusion and duplication. IE we have metrics definitions in several places:

Not sure I agree here, e.g. in the monitoring service it makes much more sense to define the metrics in the constructor instead of somewhere central. This makes it much easier to see where they are used and only adds them if the monitoring service is enabled, I like this pattern.

I suppose monitoring service could be separate, but not sure why it should be. Its just as easy to have a few exports from packages/metrics and import the correct one into the correct process. Not a hill im willing to die on though for the 5 metrics used by that service.

But big picture is that we have a unified place "where metrics live" kind of like we have a unified place "that dashboards live"

matthewkeil commented 1 month ago

I think the bigger picture though is for something like gossipsup metrics. I am trying to add the ones requested for peerDas and having a hard time with it. Need to dig through how all the metrics are added and still not entirely sure but I know that they need to be added here:

https://github.com/ChainSafe/lodestar/blob/mkeil/standardized-peerdas-metrics/packages/beacon-node/src/network/core/networkCore.ts#L339

However gossip lives both on the worker thread and on the main thread depending on what stage of the process one is attempting to get telemetry from.

Adding the data collectors and then getting access to them is proving a digging exercise.... Having a unified place where they live will also lend itself to a README.md that could explain how/where/why to add metrics so things wire together correctly.

philknows commented 1 month ago

I think we should discuss this tomorrow as well with additional perspectives, but to summarize from what I'm seeing so far with the proposed solutions:

Creating packages/metrics Folder

Arguments For:

Questions:

Nesting Metrics

Arguments For:

Questions

File Structure

Arguments For:

Questions

Lodestar Metric Prefix

Arguments For:

Questions: