ethereum / trin

An Ethereum portal client: a json-rpc server with nearly instant sync, and low CPU & storage usage

Monitor CPU usage and internally rate limit #1545

Open carver opened 1 day ago

carver commented 1 day ago

What's wrong?

Especially with the state network, and when onboarding a new node, trin can redline the CPU while accepting offers and re-offering to peers. (I tended to see my fans spinning up when trin reached about 10 offers accepted per second).

Just like with storage, we don't want to overuse the CPU. We prefer the client to feel light, like something that can run constantly in the background.

Possible solution

Monitor cpu_time::ProcessTime inside trin, and cap it at some small % by default (2%? 5%? 10%?). Add a CLI flag to manually configure it.
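For reference, a minimal sketch of what that measurement could look like using the cpu_time crate. The helper name, the blocking sleep, and the cap value are all illustrative; inside trin this would presumably run on an async interval rather than a dedicated sleeping thread:

```rust
use std::time::{Duration, Instant};

use cpu_time::ProcessTime;

/// Hypothetical helper: the fraction of one core's worth of CPU this
/// process consumed over `window`. On multi-core machines this can
/// exceed 1.0, since ProcessTime sums CPU time across all threads.
fn cpu_fraction_over(window: Duration) -> f64 {
    let wall_start = Instant::now();
    let cpu_start = ProcessTime::now();
    std::thread::sleep(window); // stand-in for an async interval tick
    cpu_start.elapsed().as_secs_f64() / wall_start.elapsed().as_secs_f64()
}

fn main() {
    let usage = cpu_fraction_over(Duration::from_secs(10));
    // The cap (5% here) would come from the proposed CLI flag.
    if usage > 0.05 {
        println!("over CPU budget: {:.1}% of one core", usage * 100.0);
    }
}
```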

Every 10s that the CPU is above the limit, shrink the data radius of the state network by 10% of its current level. (This assumes, based on current experience, that state is the only offender for high CPU usage.) Every 10s that the CPU is below half the limit and data storage is under target, grow the radius by 10%. I think we want to be quite responsive, which this accomplishes by being willing to cut the radius roughly in half within 1 minute (0.9^6 ≈ 0.53).

This approach might actually accelerate state nodes finding their natural "true" radius point, and mitigate the fill & dump behavior when launching a fresh client (which is slower and more painful on state than on history).
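To make the control loop concrete, here's a minimal sketch of the adjustment rule described above. Everything here is a placeholder: trin's real radius is a U256 distance rather than an f64 fraction, and the storage-target check would come from the content store:

```rust
/// Hypothetical knobs for the proposed feedback loop.
struct RadiusController {
    radius: f64,  // fraction of the keyspace, 0.0..=1.0
    cpu_cap: f64, // e.g. 0.05 for a 5% cap
}

impl RadiusController {
    /// Called once per 10s tick with the measured CPU fraction and
    /// whether stored data is still under the storage target.
    fn tick(&mut self, cpu_fraction: f64, under_storage_target: bool) {
        if cpu_fraction > self.cpu_cap {
            // Over budget: shrink radius by 10% of its current level.
            self.radius *= 0.9;
        } else if cpu_fraction < self.cpu_cap / 2.0 && under_storage_target {
            // Comfortably under budget and under storage target: grow by 10%.
            self.radius = (self.radius * 1.1).min(1.0);
        }
        // Six consecutive shrinks cut the radius to 0.9^6 ≈ 0.53 of its
        // starting value, i.e. roughly half in one minute.
    }
}
```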

Challenges

trin could use CPU for other reasons, like user interaction (i.e. when someone is using the RPC API). It is easy to imagine CPU usage spiking then, and it would be wrong to mess with the state radius at that point. We probably cannot punt on this, and will need to include a solution in the first implementation.

Another awkward aspect of this approach is that it's hard to tell which network is using too much CPU. We probably don't want to adjust every network's radius at once if CPU is high. Right now, it's only ever state, so I think we can punt on this challenge. But if multiple networks start using a lot of CPU, we might want a cleverer way to measure computation than just checking cpu_time::ProcessTime.

Timeline

I won't try to get this into the imminent stable release, of course, just planning ahead. This is another good reason not to enable state by default.

morph-dev commented 1 day ago

Do we know (and if not, can we measure) what exactly is causing high CPU usage? Is it the process of validating state content, gossiping that content or something else?

Instead of shrinking/growing the radius, should we maybe reduce/increase the number of parallel requests that we are accepting?
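For what it's worth, a minimal sketch of that alternative, assuming tokio; `next_offer` and `validate_and_store` are hypothetical stand-ins for trin's real offer pipeline:

```rust
use std::sync::Arc;

use tokio::sync::Semaphore;

// Hypothetical stand-ins for receiving and processing an offer.
async fn next_offer() -> Vec<u8> {
    vec![]
}
async fn validate_and_store(_offer: Vec<u8>) {}

#[tokio::main]
async fn main() {
    // Cap concurrent offer processing at 4 in-flight validations.
    let permits = Arc::new(Semaphore::new(4));
    loop {
        let offer = next_offer().await;
        // Blocks accepting new offers once all permits are in use.
        let permit = permits.clone().acquire_owned().await.expect("semaphore closed");
        tokio::spawn(async move {
            validate_and_store(offer).await;
            drop(permit); // releasing the permit admits the next offer
        });
    }
}
```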

I think your proposed solution is a good first step.

KolbyML commented 1 day ago

This approach might actually accelerate state nodes finding their natural "true" radius point, and mitigate the fill & dump behavior when launching a fresh client (which is slower and more painful on state than on history).

Hmm, maybe in that case we should just default our total radius value to something closer to the target value, because per the Challenges section, checking CPU may have too many possible edge cases.

I wonder if any EL or CL client checks CPU usage to drive this kind of behavior.

If the storage limit is 1GB, then along the lines of what Jason said, the node would prune a majority of what it accepts, which would end up being wasted CPU cycles. So maybe targeting the ideal radius (or close to it) at the start is what we should do in general.

So maybe this is just a sign that a 100% default radius is a bad idea.

Is there any value in a node that tries to populate its database as fast as possible if no other node on the network would ever ask it for that data? If a node's radius is 100%, the radius heuristic we use for finding data isn't that useful, since the odds of it actually storing any given piece of content are close to zero. So maybe a default radius of 50% is better? Or name a number.

I think Jason has the right idea with this issue, but maybe this is a case for Occam's razor, and the simplest solution is the best.

Idk, I guess I gotta think more about the pros and cons, but it doesn't seem like we lose anything by starting with a smaller radius. I think this point is supported by the "Content Replication, by Content ID Prefix" graph on glados, which shows how we ignore nodes with a large radius. The con of 100% is that tons of CPU time and storage cycles are used for nothing.

Anyway, I forgot what a lot of the graphs mean on Grafana. I should probably write descriptions when I remember so that when I forget, I can read them again.

But yeah, it will be interesting to see how this problem gets solved.

pipermerriam commented 19 hours ago

I'm dubious about using the CPU as the measuring tool for this... and I don't have a very solid mental model of how our client uses the CPU cores. Here's my 2 cents on how I'd suggest going about this...

First, I think we need to measure and understand where the offending CPU usage is happening: what, specifically, in our code is using all of that CPU time. If we don't know what is using the time, I don't think we can address the problem effectively.
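As a starting point for that measurement, here's a rough sketch of wrapping a suspected hot path with cpu_time::ThreadTime. The `timed` helper and the label are hypothetical, and in an async runtime this only attributes cleanly around synchronous sections, since ThreadTime measures the current OS thread:

```rust
use cpu_time::ThreadTime;

/// Hypothetical helper: run `work` and report how much CPU time the
/// current thread spent on it.
fn timed<T>(label: &str, work: impl FnOnce() -> T) -> T {
    let start = ThreadTime::now();
    let result = work();
    println!("{label}: {:?} of CPU time", start.elapsed());
    result
}

fn main() {
    // Placeholder workload standing in for e.g. state content validation.
    let sum = timed("state content validation (example)", || {
        (0u64..1_000_000).sum::<u64>()
    });
    println!("result: {sum}");
}
```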

After we know the culprit for the high CPU usage, we should then be able to devise a strategy for effective rate limiting. Without that, I think we're in the realm of "premature optimization".

With regards to this being a required feature for the release, I don't think it is... but I agree it's something that we need to address in the short term. We aren't yet soliciting people to install and run this long term on their computers. At the point where that changes, I agree that we will need to have this dealt with.