why do failed/stale price-feed updates take so long?

aj-agoric commented 4 months ago

What is the Problem?

Just after a chain upgrade (when the chain has been halted for many minutes), we've noticed a big burst of price-feed activity, enough to slow the newly-restarted chain down significantly.

We know that price-feed txn execution takes a long time, which makes the chain slower, which makes the update more likely to be stale, which can increase the chance that we get more price updates, which creates a feedback loop. We can end up spending most of our time doing useless work.

We think part of the problem is that the (external) oracles aren't backing off when their txns fail. Perhaps their code isn't watching for the txn submission to succeed before sending in the next update. We'd like them to avoid having more than one or two updates in the pipeline.
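To make the request concrete, here is a minimal sketch of the behavior we'd like from an oracle client: cap the number of in-flight updates and back off exponentially after failures. This is not the oracles' actual code; `UpdateThrottle`, its parameters, and the default limits are all hypothetical.

```python
import random


class UpdateThrottle:
    """Hypothetical client-side limiter (not real oracle code).

    Caps the number of in-flight updates and computes an exponential
    backoff delay after consecutive failures.
    """

    def __init__(self, max_in_flight=2, base_delay=1.0, max_delay=60.0):
        self.max_in_flight = max_in_flight  # assumed limit: "one or two"
        self.base_delay = base_delay        # seconds, assumed
        self.max_delay = max_delay          # seconds, assumed
        self.in_flight = 0
        self.consecutive_failures = 0

    def can_submit(self):
        return self.in_flight < self.max_in_flight

    def on_submit(self):
        if not self.can_submit():
            raise RuntimeError("too many updates in flight")
        self.in_flight += 1

    def on_result(self, succeeded):
        # Call once per submitted txn, when its outcome is known.
        self.in_flight -= 1
        if succeeded:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1

    def next_delay(self, jitter=random.random):
        # No delay while healthy; otherwise base * 2^(failures-1), capped,
        # scaled by a jitter factor in [0.5, 1.0] to avoid synchronized retries.
        if self.consecutive_failures == 0:
            return 0.0
        delay = min(self.max_delay,
                    self.base_delay * 2 ** (self.consecutive_failures - 1))
        return delay * (0.5 + 0.5 * jitter())
```

A client would check `can_submit()` and sleep for `next_delay()` before each submission, and call `on_result()` only once the txn outcome has actually been observed.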

A second part is that a failing execution seems to take longer than a successful one. That seems backwards: when the stale timestamp is detected, it ought to short-circuit some amount of work, so failing updates really ought to take less time than successful ones.
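As an illustration of that expectation only (this is not the contract code, and `process_update`, `last_accepted_timestamp`, and `do_expensive_work` are hypothetical names), a stale check placed first would look roughly like this:

```python
def process_update(update, last_accepted_timestamp, do_expensive_work):
    """Hypothetical handler: reject stale updates before any expensive work."""
    if update["timestamp"] <= last_accepted_timestamp:
        # Stale: return early, so a failing update costs less than a good one.
        return False
    do_expensive_work(update)
    return True
```

If the real code path behaves like this, a failing update should never be slower than a successful one, which is why the observed timings look suspicious.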

The task here is to analyze the blocks that execute price-feed updates, both successful and failing, and see if we can figure out what happens differently. Some subtasks:

- I'll update my "classify-runs" tool to distinguish between a successful update and a failing one
  - then we'll collect computron/crank/wallclock-seconds from both types and draw some scatter plots, to see if there's a clear slowdown on either side
- I'll also make a chart where the x-axis is block number, and y-axes have 1: number of successful price-feed events, 2: number of failed price-feed events, 3: inter-block time
  - I'm expecting to see a wave of slowdown after the events, but I'm also curious to know the number of failed events in each window of time, over time: do we mostly see successful ones? are there bursts of failures?
- look carefully at the slog traces of successful-vs-failing and enumerate the extra work being done, and understand why
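The data collection in the first two subtasks could be sketched roughly as follows. This is not the actual "classify-runs" tool: it assumes each run has already been classified into a record with hypothetical fields `block`, `outcome` ("success" or "failure"), `computrons`, `cranks`, and `seconds`, and that block times are available as a block-number-to-timestamp mapping.

```python
from collections import defaultdict
from statistics import median

# Hypothetical field names for the per-run metrics.
METRICS = ("computrons", "cranks", "seconds")


def summarize_by_outcome(runs):
    """Median of each metric, grouped by outcome ("success" or "failure")."""
    groups = defaultdict(list)
    for run in runs:
        groups[run["outcome"]].append(run)
    return {
        outcome: {m: median([r[m] for r in rs]) for m in METRICS}
        for outcome, rs in groups.items()
    }


def per_block_counts(runs):
    """Sorted (block, successes, failures) rows for the block-number chart."""
    counts = defaultdict(lambda: [0, 0])
    for run in runs:
        index = 0 if run["outcome"] == "success" else 1
        counts[run["block"]][index] += 1
    return [(block, ok, bad) for block, (ok, bad) in sorted(counts.items())]


def inter_block_times(block_times):
    """Map each block to the seconds elapsed since the previous known block."""
    blocks = sorted(block_times)
    return {b: block_times[b] - block_times[a]
            for a, b in zip(blocks, blocks[1:])}
```

The per-outcome medians give a quick first check of the "failures are slower" hypothesis before drawing scatter plots, and the per-block rows plus inter-block times are the three series for the chart.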

The overall goal is to make price-feed updates faster, but this particular ticket will focus on making failing updates faster, or at least not slower than successful ones. If there's something pathological we're doing upon failure, maybe we can stop doing that.

The outcome may be changes to the price-feed or scaled-price-authority contracts. If so, deployment will require contract upgrades, like the ones we took out of upgrade16, which are now planned for a separate core-eval deployment.

toliaqat commented 3 months ago

@rabi-siddique, could you write a test to identify the source of this problem and eliminate any code sections from suspicion?

warner commented 3 months ago

Note: I'm no longer certain that failed updates take significantly longer than successful ones. I'm trying to collect enough data to prove/disprove that hypothesis, but for now, don't take its truth for granted.

rabi-siddique commented 3 months ago

@warner

Could you point me to the specific sections of the codebase where the update process is handled? Which files or modules should I focus on to inspect the issue?
Could you share the steps or scenarios that you've used so far to reproduce the issue on your end?
