Agoric / agoric-sdk

monorepo for the Agoric Javascript smart contract platform
Apache License 2.0

why do failed/stale price-feed updates take so long? #9714

Open aj-agoric opened 4 months ago

aj-agoric commented 4 months ago

What is the Problem?

Just after a chain upgrade (during which the chain had been halted for many minutes), we noticed a big burst of price-feed activity, enough to slow the newly restarted chain down significantly.

We know that price-feed txn execution takes a long time, which makes the chain slower, which makes the update more likely to be stale, which can increase the chance that we get more price updates, which creates a feedback loop. We can end up spending most of our time doing useless work.

We think part of the problem is that the (external) oracles aren't backing off when their txns fail. Perhaps their code isn't waiting for one txn submission to succeed before sending in the next update. We'd like them to avoid having more than one or two updates in the pipeline.
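To make the desired oracle behavior concrete, here is a minimal sketch of a client-side throttle that caps in-flight updates and backs off exponentially after a failure. It is not taken from any real oracle implementation: the `SubmissionThrottle` class, its method names, and the default limits are all hypothetical, and the caller is assumed to invoke `onSubmitted` and `onResult` around its own txn submission code.

```typescript
// Hypothetical oracle-side throttle. All names and defaults are illustrative
// assumptions, not part of any existing oracle or agoric-sdk API.
class SubmissionThrottle {
  private inFlight = 0;
  private delayMs: number;
  private readonly maxInFlight: number;
  private readonly baseDelayMs: number;
  private readonly maxDelayMs: number;

  constructor(maxInFlight = 2, baseDelayMs = 1000, maxDelayMs = 60000) {
    this.maxInFlight = maxInFlight;
    this.baseDelayMs = baseDelayMs;
    this.maxDelayMs = maxDelayMs;
    this.delayMs = baseDelayMs;
  }

  /** True if another update may be submitted right now. */
  canSubmit(): boolean {
    return this.inFlight < this.maxInFlight;
  }

  /** Record that a txn was broadcast and is awaiting its result. */
  onSubmitted(): void {
    if (!this.canSubmit()) {
      throw new Error('too many updates in flight');
    }
    this.inFlight += 1;
  }

  /**
   * Record a txn outcome. Returns the delay (ms) to wait before the next
   * submission: doubled (up to maxDelayMs) on failure, reset on success.
   */
  onResult(succeeded: boolean): number {
    if (this.inFlight === 0) {
      throw new Error('no update in flight');
    }
    this.inFlight -= 1;
    this.delayMs = succeeded
      ? this.baseDelayMs
      : Math.min(this.delayMs * 2, this.maxDelayMs);
    return this.delayMs;
  }
}
```

An oracle using something like this would skip a price update whenever `canSubmit()` is false, rather than queueing it, so a slow or recently restarted chain never sees more than `maxInFlight` pending updates from that oracle.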

A second part is that a failing execution seems to take longer than a successful one. That seems backwards: when the stale timestamp is detected, it ought to short-circuit some amount of work, so failing updates really ought to take less time than successful ones.
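The expectation can be sketched as follows. This is an illustration only, not the actual price-feed contract code: the `PriceUpdate` shape, `applyIfFresh`, and the `expensiveWork` callback are hypothetical names, and staleness is assumed to mean "timestamp not newer than the latest accepted one".

```typescript
// Hypothetical shape of an incoming update; not the real contract types.
interface PriceUpdate {
  timestamp: number;
  price: number;
}

type UpdateOutcome = 'applied' | 'stale';

/**
 * Check staleness before doing anything costly, so a rejected update
 * does strictly less work than an accepted one.
 */
function applyIfFresh(
  update: PriceUpdate,
  latestTimestamp: number,
  expensiveWork: (u: PriceUpdate) => void,
): UpdateOutcome {
  if (update.timestamp <= latestTimestamp) {
    // Short-circuit: the stale path never reaches expensiveWork.
    return 'stale';
  }
  expensiveWork(update);
  return 'applied';
}
```

If the failing path in the real contracts is slower than the successful one, that suggests some costly step runs before the staleness check, or that the failure handling itself is expensive.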

The task here is to analyze the blocks that execute price-feed updates, both successful and failing, and see if we can figure out what happens differently. Some subtasks:

The overall goal is to make price-feed updates faster, but this particular ticket will focus on making failing updates faster, or at least not slower than successful ones. If there's something pathological we're doing upon failure, maybe we can stop doing that.

The outcome may be changes to the price-feed or scaled-price-authority contracts. If so, deployment will require contract upgrades, like the ones we took out of upgrade16, which are now planned for a separate core-eval deployment.

toliaqat commented 3 months ago

@rabi-siddique, could you write a test to identify the source of this problem and eliminate any code sections from suspicion?

warner commented 3 months ago

Note: I'm no longer certain that failed updates take significantly longer than successful ones. I'm trying to collect enough data to prove/disprove that hypothesis, but for now, don't take its truth for granted.

rabi-siddique commented 3 months ago

@warner