EOSIO / eos

An open source smart contract platform
https://developers.eos.io/manuals/eos
MIT License
11.28k stars 3.6k forks source link

Reliability: Error 3080006: Transaction took too long #10827

Open n13 opened 2 years ago

n13 commented 2 years ago

This bug has been discussed here a few times [ #4077, #9677 ]

Our dapp is affected by this. Our app is hard to use because of it. And there is no solution other than on the EOSIO implementation, cache, and timer.

Developers have argued this is not a bug - the cache needs to be filled, this takes time, and this also exceeds the node total tx time limit.

So I feel this is either

  1. A bug in the node cache timer. We can see that the CPU timer is actually stopped while the caches are loading. See the code here ---> wasm_interface_private.hpp, line 58

However, that pause does not seem to affect the node's 30ms timeout. I assume the timer that is paused there is a different timer from the one used for the 30ms timeout. Then, it fails on the first try, and succeeds on the second try.

This is clearly a failure of EOSIO timekeeping. It's not that the action takes too long - our action executes inside 5ms every time.

  1. A bug in the cache implementation. If the cache loading is actually truly part of the block time, and needs to happen inside 500ms, then that can't work. But the caches can be loaded totally independently of block time. They can be loaded in parallel, and actions can be executed once caches are ready. This is a matter of node code, not the chain. I doubt this is the reason though - the above code snippet shows the developers intent to keep the timer out of the cache loading.

Summary

The chain keeps track of time for billing

The node implementation mixes this timekeeping up with cache loading, which actually has nothing to do with the chain, and its timekeeping. The chain doesn't care if there's a cache, or if everything is kept in RAM, etc. So a good cache implementation would keep the time it takes to load the cache completely away from the chain and its time-keeping functions.

Steps to reproduce

  1. Run an action on a given node for the first time
  2. Run the same action again

Compare the time it took to execute the action between 1 and 2

Expected

The time to execute a given blockchain function given similar data and similar input is about the same - within 1 order of magnitude

Actual

The time to execute for the first action is significantly higher than the subsequent executions. A perfectly valid, efficient action may not execute if run for the first time since it takes too long to fill the caches. This makes the blockchain highly unreliable for submitting actions.

heifner commented 2 years ago

Thanks for this nice writeup of your issue. Do you have example run times for 1, 2? Is this on a public network or private? Are you interacting with a local nodeos? If running a local nodeos please provide version, config.ini, and command line arguments. Particularly interested in database-map-mode, wasm-runtime, and any oc settings.

n13 commented 2 years ago

It's on the Telos network

(1) Caused an error Error: deadline exceeded 30217us

https://github.com/hypha-dao/dao-contracts/issues/153

(2) is the same action can be seen here, working - I have observed it taking between 4.5 and 6.5 ms.

https://telos.bloks.io/transaction/97bac9a47818877814c6ce517888999230a28da16533fcb72808b433667d0b14

PPS: Actually nevermind it is the same error Error 3080006: Transaction took too long Error Details: deadline exceeded 30217us pending console output:

sanaraufx commented 2 years ago

Are you using eosvm interpreter or eosvm jit? This will be present in either your nodeos command line or the config.ini? Can you please both those details?

n13 commented 2 years ago

Are you using eosvm interpreter or eosvm jit? This will be present in either your nodeos command line or the config.ini? Can you please both those details?

I am not running a node, just observing that this bug affects our contract execution and dapp.

Therefore, I would like to submit it to the EOS teams so that EOS can improve this issue, and have functioning dapps.

The problem with this issue is it would affect every dapp on any EOSIO chain, and it would cause a chance of a timeout on the first time any action is executed, irrespective of how long the action actually takes once caches are filled.

How long an action takes to execute on a node is something a dapp developer can influence

Cache timeouts - a dapp doesn't have a choice on this.

A workaround would be to try all actions twice, the first time to warm up the cache, the second to run the action. Seems really inefficient and clunky.

sanaraufx commented 2 years ago

If you are depending on a public EOSIO endpoint for your Dapp, it might not be ideal due to the varying CPU capabilities of those nodes so you will get varying results for your transaction duration when you try different endpoints. A better approach that many developers use is to run a local node with eosvm jitter (and OC).

spoonincode commented 2 years ago

It sounds like you're encountering a long known deficiency in nodeos' implementation that has existed since 1.0. Nearly all implemented WASM runtimes past and present (binaryen, wabt, WAVM, EOS VM, EOS VM JIT) must re-compile the contract code when encountering a contract action after nodeos is restarted. As you've discovered, the time spent doing this compile is not billed, but the transaction deadline is still enforced. An action that takes >30ms (or whatever node is configured with) to compile the contract will always have its first action rejected (if it is subjectively executing). Large contracts on slow hardware can easily bump in to this deadline.

EOS VM OC is designed to not have this limitation. The compiled contract code is stored on disk ready to be used immediately on nodeos restart.

My recommendation is to run your own node with EOS VM OC enabled and submit transactions to it. This will ensure your own node never rejects a transaction due to contract compilation on nodeos restarts. You should then have a handful of peers connected to this node to ensure that the transaction is forwarded along to the BPs, as it's possible one of those nodes between your node and BPs has recently been restarted without OC and will reject the transaction.