EOSIO / eos

An open source smart contract platform
https://developers.eos.io/manuals/eos
MIT License

Nodeos crashes with segfault when there is a high load of accept_transaction executions in a custom plugin #6380

Closed ghost closed 5 years ago

ghost commented 5 years ago

Hello,

I developed an EOS plugin that acts as a TCP server: it receives client requests, processes them, persists them in the blockchain by calling smart contracts I created, and sends a response to the client. The plugin is started alongside nodeos with the --plugin option. It then creates a thread (inside plugin_startup) that works as a TCP server, waiting for clients to connect to a specified port and send their data. I tried two versions of the plugin: in one, I created a separate thread for every client request (concurrent writes into the blockchain from multiple threads); in the other, a single thread processes every connection (sequential writes into the blockchain from a single thread).

As part of my plugin, I use the following methods, which I wrote after digging into the EOS code, samples and tests, to persist the data into the blockchain:

  1. I tested two versions of this method for setting the transaction headers. First version:

    void my_plugin::set_transaction_headers(chain::transaction& trx, uint32_t expiration, uint32_t delay_sec) {
        chain_plugin& cp = app().get_plugin<chain_plugin>();
        controller& cc = cp.chain();
        // Expiration is relative to the current head block time.
        trx.expiration = cc.head_block_time() + fc::seconds(expiration);
        trx.max_net_usage_words = 0;
        trx.max_cpu_usage_ms = 0;
        trx.delay_sec = delay_sec;
    }

and second version (as seen in txn_test_gen_plugin):

    void my_plugin::set_transaction_headers(chain::transaction& trx, uint32_t expiration, uint32_t delay_sec) {
        controller& cc = app().get_plugin<chain_plugin>().chain();
        trx.expiration = cc.head_block_time() + fc::seconds(expiration);
        // Reference the last irreversible block.
        uint32_t reference_block_num = cc.last_irreversible_block_num();
        block_id_type reference_block_id = cc.get_block_id_for_num(reference_block_num);
        trx.set_reference_block(reference_block_id);
        trx.max_net_usage_words = 100;
        trx.max_cpu_usage_ms = 0;
        // A context-free "nonce" action keeps otherwise identical transactions distinct.
        static uint64_t nonce = static_cast<uint64_t>(fc::time_point::now().sec_since_epoch()) << 32;
        trx.context_free_actions.emplace_back(action({}, config::null_account_name, "nonce", fc::raw::pack(nonce++)));
    }

  2. Then I set the action with trx.actions.emplace_back(...)

  3. I sign the transaction with this method:

    void my_plugin::sign_transaction(chain::signed_transaction& trx) {
        using namespace eosio::wallet;
        wallet_manager wm;
        wm.set_dir("/path-to-my-wallet/eosio-wallet");
        wm.unlock("my_wallet", "wallet-password");

        chain_apis::read_only::get_account_params account_name{N(my_account)};
        chain_plugin& cp = app().get_plugin<chain_plugin>();
        controller& cc = cp.chain();
        chain_apis::read_only reader(cc, abi_serializer_max_time);
        chain_apis::read_only::get_account_results account_result = reader.get_account(account_name);
        std::vector<chain_apis::permission> permissions = account_result.permissions;
        chain::private_key_type priv_key;
        // Look up the private key that matches the first key of the "active" permission.
        for (const auto& p : permissions) {
            if (p.perm_name == "active") {
                public_key_type pub_key = p.required_auth.keys[0].key;
                priv_key = wm.list_keys("my_wallet", "wallet-password")[pub_key];
                break;
            }
        }
        trx.sign(priv_key, cc.get_chain_id());
    }

  4. And finally I execute the transaction with this method:

    void my_plugin::execute_transaction(chain::signed_transaction& trx) {
        chain_plugin& cp = app().get_plugin<chain_plugin>();
        // Note: exc is captured here but not declared anywhere in this excerpt.
        cp.accept_transaction(chain::packed_transaction(trx), [=, &exc](const fc::static_variant<fc::exception_ptr, chain::transaction_trace_ptr>& result) {
            if (result.contains<fc::exception_ptr>()) {
                const auto& e = result.get<fc::exception_ptr>();
                std::cerr << e->to_detail_string();
            } else {
                std::cout << "success";
            }
        });
    }

I have also created a separate client application that sends a request to the plugin (server) and receives a response confirming that the data has been saved in the blockchain. When I invoke the client in a for loop with (usually) more than 2000 iterations, after a short time I get, in most cases, one of the following errors:

  1. A segmentation fault without any visible error message, except in some rare cases where I get the following message just before the segfault:

    warn 2018-11-21T13:16:23.001 thread-0 producer_plugin.cpp:1435 maybe_produce_block ] 8 out_of_range_exception: Out of Range
    write datastream of length 68391 over by 1
    {"method":"write","len":68391,"over":1}
    thread-0 datastream.cpp:6 throw_datastream_range_error
    Segmentation fault (core dumped)

  2. If I don't encounter the segfault, I usually get this error:

    error 2018-11-21T10:54:04.027 thread-0 producer_plugin.cpp:1330 scheduleproduction ] Failed to start a pending block, will try again later
    warn 2018-11-21T10:54:04.077 thread-0 producer_plugin.cpp:1076 start_block ] 3060000 database_exception: Database exception
    db revision is not on par with head block
    {"db.revision()":2859,"controller_head_block":2860,"fork_db_head_block":2860}
    thread-0 controller.cpp:1087 start_block

  3. nodeos stops producing blocks (I stop seeing the 'Produced block...' log line) and appears to be frozen. I can't even shut it down with Ctrl+C; I have to kill it with kill -9.

In all of the cases above I can only restart nodeos by adding --delete-all-blocks (--hard-replay doesn't work).

I have the aforementioned problems in both plugin versions (single-threaded as well as multithreaded). If I comment out only the cp.accept_transaction(...) call in the execute_transaction(chain::signed_transaction& trx) method above, I have no problems at all: I can invoke the client in a for loop with even 10,000 iterations multiple times and the plugin works just fine (except, of course, that it doesn't execute the prepared transaction).

I also created a separate test application that calls my smart contract directly with 'push transaction...' (the same smart contract I am calling when executing the transaction from the plugin). The test application can issue the push transaction commands in a for loop with thousands of iterations (I have tried up to 10,000) in two ways, single-threaded and multithreaded. Both work as expected without any exceptions or segfaults, so the only way I get the segfaults or the aforementioned exceptions is by calling accept_transaction from my plugin.

I am synchronized with the following git commit: 59626f1e6361df3b715e926ee13a9a8e84d177af, and I use a single-node testnet.

Since I couldn't find any tutorial on EOS plugin development, I would like to know whether this is a known bug or whether I am missing something on my side, such as using the wrong API or using it inappropriately.

Thank you.

P.S. I forgot to add one significant error that also often appears with the segfault. From the coredump file analyzed with lldb:

    info 2018-11-23T14:33:08.883 thread-0 chain_plugin.cpp:333 plugin_initialize ] initializing chain plugin
    warn 2018-11-23T14:33:08.893 thread-0 chain_plugin.cpp:684 plugin_initialize ] 13 St13runtime_error: database dirty flag set
    rethrow database dirty flag set:
    {"what":"database dirty flag set"}
    thread-0 chain_plugin.cpp:684 plugin_initialize
    Failed to initialize
    Process 9332 exited with status = 255 (0x000000ff)

From the lldb terminal attached to the nodeos process ID:

    Process 8977 stopped
    thread #2, name = 'nodeos', stop reason = signal SIGSEGV: invalid address (fault address: 0x20)
    frame #0: 0x0000000000cfd7cd nodeos`eosio::chain::transaction_context::transaction_context(eosio::chain::controller&, eosio::chain::signed_transaction const&, fc::sha256 const&, fc::time_point) + 653
    nodeos`eosio::chain::transaction_context::transaction_context:
    0xcfd7cd <+653>: movl 0x20(%rax), %eax
    0xcfd7d0 <+656>: movq (%rbp), %rcx
    0xcfd7d4 <+660>: movl %eax, 0x20(%rcx)
    0xcfd7d7 <+663>: movq 0xa0(%rsp), %rbx
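One way to read this trace: the faulting instruction `movl 0x20(%rax), %eax` loads 4 bytes from `%rax + 0x20`, and the reported fault address is exactly 0x20, which is what a null `%rax` would produce. The sketch below illustrates that arithmetic with a hypothetical struct (`example_state` and `field_address` are made up for illustration and are not EOSIO types); which pointer was actually null inside `transaction_context` cannot be determined from the trace alone.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical layout for illustration only; not an EOSIO type.
struct example_state {
    std::uint64_t a;          // offset 0x00
    std::uint64_t b;          // offset 0x08
    std::uint64_t c;          // offset 0x10
    std::uint64_t d;          // offset 0x18
    std::uint32_t block_num;  // offset 0x20, a 4-byte field like the one movl reads
};

static_assert(offsetof(example_state, block_num) == 0x20,
              "block_num is expected at offset 0x20 in this sketch");

// Address the CPU would touch when reading block_num through a pointer
// whose numeric value is `base`.
constexpr std::uintptr_t field_address(std::uintptr_t base) {
    return base + offsetof(example_state, block_num);
}
```

With `base` equal to 0 (a null pointer), `field_address` yields 0x20, matching the fault address lldb reports.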

brianjohnson5972 commented 5 years ago

I think your issue is that you are calling accept_transaction on your own thread and not on the main thread, where it is expected to be called. accept_transaction does all of its processing on the thread that calls it, so all of your transactions are processed on your thread while any blocks you produce or receive are produced or applied on the main thread. Your application worked fine as a separate process from nodeos because it let the http_plugin handle the incoming transactions on the main thread. To get this to work, you will need to either do something similar to txn_test_gen_plugin and drive your plugin using http_plugin input and a timer, or create a synchronized queue and service it on the main thread, calling accept_transaction from there.
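The queue option can be sketched with the standard library alone. This is a minimal illustration, not EOSIO code: `main_thread_queue`, `demo_result` and `run_demo` are hypothetical names, and the posted lambda only stands in for the place where a plugin would call accept_transaction. In nodeos itself, posting work to the application's io_service (assuming `app().get_io_service()` is available in your version) may serve the same purpose as a hand-written queue.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical helper: any thread may post work, one owner thread runs it.
class main_thread_queue {
public:
    // Safe to call from any thread.
    void post(std::function<void()> task) {
        assert(task && "empty task posted");
        std::lock_guard<std::mutex> lock(mutex_);
        tasks_.push(std::move(task));
    }

    // Call only from the owner thread. Runs every task queued so far
    // and returns how many were run.
    std::size_t drain() {
        std::queue<std::function<void()>> batch;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            std::swap(batch, tasks_);  // hold the lock only while swapping
        }
        const std::size_t count = batch.size();
        while (!batch.empty()) {
            batch.front()();
            batch.pop();
        }
        return count;
    }

private:
    std::mutex mutex_;
    std::queue<std::function<void()>> tasks_;
};

struct demo_result {
    std::size_t executed;  // number of tasks that ran
    bool all_on_owner;     // true if every task ran on the draining thread
};

// Worker threads (standing in for per-client TCP threads) only enqueue;
// the calling thread (standing in for the main thread) executes.
demo_result run_demo(std::size_t workers, std::size_t per_worker) {
    main_thread_queue q;
    const std::thread::id owner = std::this_thread::get_id();
    std::size_t executed = 0;  // modified only inside drain(), on the owner thread
    bool all_on_owner = true;

    std::vector<std::thread> pool;
    for (std::size_t w = 0; w < workers; ++w) {
        pool.emplace_back([&q, &executed, &all_on_owner, owner, per_worker] {
            for (std::size_t i = 0; i < per_worker; ++i) {
                // A real plugin would call accept_transaction inside this task.
                q.post([&executed, &all_on_owner, owner] {
                    ++executed;
                    if (std::this_thread::get_id() != owner) all_on_owner = false;
                });
            }
        });
    }
    for (auto& t : pool) t.join();
    q.drain();
    return {executed, all_on_owner};
}
```

In a long-running plugin the owner thread would call `drain()` repeatedly, for example from a timer callback, rather than once after the workers finish.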

taokayan commented 5 years ago

Agree with @brianjohnson5972, this is most likely a threading issue. Try doing it on the main thread. If you have further questions, please ask at https://eosio.stackexchange.com/

ghost commented 5 years ago

Thank you for your help and information provided