MinaProtocol / mina

Mina is a cryptocurrency protocol with a constant size blockchain, improving scaling while maintaining decentralization and security.
https://minaprotocol.com
Apache License 2.0

rpc_parallel workers fail to start, leading to ugly daemon crash #6001

Open lk86 opened 3 years ago

lk86 commented 3 years ago

1) Figure out why UPnP is not opening ports.
2) Make sure RPC messages, in whatever form this takes, no longer cause breakages/node crashes.

O1ahmad commented 3 years ago

Is it possible to attach the trace/debug logs leading up to the crash? (I'm also looking into reproducing it and collecting more error details.)

lk86 commented 3 years ago

Here's the fatal crash. If you need more complete logs, I'm sure you can mention the "EOF or connection closed" error in #mentor-nodes and you'll have plenty of volunteers who can reproduce it (including izzy on his laptop).

{"timestamp":"2020-09-17 04:00:58.539481Z","level":"Fatal","source":{"module":"Init__Coda_run","location":"File \"src/app/cli/src/init/coda_run.ml\", line 575, characters 2-26"},"message":"Unhandled top-level exception: $exn\nGenerating crash report","metadata":{"exn":"(monitor.ml.Error\n ((rpc_error (Connection_closed (\"EOF or connection closed\")))\n (connection_description\n (\"Client connected via TCP\" (vmi446914.contaboserver.net 35713)))\n (rpc_tag worker_init_rpc_1) (rpc_version 0))\n (\"Raised at file \\\"src/error.ml\\\" (inlined), line 9, characters 14-30\"\n \"Called from file \\\"src/or_error.ml\\\", line 72, characters 17-32\"\n \"Called from file \\\"src/deferred1.ml\\\", line 17, characters 40-45\"\n \"Called from file \\\"src/job_queue.ml\\\" (inlined), line 131, characters 2-5\"\n \"Called from file \\\"src/job_queue.ml\\\", line 171, characters 6-47\"))","pid":27233}}
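
For anyone collecting logs for this issue, a small script can pull the relevant fields out of a JSON-formatted daemon log. The following is only a sketch: the function name `summarize_fatal` is hypothetical, the sample is an abbreviated copy of the entry above, and the regular expressions assume the `exn` field keeps the S-expression shape seen in this thread.

```python
import json
import re


def summarize_fatal(line):
    """Return a short summary dict for a Fatal JSON log line, else None."""
    try:
        entry = json.loads(line)
    except json.JSONDecodeError:
        return None  # not a JSON log line (e.g. a human-readable log)
    if not isinstance(entry, dict) or entry.get("level") != "Fatal":
        return None
    exn = entry.get("metadata", {}).get("exn", "")
    # Assumption: the exception text contains "(rpc_tag NAME)" and
    # "(rpc_error (KIND ...))", as in the crash reported in this thread.
    tag = re.search(r"\(rpc_tag\s+([^\s()]+)\)", exn)
    err = re.search(r"\(rpc_error\s+\((\w+)", exn)
    return {
        "timestamp": entry.get("timestamp"),
        "rpc_tag": tag.group(1) if tag else None,
        "rpc_error": err.group(1) if err else None,
    }


# Abbreviated version of the log entry quoted above (backtrace omitted).
sample = json.dumps({
    "timestamp": "2020-09-17 04:00:58.539481Z",
    "level": "Fatal",
    "metadata": {
        "exn": '(monitor.ml.Error\n ((rpc_error (Connection_closed '
               '("EOF or connection closed")))\n '
               '(rpc_tag worker_init_rpc_1) (rpc_version 0)))'
    },
})

print(summarize_fatal(sample))
```

Running it over every line of a log file and keeping the non-`None` results would give a compact list of fatal RPC failures to attach here.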

O1ahmad commented 3 years ago

Cool, thanks and no doubt. Really trying to see how much debug/trace output we can get for visibility++.

But yeah, I mentioned it mostly for tracking purposes, to go along with this issue. It also seems easier to reference here.

jspada commented 3 years ago

This crash is impacting me also. I have not been able to join.

2020-09-18 10:50:53 UTC [Info] Ledger file $path does not exist
        path: "/home/x/.coda-config/genesis_ledger_accounts_e7fd308899bca6956b093b635988ea06630adbf540e13d38dcd2f7ab629da53c_f094d0eb814dbbff2f41f3f69c414331566f2c4a884c309944899ea59f101b9a.tar.gz"
2020-09-18 10:50:55 UTC [Info] Could not download genesis ledger from $uri: $error
        uri: "https://s3-us-west-2.amazonaws.com/snark-keys.o1test.net/genesis_ledger_accounts_e7fd308899bca6956b093b635988ea06630adbf540e13d38dcd2f7ab629da53c_f094d0eb814dbbff2f41f3f69c414331566f2c4a884c309944899ea59f101b9a.tar.gz"
        error: "(monitor.ml.Error\n (\"Process.run failed\"\n  ((prog curl)\n   (args\n    (--fail -o\n     /tmp/s3_cache_dir/genesis_ledger_accounts_e7fd308899bca6956b093b635988ea06630adbf540e13d38dcd2f7ab629da53c_f094d0eb814dbbff2f41f3f69c414331566f2c4a884c309944899ea59f101b9a.tar.gz\n     https://s3-us-west-2.amazonaws.com/snark-keys.o1test.net/genesis_ledger_accounts_e7fd308899bca6956b093b635988ea06630adbf540e13d38dcd2f7ab629da53c_f094d0eb814dbbff2f41f3f69c414331566f2c4a884c309944899ea59f101b9a.tar.gz))\n   (exit_status (Exit_non_zero 22)) (stdout \"\")\n   (stderr\n    (\"  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\"\n     \"                                 Dload  Upload   Total   Spent    Left  Speed\"\n     \"\\r  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0\\r  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0\\r  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0\\r  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0\"\n     \"curl: (22) The requested URL returned error: 404 Not Found\" \"\"))))\n (\"Raised at file \\\"src/error.ml\\\" (inlined), line 9, characters 14-30\"\n  \"Called from file \\\"src/or_error.ml\\\", line 72, characters 17-32\"\n  \"Called from file \\\"src/deferred1.ml\\\", line 17, characters 40-45\"\n  \"Called from file \\\"src/job_queue.ml\\\" (inlined), line 131, characters 2-5\"\n  \"Called from file \\\"src/job_queue.ml\\\", line 171, characters 6-47\"))"
2020-09-18 10:50:58 UTC [Info] Creating genesis ledger tar file for $root_hash at $path from database at $dir
        root_hash: "3NLDMBjXomomCVeMu54X8qXYZDuCzKM4QJaezgdUX57yHWqNgjmM"
        path: "/home/x/.coda-config/genesis_ledger_f1c1b1bae67e1ec8ffdf91121ec34c7c97933fd86b727ebd8a5421b7f12a8ece.tar.gz"
        dir: "/tmp/coda_cache_dir/48cb26dd-bd7a-3245-24f1-e08d2406e121"
2020-09-18 10:50:58 UTC [Info] Linking ledger file $tar_path to $named_tar_path
        tar_path: "/home/x/.coda-config/genesis_ledger_f1c1b1bae67e1ec8ffdf91121ec34c7c97933fd86b727ebd8a5421b7f12a8ece.tar.gz"
        named_tar_path: "/home/x/.coda-config/genesis_ledger_release_f094d0eb814dbbff2f41f3f69c414331566f2c4a884c309944899ea59f101b9a.tar.gz"
2020-09-18 10:50:59 UTC [Info] Genesis proof file $path does not exist
        path: "/home/x/.coda-config/genesis_proof_4e3894c8dc5e2d9a6485cb766d38ab2006e71977175e18abfffceecefd992c74"
2020-09-18 10:50:59 UTC [Info] Found genesis proof file at $path
        path: "/var/lib/coda/genesis_proof_4e3894c8dc5e2d9a6485cb766d38ab2006e71977175e18abfffceecefd992c74"
2020-09-18 10:50:59 UTC [Info] Loaded ledger from $ledger_file and genesis proof from $proof_file
        ledger_file: "/home/x/.coda-config/genesis_ledger_release_f094d0eb814dbbff2f41f3f69c414331566f2c4a884c309944899ea59f101b9a.tar.gz"
        proof_file: "/var/lib/coda/genesis_proof_4e3894c8dc5e2d9a6485cb766d38ab2006e71977175e18abfffceecefd992c74"
2020-09-18 10:51:00 UTC [Info] Setting current protocol version to "0.1.0" from compile-time config
2020-09-18 10:52:06 UTC [Fatal] Unhandled top-level exception: $exn
Generating crash report
        exn: "(monitor.ml.Error\n  ((rpc_error (Connection_closed (\"EOF or connection closed\")))\n    (connection_description (\"Client connected via TCP\" (kaga 45039)))\n    (rpc_tag worker_init_rpc_1) (rpc_version 0))\n  (\"Raised at file \\\"src/error.ml\\\" (inlined), line 9, characters 14-30\"\n    \"Called from file \\\"src/or_error.ml\\\", line 72, characters 17-32\"\n    \"Called from file \\\"src/deferred1.ml\\\", line 17, characters 40-45\"\n    \"Called from file \\\"src/job_queue.ml\\\" (inlined), line 131, characters 2-5\"\n    \"Called from file \\\"src/job_queue.ml\\\", line 171, characters 6-47\"))"

  ☠  Coda Daemon crashed.
   The Coda Protocol developers would like to know why!

I've tried with my firewall completely disabled (allowing incoming connections on all ports) and I've tried with and without the -external-ip option.

I've tried with Ubuntu 18.04 LTS and 20.04 LTS.

My machine has 12 cores and 16GB ram.

Very occasionally it doesn't crash but just hangs indefinitely after 'Setting current protocol version to "0.1.0" from compile-time config'.

In case it's relevant

ii  libminiupnpc17:amd64 2.1.20190824-0ubuntu2 amd64        UPnP IGD client lightweight library

Please see crash report attached.

Hope this helps.

coda_crash_report_2020-09-18_10-52-06.559671.tar.gz

jason-james commented 3 years ago

I had this issue too and tried everything that @jspada tried, but none of it worked. This was on Ubuntu 18.04 on a cloud server. I then tried Ubuntu 18.04 on a local VM via Hyper-V on my desktop and got a little further, but it crashed, probably due to memory issues.

After switching to Debian 9 on a cloud server from the same provider above, it's working fine. Running without error for ~14 hours.

8 cores 32gb ram

emberian commented 3 years ago

@lk86 This is unrelated to UPnP or the libp2p networking. The exception is being raised by the rpc_parallel stack that communicates with the prover/verifier.

emberian commented 3 years ago

The fact that switching from Ubuntu to Debian is an effective workaround leads me to believe that this is related to library linkage. I'm waiting on more logs from a testnet participant to confirm.

dheeraj07 commented 3 years ago

I had the same issue several times today and am still facing it. Machine type: 8 vCPUs and 32 GB RAM, running Ubuntu 18.04 (cloud VM).

bkase commented 3 years ago

> the exception is being raised by the rpc_parallel stack that communicates with the prover/verifier.

To follow up on what @emberian is saying, they also confirmed that the prover/verifier are currently the only places in our codebase using rpc_parallel and so it must be one of these two that is causing this failure.
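
Since only two workers are involved, one way to make the failure less opaque is to record which worker failed to start instead of letting the connection error escape as a top-level exception. The following is a language-neutral sketch in Python, not the daemon's actual OCaml code: `start_workers`, `start_prover`, and `start_verifier` are hypothetical names, and which worker fails in the example is arbitrary.

```python
def start_workers(starters):
    """Start each worker; collect failures by name instead of crashing.

    `starters` maps a worker name to a zero-argument callable that
    returns a handle or raises if the connection drops during startup.
    """
    handles, failures = {}, {}
    for name, start in starters.items():
        try:
            handles[name] = start()
        except (EOFError, ConnectionError) as exc:
            # Keep the worker name alongside the reason it failed.
            failures[name] = f"{type(exc).__name__}: {exc}"
    return handles, failures


# Hypothetical stand-ins for the two rpc_parallel users.
def start_prover():
    return "prover-handle"


def start_verifier():
    # Simulates the "EOF or connection closed" failure from this thread.
    raise ConnectionResetError("EOF or connection closed")


handles, failures = start_workers(
    {"prover": start_prover, "verifier": start_verifier}
)
for name, reason in failures.items():
    print(f"{name} worker failed to start: {reason}")
```

A report in this form would show directly whether the prover or the verifier is the one dropping the connection.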