bosagora / agora

POC Node implementation for CoinNet
https://bosagora.io
MIT License
37 stars, 23 forks

The process of getting missing block sigs is stalled #3269

Closed (linked0 closed 2 years ago)

linked0 commented 2 years ago

This error happens when I run a validator on a local machine or on an AWS instance.

After starting a validator from scratch, the first catch-up task succeeds, but the process of getting missing block signatures stalls in the second catch-up task: it never finishes, and the following log messages repeat.

2022-04-07 15:47:48,441 Warn [agora.network.Manager] - Could not find mapping in registry for key boa1xrjwgpgnpjuxzf452tja7ha3qhm83vz8hyq9kse8y5unewqtgkp2kptgvt9
2022-04-07 15:47:48,441 Warn [agora.network.Manager] - Could not find mapping in registry for key boa1xqp3sqa27jsygxpmt6ekpm393qjrf00whfadl40wauxrutg236yay3q48g0
2022-04-07 15:47:48,442 Warn [agora.network.Manager] - Could not find mapping in registry for key boa1xz3qed5979y0v0mjjsndh5lcjmyvp9hrcsu78qs4mjw8gugqnkxx25e3qfd
2022-04-07 15:47:49,248 Warn [agora.network.Manager] - Could not find mapping in registry for key boa1xrmjg8ezuw0qmzu9gd0aarqd9d2ad43v42m7vxwzuaczg54m93v6qnvcxse
2022-04-07 15:47:53,447 Info [agora.network.Manager] - Connection task limit reached (10/10). Will try again in 5 secs. 13 addresses in queue.
2022-04-07 15:47:53,447 Info [agora.network.Manager] - Pending connections: [agora://na-003.bosagora.io/, agora://eu-002.bosagora.io:3826/, agora://eu-002.bosagora.io/, agora://na-001.bosagora.io:4826/, agora://eu-003.bosagora.io:4826/, agora://na-002.bosagora.io:4826/, agora://na-002.bosagora.io:3826/, agora://na-003.bosagora.io:4826/, agora://na-002.bosagora.io/, agora://0.tcp.ngrok.io:19150/] - Waiting: ["agora://na-001.bosagora.io:3826/", "agora://na-004.bosagora.io:5826/", "agora://eu-003.bosagora.io:3826/", "agora://eu-002.bosagora.io:4826/", "agora://na-004.bosagora.io:3826/", "agora://na-003.bosagora.io:3826/", "agora://eu-004.bosagora.io:5826/", "agora://eu-004.bosagora.io:3826/", "agora://eu-005.bosagora.io:5826/", "agora://eu-005.bosagora.io:4826/", "agora://na-001.bosagora.io/", "agora://na-001.bosagora.io:5826/", "agora://eu-003.bosagora.io:5826/"]
linked0 commented 2 years ago

We call the getBlockHeaders API on peers while catching up on blocks: a NetworkClient calls the API on each of its connections, and each connection is in fact an RPCClient instance. A deadlock occurs at the line in RPCClient where it locks a connection from its connection pool.

omerfirmak commented 2 years ago

I have been working on an RPC rework, which got rid of the connection pool. Let's see if this still happens after we merge that.

linked0 commented 2 years ago

OK, thanks. I think the deadlock always happens on the 4th attempt on a connection, which is RPCConfig.concurrency + 1.
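
A stall at exactly concurrency + 1 is what you would expect if every call locks a pooled connection and the lock is never released. The Python sketch below illustrates that hypothetical mechanism only; it is not Agora's RPCClient code. The ConnectionPool class, the helper names, and the value CONCURRENCY = 3 are assumptions made for the example.

```python
import threading


class ConnectionPool:
    """Toy pool: at most `concurrency` connections can be locked at once."""

    def __init__(self, concurrency):
        self._slots = threading.BoundedSemaphore(concurrency)

    def lock(self, timeout):
        # Return False on timeout instead of blocking forever,
        # so the stall can be observed rather than hanging the script.
        return self._slots.acquire(timeout=timeout)

    def unlock(self):
        self._slots.release()


def leaky_calls(pool, attempts, timeout=0.1):
    """Lock a connection per call and never unlock it (hypothetical bug).

    Returns the 1-based index of the first attempt that stalls, or None.
    """
    for attempt in range(1, attempts + 1):
        if not pool.lock(timeout):
            return attempt
    return None


def careful_calls(pool, attempts, timeout=0.1):
    """Same loop, but every lock is released once the call is done."""
    for attempt in range(1, attempts + 1):
        if not pool.lock(timeout):
            return attempt
        try:
            pass  # the request would be sent here
        finally:
            pool.unlock()
    return None


CONCURRENCY = 3  # assumed value, chosen to match "the 4th attempt"

first_stalled = leaky_calls(ConnectionPool(CONCURRENCY), attempts=10)
never_stalled = careful_calls(ConnectionPool(CONCURRENCY), attempts=10)
print(first_stalled, never_stalled)
```

With a pool of three slots, the leaky loop first stalls on attempt 4, while the loop that releases in a `finally` block completes all ten attempts. Whether the real deadlock comes from an unreleased lock or from something else in the pool would need to be confirmed against the RPCClient source.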

linked0 commented 2 years ago

The PR could solve this issue. Blocks aren't being generated on the TestNet right now. Shouldn't we restart the Validators?

linked0 commented 2 years ago

Solved by PR #3276