ethereum / trinity

The Trinity client for the Ethereum network
https://trinity.ethereum.org
MIT License
472 stars 145 forks source link

`ethstats` plugin does not gracefully recover from a remote server restarting #68

Closed veox closed 5 years ago

veox commented 6 years ago

What is wrong?

On remote netstats (ethstats server) restarting, a trinity node (at least when run in LES light-client mode) that's been previously connected to it no longer appears in the list.

Run as (via systemd):

trinity --discv5 --network-id=3 --sync-mode=light --log-level=info \
    --ethstats \
    --ethstats-server-url="wss://ropsten-stats.parity.io/api" \
    --ethstats-server-secret="SECRET" \
    --ethstats-node-id="veox-trinity-dev-lightclient" \
    --ethstats-node-contact="veox"

INFO-level trace (for some reason, it's printed twice):

   ERROR  10-18 00:44:11        EthstatsClient  Unexpected error in <trinity.plugins.builtin.ethstats.ethstats_client.EthstatsClient object at 0x7fcf67305668>, exiting                                                                        
Traceback (most recent call last):                                                                        
  File "/home/veox/src/py-evm/.virtualenv/py-evm/lib/python3.7/site-packages/websockets/protocol.py", line 528, in transfer_data
    msg = yield from self.read_message()                                                                 
  File "/home/veox/src/py-evm/.virtualenv/py-evm/lib/python3.7/site-packages/websockets/protocol.py", line 580, in read_message
    frame = yield from self.read_data_frame(max_size=self.max_size)                                                       
  File "/home/veox/src/py-evm/.virtualenv/py-evm/lib/python3.7/site-packages/websockets/protocol.py", line 645, in read_data_frame
    frame = yield from self.read_frame(max_size)                                                                                                                                                                                              
  File "/home/veox/src/py-evm/.virtualenv/py-evm/lib/python3.7/site-packages/websockets/protocol.py", line 710, in read_frame
    extensions=self.extensions,                                                                                           
  File "/home/veox/src/py-evm/.virtualenv/py-evm/lib/python3.7/site-packages/websockets/framing.py", line 100, in read
    data = yield from reader(2)                                                                                                                                                                                                               
  File "/usr/lib64/python3.7/asyncio/streams.py", line 677, in readexactly                                
    raise IncompleteReadError(incomplete, n)                                                                              
asyncio.streams.IncompleteReadError: 0 bytes read on a total of 2 expected bytes                          
The above exception was the direct cause of the following exception:                                  
Traceback (most recent call last):                                                                                        
  File "/home/veox/src/py-evm/p2p/service.py", line 103, in run                                           
    await self._run()                                                                                    
  File "/home/veox/src/py-evm/trinity/plugins/builtin/ethstats/ethstats_client.py", line 52, in _run     
    self.recv_handler(),                                                                                                                                                                                                                      
  File "/home/veox/src/py-evm/p2p/cancellable.py", line 41, in wait_first                                 
    return await token_chain.cancellable_wait(*awaitables, timeout=timeout)                           
  File "/home/veox/src/py-evm/.virtualenv/py-evm/lib/python3.7/site-packages/cancel_token/token.py", line 152, in cancellable_wait
    return done.pop().result()                                                                                                                                                                                                                
  File "/home/veox/src/py-evm/trinity/plugins/builtin/ethstats/ethstats_client.py", line 58, in recv_handler
    json_string: str = await self.websocket.recv()                                                       
  File "/home/veox/src/py-evm/.virtualenv/py-evm/lib/python3.7/site-packages/websockets/protocol.py", line 350, in recv   
    yield from self.ensure_open()                                                                                         
  File "/home/veox/src/py-evm/.virtualenv/py-evm/lib/python3.7/site-packages/websockets/protocol.py", line 501, in ensure_open                                                                                                                
    self.close_code, self.close_reason) from self.transfer_data_exc                                       
websockets.exceptions.ConnectionClosed: WebSocket connection is closed: code = 1006 (connection closed abnormally [internal]), no reason

How can it be fixed

Haven't investigated.

pipermerriam commented 6 years ago

Without digging in at all, this looks like we need graceful reconnection upon losing the websocket connection (with some sort of backstop to limit how many attempts to reconnect will be made).

carver commented 6 years ago

with some sort of backstop to limit how many attempts to reconnect will be made

Or maybe an exponential increase in time between reconnect attempts, up to some maximum, like once every 24 hours. (which means the operator doesn't have to reboot if the stats server has been offline for a long time)

pipermerriam commented 6 years ago

@carver good call on not hard failing. Didn't think about the fact that an ethstats outage could then take all of the connected trinity nodes offline which is something we do not want.

cburgdorf commented 5 years ago

It looks like the EthstatsService does have code that is supposed to deal with lost connections and reconnect.

https://github.com/ethereum/trinity/blob/a20dce4ad65c6c973040cf62aafcca924aa1588c/trinity/plugins/builtin/ethstats/ethstats_service.py#L68-L89

However, it doesn't seem as if that's working as expected. I noticed this exception and the service not recovering (reconnecting) after it.

   ERROR  09-14 04:42:49        EthstatsClient  Unexpected error in <trinity.plugins.builtin.ethstats.ethstats_client.EthstatsClient object at 0x7f909ac8ecd0>, exiting
Traceback (most recent call last):
  File "/usr/src/app/trinity/p2p/service.py", line 118, in run
    await self._run()
  File "/usr/src/app/trinity/trinity/plugins/builtin/ethstats/ethstats_client.py", line 52, in _run
    self.recv_handler(),
  File "/usr/src/app/trinity/p2p/cancellable.py", line 43, in wait_first
    return await token_chain.cancellable_wait(*awaitables, timeout=timeout)
  File "/usr/local/lib/python3.7/site-packages/cancel_token/token.py", line 178, in cancellable_wait
    return done.pop().result()
  File "/usr/src/app/trinity/trinity/plugins/builtin/ethstats/ethstats_client.py", line 58, in recv_handler
    json_string: str = await self.websocket.recv()
  File "/usr/local/lib/python3.7/site-packages/websockets/protocol.py", line 352, in recv
    yield from self.ensure_open()
  File "/usr/local/lib/python3.7/site-packages/websockets/protocol.py", line 514, in ensure_open
    self.close_code, self.close_reason) from self.transfer_data_exc
websockets.exceptions.ConnectionClosed: WebSocket connection is closed: code = 1000 (OK), no reason
   DEBUG  09-14 04:42:49        EthstatsClient  <trinity.plugins.builtin.ethstats.ethstats_client.EthstatsClient object at 0x7f909ac8ecd0> halted cleanly

I haven't investigated but it looks like the exception doesn't reach the except that is supposed to catch it.

pipermerriam commented 5 years ago

I haven't investigated but it looks like the exception doesn't reach the except that is supposed to catch it.

Just an interesting tid-bit. The new TrioService would have worked correctly as the Manager.run() correctly raises service exceptions when it terminates.