IntersectMBO / cardano-node

The core component used to participate in the Cardano decentralised blockchain.
https://cardano.org
Apache License 2.0

`cardano-sl` -> proxy connection timeouts #296

Closed · deepfire closed this issue 4 years ago

deepfire commented 4 years ago

In a mixed cluster setup, cardano-sl nodes periodically time out, while connecting to the proxy:

machine# [   66.753075] sending MsgAnnounceBlockHeader OriginSender to NodeId 10.1.0.9:5555:0 failed with TransportError ConnectTimeout "Timed out" :: SomeException
machine# [  346.770420] sending MsgAnnounceBlockHeader OriginSender to NodeId 10.1.0.9:5555:0 failed with TransportError ConnectTimeout "Timed out" :: SomeException
machine# [  626.746899] sending MsgAnnounceBlockHeader OriginSender to NodeId 10.1.0.9:5555:0 failed with TransportError ConnectTimeout "Timed out" :: SomeException
machine# [  906.743814] sending MsgAnnounceBlockHeader OriginSender to NodeId 10.1.0.9:5555:0 failed with TransportError ConnectTimeout "Timed out" :: SomeException
machine# [ 1186.769384] sending MsgAnnounceBlockHeader OriginSender to NodeId 10.1.0.9:5555:0 failed with TransportError ConnectTimeout "Timed out" :: SomeException
machine# [ 1466.776517] sending MsgAnnounceBlockHeader OriginSender to NodeId 10.1.0.9:5555:0 failed with TransportError ConnectTimeout "Timed out" :: SomeException
machine# [ 1766.915519] sending MsgAnnounceBlockHeader OriginSender to NodeId 10.1.0.9:5555:0 failed with TransportError ConnectTimeout "Timed out" :: SomeException
machine# [ 1967.662612] sending MsgAnnounceBlockHeader OriginSender to NodeId 10.1.0.9:5555:0 failed with TransportError ConnectTimeout "Timed out" :: SomeException
machine# [ 2187.014293] sending MsgAnnounceBlockHeader OriginSender to NodeId 10.1.0.9:5555:0 failed with TransportError ConnectTimeout "Timed out" :: SomeException
machine# [ 2387.656346] sending MsgAnnounceBlockHeader OriginSender to NodeId 10.1.0.9:5555:0 failed with TransportError ConnectTimeout "Timed out" :: SomeException

This leads to loss of block propagation in the Legacy -> Rewrite direction, as can be seen from the block propagation chart:

deepfire@andromedae:~/cardano-node$ scripts/cluster-log.sh --limit 10 --cluster 10.1.0.%s 1 9 cluster.log headers-seen
ip\slot     0      1      2      3      4      5      6      7      8      9      10     11     12     13     14
10.1.0.1    26.0   32.7   52.2   72.7   92.6   ---    132.3  152.7  172.5  ---    ---    232.5  ---    272.3 
10.1.0.2    25.7   32.7   51.8   72.1   92.5   111.8  132.5  152.3  172.5  191.8 
10.1.0.3    25.0   32.9   52.1   72.9   92.3   ---    132.6  152.5  172.6  ---    ---    232.3  ---    272.7 
10.1.0.4    24.4   32.5   52.3   72.6   92.5   ---    132.5  152.5  172.3  ---    ---    232.6  ---    272.5 
10.1.0.5    24.5   33.6   50.6   70.8   273.9  110.7  273.9  273.9  273.9  190.7 
10.1.0.6    23.9   33.7   50.5   70.4   273.8  110.7  273.8  273.8  273.8  190.6 
10.1.0.7    23.9   33.6   50.6   70.6   273.9  110.6  273.9  273.9  273.9  190.6 

10.1.0.9    23.5   34.4   51.7   71.9   ---    111.7  ---    ---    ---    191.7  212.7  ---    251.6  276.0  295.3 
deepfire commented 4 years ago

Looking at the tcpdump logs, filtered by egrep 'Flags (\[S\]|\[S.\]|\[F\.)' cluster.log, we can glean the following sequence:

machine# [   53.428666] proxy-tcpdump[944]: 01:47:02.648093 IP proxy.cardano.personal-agent > proxy.cardano.45844: Flags [F.], seq 824, ack 2059, win 381, options [nop,nop,TS val 262494091 ecr 262494090], length 0
machine# [   53.454622] proxy-tcpdump[944]: 01:47:21.457368 IP proxy.cardano.45856 > proxy.cardano.personal-agent: Flags [S], seq 1608151934, win 43690, options [mss 65495,sackOK,TS val 262512900 ecr 0,nop,wscale 7], length 0
machine# [  273.523524] proxy-tcpdump[944]: 01:47:21.457399 IP proxy.cardano.personal-agent > proxy.cardano.45856: Flags [S.], seq 2369668399, ack 1608151935, win 43690, options [mss 65495,sackOK,TS val 262512900 ecr 262512900,nop,wscale 7], length 0
machine# [  273.581701] proxy-tcpdump[944]: 01:47:21.468660 IP proxy.cardano.personal-agent > proxy.cardano.45856: Flags [F.], seq 5, ack 32, win 342, options [nop,nop,TS val 262512911 ecr 262512911], length 0
  1. At 53.428666 an existing live connection is closed.
  2. At 53.454622 another connection is attempted, but it never sees a reply TCP packet.
  3. At 273.523524 another attempt occurs and does see a reply -- this correlates with the slot-4 propagation times on the Rewrite side.
deepfire commented 4 years ago

Ok, an interesting thing to note:

  1. The Legacy node never retries its initial TCP SYN packet -- which is not normal behaviour, since SYN retries are supposed to happen (with exponential backoff on Linux). Let's see what's in /proc/sys/net/ipv4/tcp_syn_retries.

This gives rise to a hypothesis:

  1. The proxy isn't quick enough to re-establish its listening port,
  2. the proxy doesn't send the TCP RST packet for some reason (e.g. it's filtered out), and
  3. the Legacy node is configured not to retry the TCP SYN.

UPDATE: /proc/sys/net/ipv4/tcp_syn_retries is 6, so retries should happen -- but they don't.

Weird!
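
For reference, checking the setting programmatically is trivial; this one-liner sketch is just the equivalent of cat /proc/sys/net/ipv4/tcp_syn_retries:

```haskell
-- Print the kernel's SYN-retry budget. A value of 6 means roughly two
-- minutes of exponential backoff before connect() gives up on Linux.
main :: IO ()
main = readFile "/proc/sys/net/ipv4/tcp_syn_retries" >>= putStr
```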

deepfire commented 4 years ago

Ok, after eliminating VDE from the picture, and re-seating everything over plain old loopback:

  1. The "missing" packets are no longer missing -- all connections are properly initiated and closed at the TCP level. I guess the routing was just non-deterministic with all those interfaces.
  2. The connection timeout exception is still there.
deepfire commented 4 years ago

So the exception could be coming from more than one call site in network-transport-tcp.

Pushed https://github.com/deepfire/network-transport-tcp/commit/b92cfe7c581a60c599ff40e14e417fa1eb8422d4 to disambiguate between them.
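
The disambiguation presumably amounts to giving each candidate call site its own message, so the exception text identifies which one fired -- compare the later "findRemoteEndPoint: timed out" log line. A sketch with illustrative names:

```haskell
import Control.Exception (ErrorCall (..), throwIO)

-- Illustrative only: tag each timeout site with a distinct message so
-- the log reveals which call site produced the exception.
timeoutAt :: String -> IO a
timeoutAt site = throwIO (ErrorCall (site ++ ": timed out"))
```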

UPDATE:

Ok, it's https://github.com/avieth/network-transport-tcp/blob/master/src/Network/Transport/TCP.hs#L1583

..and adding HasCallStack further provides:

machine# [   65.639138] na00m9p0s4n1vpadjmhg1jppqfh1iz5i-unit-script-cardano-node-legacy_-start[852]: CallStack (from HasCallStack):
machine# [   65.646007] na00m9p0s4n1vpadjmhg1jppqfh1iz5i-unit-script-cardano-node-legacy_-start[852]:   readMVarTimeout, called at src/Network/Transport/TCP.hs:1916:19 in network-transport-tcp-0.6.0-92TSEsn6W8qLlAnYBxd8Bu:Network.Transport.TCP

..that is https://github.com/deepfire/network-transport-tcp/blob/8c6cc9f99e3896a3e2df6be6d99f5cc4c8b395a5/src/Network/Transport/TCP.hs#L1916
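
For reference, this is roughly what a HasCallStack-annotated helper of this shape looks like; the real readMVarTimeout lives in the fork linked above, and this body is only a sketch of the mechanism:

```haskell
import Control.Concurrent (threadDelay)
import Control.Concurrent.MVar (MVar, tryReadMVar)
import GHC.Stack (HasCallStack, callStack, prettyCallStack)

-- Sketch: poll an MVar until a budget of microseconds runs out; on
-- failure, report the caller's location captured by HasCallStack,
-- producing the "readMVarTimeout, called at ..." line seen above.
readMVarTimeout :: HasCallStack => Int -> MVar a -> IO (Maybe a)
readMVarTimeout usecs mv
  | usecs <= 0 = do
      putStrLn (prettyCallStack callStack)
      pure Nothing
  | otherwise = do
      r <- tryReadMVar mv
      case r of
        Just a  -> pure (Just a)
        Nothing -> threadDelay 1000 >> readMVarTimeout (usecs - 1000) mv
```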

deepfire commented 4 years ago

@avieth provided some context:

  1. If we know that the resolved MVar is what causes the timeout (https://github.com/avieth/network-transport-tcp/blob/master/src/Network/Transport/TCP.hs#L1551) -- which I established previously --
  2. then, on the connection initiator side, it is setupRemoteEndPoint that fills that MVar -- in https://github.com/avieth/network-transport-tcp/blob/master/src/Network/Transport/TCP.hs#L1256

So the next logical step for me was to trace down where exactly in setupRemoteEndPoint we are getting stuck.
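
The numbered lines in the log below come from breadcrumb instrumentation of roughly this shape (a sketch; the actual patch is the fork commit linked below):

```haskell
import Debug.Trace (traceIO)

-- Hypothetical breadcrumb helper: print "N: <- doneStep -> nextStep"
-- between steps of setupRemoteEndPoint, so a hang localises to the last
-- breadcrumb that made it into the log.
stage :: Int -> String -> String -> IO ()
stage n done next = traceIO (show n ++ ": <- " ++ done ++ " -> " ++ next)
```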

A log obtained with https://github.com/deepfire/network-transport-tcp/blob/f99a715cdfe63e0bb59818840f60af75ce8c8290/src/Network/Transport/TCP.hs#L2081 seems to suggest that socketToEndPoint runs to completion:

machine# [   92.814698] v74fi7n4m5ykqvim9k9a8drmys4y3bh7-unit-script-cardano-node-legacy_-start[863]: -1: setupRemoteEndPoint -> socketToEndpoint
machine# [   92.816647] v74fi7n4m5ykqvim9k9a8drmys4y3bh7-unit-script-cardano-node-legacy_-start[863]: -1: setupRemoteEndPoint -> socketToEndpoint
machine# [   92.818643] v74fi7n4m5ykqvim9k9a8drmys4y3bh7-unit-script-cardano-node-legacy_-start[863]: -1: setupRemoteEndPoint -> socketToEndpoint
machine# [   92.850125] v74fi7n4m5ykqvim9k9a8drmys4y3bh7-unit-script-cardano-node-legacy_-start[863]: 3: <- setSocketOption -> connect "127.1.0.1":"3001"
machine# [   92.852109] v74fi7n4m5ykqvim9k9a8drmys4y3bh7-unit-script-cardano-node-legacy_-start[863]: 3: <- setSocketOption -> connect "127.1.0.4":"3004"
machine# [   92.854131] v74fi7n4m5ykqvim9k9a8drmys4y3bh7-unit-script-cardano-node-legacy_-start[863]: 3: <- setSocketOption -> connect "127.1.0.9":"5555"
machine# [   92.862492] v74fi7n4m5ykqvim9k9a8drmys4y3bh7-unit-script-cardano-node-legacy_-start[863]: 4: <- connect -> sendMany "127.1.0.1":"3001"
machine# [   92.864423] v74fi7n4m5ykqvim9k9a8drmys4y3bh7-unit-script-cardano-node-legacy_-start[863]: 4: <- connect -> sendMany "127.1.0.4":"3004"
machine# [   92.868066] v74fi7n4m5ykqvim9k9a8drmys4y3bh7-unit-script-cardano-node-legacy_-start[863]: 4: <- connect -> sendMany "127.1.0.9":"5555"
machine# [   93.177586] v74fi7n4m5ykqvim9k9a8drmys4y3bh7-unit-script-cardano-node-legacy_-start[863]: 5: <- sendMany -> recvWord32 "127.1.0.1":"3001"
machine# [   93.179541] v74fi7n4m5ykqvim9k9a8drmys4y3bh7-unit-script-cardano-node-legacy_-start[863]: 5: <- sendMany -> recvWord32 "127.1.0.4":"3004"
machine# [   93.183533] v74fi7n4m5ykqvim9k9a8drmys4y3bh7-unit-script-cardano-node-legacy_-start[863]: 5: <- sendMany -> recvWord32 "127.1.0.9":"5555"
machine# [   93.188808] v74fi7n4m5ykqvim9k9a8drmys4y3bh7-unit-script-cardano-node-legacy_-start[863]: 6: <- recvWord32 "127.1.0.9":"5555"
machine# [   93.190606] v74fi7n4m5ykqvim9k9a8drmys4y3bh7-unit-script-cardano-node-legacy_-start[863]: [cardano-sl.node:Debug:ThreadId 139] [2001-09-09 01:48:02.27 UTC] 7: <- socketToEndpoint 127.1.0.9:5555:0
machine# [   93.447877] v74fi7n4m5ykqvim9k9a8drmys4y3bh7-unit-script-cardano-node-legacy_-start[863]: 6: <- recvWord32 "127.1.0.1":"3001"
machine# [   93.449732] v74fi7n4m5ykqvim9k9a8drmys4y3bh7-unit-script-cardano-node-legacy_-start[863]: [cardano-sl.node:Debug:ThreadId 139] [2001-09-09 01:48:02.28 UTC] Waiting on the block queue or recovery header var7: <- socketToEndpoint 127.1.0.1:3001:0
machine# [   93.455783] v74fi7n4m5ykqvim9k9a8drmys4y3bh7-unit-script-cardano-node-legacy_-start[863]: 6: <- recvWord32 "127.1.0.4":"3004"
machine# [   93.457690] v74fi7n4m5ykqvim9k9a8drmys4y3bh7-unit-script-cardano-node-legacy_-start[863]: [cardano-sl.diffusion.outboundqueue.c-a-2:Debug:ThreadId 928] [2001-09-09 01:48:02.28 UTC] sending MsgAnnounceBlockHeader OriginSender to NodeId 127.1.0.4:3004:07: <- socketToEndpoint 127.1.0.4:3004:0
machine# [  107.469733] v74fi7n4m5ykqvim9k9a8drmys4y3bh7-unit-script-cardano-node-legacy_-start[863]: [cardano-sl.diffusion.outboundqueue.c-a-2:Warning:ThreadId 932] [2001-09-09 01:48:17.28 UTC] sending MsgAnnounceBlockHeader OriginSender to NodeId 127.1.0.9:5555:0 failed with TransportError ConnectTimeout "findRemoteEndPoint:  timed out" :: SomeException

(The above was obtained using scripts/cluster-log.sh --limit 40 --ip 127.1.0.2 cluster.log cat | grep 'connect\|socketToEndpoint\|recvWord32\|timed out' | less)

So that log is insufficient; we need more detail.

deepfire commented 4 years ago

Reproduction details:

deepfire commented 4 years ago

The topology used on the Byron side is: https://github.com/input-output-hk/cardano-node/blob/serge/mainnet-ci/configuration/mainnet-ci/topology-mainnet-ci.yaml -- it's set in https://github.com/input-output-hk/cardano-node/blob/serge/mainnet-ci/nix/nixos/cardano-cluster-service.nix#L37

With that, we get the block propagation below -- and you can see from the log in the previous message that the timeout exception happens at the 107th second:

deepfire@andromedae:~/cardano-node$ scripts/cluster-log.sh --limit 10 --cluster 10.1.0.%s 1 9 cluster.log headers-seen
ip\slot     0      1      2      3      4      5      6      7      8      9      10     11     12     13     14
10.1.0.1    26.0   32.7   52.2   72.7   92.6   ---    132.3  152.7  172.5  ---    ---    232.5  ---    272.3 
10.1.0.2    25.7   32.7   51.8   72.1   92.5   111.8  132.5  152.3  172.5  191.8 
10.1.0.3    25.0   32.9   52.1   72.9   92.3   ---    132.6  152.5  172.6  ---    ---    232.3  ---    272.7 
10.1.0.4    24.4   32.5   52.3   72.6   92.5   ---    132.5  152.5  172.3  ---    ---    232.6  ---    272.5 
10.1.0.5    24.5   33.6   50.6   70.8   273.9  110.7  273.9  273.9  273.9  190.7 
10.1.0.6    23.9   33.7   50.5   70.4   273.8  110.7  273.8  273.8  273.8  190.6 
10.1.0.7    23.9   33.6   50.6   70.6   273.9  110.6  273.9  273.9  273.9  190.6 

10.1.0.9    23.5   34.4   51.7   71.9   ---    111.7  ---    ---    ---    191.7  212.7  ---    251.6  276.0  295.3 
avieth commented 4 years ago

I believe I've figured out why this is happening. How best to fix it is up for debate.

Spoiler: it's caused by the fact that multiple endpoint addresses can identify the same endpoint (because of DNS names). One way to fix it is to remove the --address CLI argument from cardano-byron-proxy.

The SL node c-a-2 and the Byron proxy (the node named proxy) form the single link between the Shelley and Byron clusters, i.e. the only Byron node which communicates with proxy is c-a-2. We noticed that a network-transport-tcp timeout would occur sometimes. On closer inspection, it would occur only at c-a-2 (the initiator) when contacting proxy, and further, only when proxy was also contacting c-a-2 around that same time.

Two notes about the design of network-transport-tcp: lightweight connections between a pair of endpoints are multiplexed over a single heavyweight connection (a TCP socket), and simultaneous ("crossed") connection attempts are resolved by a deterministic rule that decides which socket stays.

When c-a-2 and proxy each get a new block around the same time, they will announce its header to one another. Since there are no long-running network-transport-tcp lightweight connections (like a subscription) between these two, this brings up new heavyweight connections (TCP sockets), which can and sometimes do become crossed. In normal network-transport-tcp operation this is no problem: the protocol will work out which socket should stay and which should go.

However, in this case we have proxy claiming proxy.cardano:5555:0 as its endpoint address, c-a-2 claiming 127.1.0.2:3002:0, and c-a-2 connecting not to proxy.cardano:5555:0 but to the numeric address that reaches it, 127.1.0.9:5555:0!

proxy will make the connection no problem. It will create its own socket, and when it sees the crossed connection from 127.1.0.2:3002:0, it will check that indeed "proxy.cardano:5555:0" > "127.1.0.2:3002:0" and therefore will not respond on that socket (it will use the one that it created earlier).

On the other hand, c-a-2 does not connect to proxy.cardano:5555:0, but rather to its true identity 127.1.0.9:5555:0! Crucially, it does not judge this to be a crossed connection, because it doesn't know that these two strings identify the same endpoint. And so it times out waiting for a response, because proxy has thrown the socket away under the crossed-connection rule.
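
To make the failure mode concrete, here is the tie-break reduced to a sketch (EndPointAddress is simplified; in network-transport it wraps a ByteString):

```haskell
newtype EndPointAddress = EndPointAddress String
  deriving (Eq, Ord, Show)

-- The crossed-connection tie-break described above: the side whose
-- address compares greater keeps the socket it created and drops the
-- incoming one.
keepOwnSocket :: EndPointAddress -> EndPointAddress -> Bool
keepOwnSocket ours theirs = ours > theirs

-- The failure mode in this issue: proxy evaluates
--   "proxy.cardano:5555:0" > "127.1.0.2:3002:0"   (True)
-- and stops responding on c-a-2's socket; but c-a-2 dialled
-- "127.1.0.9:5555:0", which it cannot match against
-- "proxy.cardano:5555:0", so it never detects the cross and simply
-- waits until it times out.
```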

avieth commented 4 years ago

@deepfire I recommend never disabling the peer host check. Instead, use this revision of network-transport-tcp, which binds connecting sockets to <IPv4host>:0. https://github.com/avieth/network-transport-tcp/commit/46f2942423d9ad3c81040565a9e9d5b9e08ddcc4
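
For context, binding the connecting socket before connect looks roughly like this with the network package (a sketch of the approach, not the commit's actual code):

```haskell
import Network.Socket

-- Sketch: bind the outgoing socket to a fixed local IPv4 address with
-- an ephemeral port ("0") before connecting, so the source address the
-- peer sees is consistent with the host claimed in our EndPointAddress.
connectFrom :: HostName -> HostName -> ServiceName -> IO Socket
connectFrom localHost remoteHost remotePort = do
  let hints = defaultHints { addrSocketType = Stream }
  local:_  <- getAddrInfo (Just hints) (Just localHost)  (Just "0")
  remote:_ <- getAddrInfo (Just hints) (Just remoteHost) (Just remotePort)
  sock <- socket (addrFamily remote) (addrSocketType remote) (addrProtocol remote)
  bind sock (addrAddress local)     -- <IPv4host>:0 -- the OS picks the port
  connect sock (addrAddress remote)
  pure sock
```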

deepfire commented 4 years ago

The issue was happily resolved.