deepfire closed this issue 4 years ago.
Looking at the tcpdump logs, filtered with `egrep 'Flags (\[S\]|\[S.\]|\[F\.\])' cluster.log`, we can glean the following sequence:
machine# [ 53.428666] proxy-tcpdump[944]: 01:47:02.648093 IP proxy.cardano.personal-agent > proxy.cardano.45844: Flags [F.], seq 824, ack 2059, win 381, options [nop,nop,TS val 262494091 ecr 262494090], length 0
machine# [ 53.454622] proxy-tcpdump[944]: 01:47:21.457368 IP proxy.cardano.45856 > proxy.cardano.personal-agent: Flags [S], seq 1608151934, win 43690, options [mss 65495,sackOK,TS val 262512900 ecr 0,nop,wscale 7], length 0
machine# [ 273.523524] proxy-tcpdump[944]: 01:47:21.457399 IP proxy.cardano.personal-agent > proxy.cardano.45856: Flags [S.], seq 2369668399, ack 1608151935, win 43690, options [mss 65495,sackOK,TS val 262512900 ecr 262512900,nop,wscale 7], length 0
machine# [ 273.581701] proxy-tcpdump[944]: 01:47:21.468660 IP proxy.cardano.personal-agent > proxy.cardano.45856: Flags [F.], seq 5, ack 32, win 342, options [nop,nop,TS val 262512911 ecr 262512911], length 0
Breaking that down:
- 53.428666: an existing live connection is closed
- 53.454622: another one is attempted, but never sees a reply TCP packet
- 273.523524: another attempt occurs, and sees a reply -- which correlates with the slot 4 propagation times on the Rewrite side

Ok, an interesting thing to note: `/proc/sys/net/ipv4/tcp_syn_retries`. This gives rise to a hypothesis:

UPDATE: `/proc/sys/net/ipv4/tcp_syn_retries` is 6, so retries should happen -- but don't. Weird!
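For context on why the absence of retries is so strange: with `tcp_syn_retries = 6`, an unanswered SYN should be retransmitted with exponential backoff. A small sketch of the expected retry schedule, assuming the default initial RTO of 1 second and pure doubling (which holds for a SYN that never gets a reply):

```python
# Sketch of Linux SYN retransmission timing, assuming the default
# initial retransmission timeout (RTO) of 1 second. Each retry
# doubles the previous wait; there is no RTT feedback for an
# unanswered SYN.
def syn_retry_offsets(tcp_syn_retries, initial_rto=1.0):
    """Seconds after the first SYN at which each retransmission fires."""
    offsets, elapsed, rto = [], 0.0, initial_rto
    for _ in range(tcp_syn_retries):
        elapsed += rto
        offsets.append(elapsed)
        rto *= 2  # exponential backoff
    return offsets

# With tcp_syn_retries = 6, the retries land at 1, 3, 7, 15, 31 and
# 63 seconds after the initial SYN, so connect() only gives up after
# roughly two minutes -- yet the capture shows no retransmissions at all.
print(syn_retry_offsets(6))
```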
Ok, after eliminating VDE from the picture, and re-seating everything over plain old loopback:
So the exception is either:
Pushed https://github.com/deepfire/network-transport-tcp/commit/b92cfe7c581a60c599ff40e14e417fa1eb8422d4 to disambiguate.
UPDATE: Ok, it's https://github.com/avieth/network-transport-tcp/blob/master/src/Network/Transport/TCP.hs#L1583 -- and adding `HasCallStack` further provides:
machine# [ 65.639138] na00m9p0s4n1vpadjmhg1jppqfh1iz5i-unit-script-cardano-node-legacy_-start[852]: CallStack (from HasCallStack):
machine# [ 65.646007] na00m9p0s4n1vpadjmhg1jppqfh1iz5i-unit-script-cardano-node-legacy_-start[852]: readMVarTimeout, called at src/Network/Transport/TCP.hs:1916:19 in network-transport-tcp-0.6.0-92TSEsn6W8qLlAnYBxd8Bu:Network.Transport.TCP
@avieth provided some context:
- the `resolved` MVar is what causes the timeout (https://github.com/avieth/network-transport-tcp/blob/master/src/Network/Transport/TCP.hs#L1551) -- which I had established previously
- it is `setupRemoteEndPoint` that toggles that MVar -- in https://github.com/avieth/network-transport-tcp/blob/master/src/Network/Transport/TCP.hs#L1256

So the next logical step for me was to trace down where exactly in `setupRemoteEndPoint` we are getting stuck.
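The `readMVarTimeout` call in the stack trace blocks on that `resolved` MVar with a deadline. A minimal Python analogue of this one-shot wait-with-timeout pattern (illustrative only -- class and method names are mine, not the library's):

```python
import threading

# Toy analogue of waiting on a write-once MVar with a timeout: the
# connecting side blocks until the remote endpoint is "resolved", or
# gives up -- which is the ConnectTimeout path seen in the logs.
class ResolvedVar:
    def __init__(self):
        self._event = threading.Event()
        self._value = None

    def put(self, value):
        """Fill the variable and wake every waiting reader."""
        self._value = value
        self._event.set()

    def read_timeout(self, timeout):
        """Return the value, or raise if nobody fills it in time."""
        if not self._event.wait(timeout):
            raise TimeoutError("findRemoteEndPoint: timed out")
        return self._value

var = ResolvedVar()
# Simulate the remote endpoint resolving shortly after we start waiting.
threading.Timer(0.05, var.put, args=("endpoint-ready",)).start()
print(var.read_timeout(1.0))
```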
A log obtained with https://github.com/deepfire/network-transport-tcp/blob/f99a715cdfe63e0bb59818840f60af75ce8c8290/src/Network/Transport/TCP.hs#L2081 seems to suggest that `socketToEndPoint` progresses completely:
machine# [ 92.814698] v74fi7n4m5ykqvim9k9a8drmys4y3bh7-unit-script-cardano-node-legacy_-start[863]: -1: setupRemoteEndPoint -> socketToEndpoint
machine# [ 92.816647] v74fi7n4m5ykqvim9k9a8drmys4y3bh7-unit-script-cardano-node-legacy_-start[863]: -1: setupRemoteEndPoint -> socketToEndpoint
machine# [ 92.818643] v74fi7n4m5ykqvim9k9a8drmys4y3bh7-unit-script-cardano-node-legacy_-start[863]: -1: setupRemoteEndPoint -> socketToEndpoint
machine# [ 92.850125] v74fi7n4m5ykqvim9k9a8drmys4y3bh7-unit-script-cardano-node-legacy_-start[863]: 3: <- setSocketOption -> connect "127.1.0.1":"3001"
machine# [ 92.852109] v74fi7n4m5ykqvim9k9a8drmys4y3bh7-unit-script-cardano-node-legacy_-start[863]: 3: <- setSocketOption -> connect "127.1.0.4":"3004"
machine# [ 92.854131] v74fi7n4m5ykqvim9k9a8drmys4y3bh7-unit-script-cardano-node-legacy_-start[863]: 3: <- setSocketOption -> connect "127.1.0.9":"5555"
machine# [ 92.862492] v74fi7n4m5ykqvim9k9a8drmys4y3bh7-unit-script-cardano-node-legacy_-start[863]: 4: <- connect -> sendMany "127.1.0.1":"3001"
machine# [ 92.864423] v74fi7n4m5ykqvim9k9a8drmys4y3bh7-unit-script-cardano-node-legacy_-start[863]: 4: <- connect -> sendMany "127.1.0.4":"3004"
machine# [ 92.868066] v74fi7n4m5ykqvim9k9a8drmys4y3bh7-unit-script-cardano-node-legacy_-start[863]: 4: <- connect -> sendMany "127.1.0.9":"5555"
machine# [ 93.177586] v74fi7n4m5ykqvim9k9a8drmys4y3bh7-unit-script-cardano-node-legacy_-start[863]: 5: <- sendMany -> recvWord32 "127.1.0.1":"3001"
machine# [ 93.179541] v74fi7n4m5ykqvim9k9a8drmys4y3bh7-unit-script-cardano-node-legacy_-start[863]: 5: <- sendMany -> recvWord32 "127.1.0.4":"3004"
machine# [ 93.183533] v74fi7n4m5ykqvim9k9a8drmys4y3bh7-unit-script-cardano-node-legacy_-start[863]: 5: <- sendMany -> recvWord32 "127.1.0.9":"5555"
machine# [ 93.188808] v74fi7n4m5ykqvim9k9a8drmys4y3bh7-unit-script-cardano-node-legacy_-start[863]: 6: <- recvWord32 "127.1.0.9":"5555"
machine# [ 93.190606] v74fi7n4m5ykqvim9k9a8drmys4y3bh7-unit-script-cardano-node-legacy_-start[863]: [cardano-sl.node:Debug:ThreadId 139] [2001-09-09 01:48:02.27 UTC] 7: <- socketToEndpoint 127.1.0.9:5555:0
machine# [ 93.447877] v74fi7n4m5ykqvim9k9a8drmys4y3bh7-unit-script-cardano-node-legacy_-start[863]: 6: <- recvWord32 "127.1.0.1":"3001"
machine# [ 93.449732] v74fi7n4m5ykqvim9k9a8drmys4y3bh7-unit-script-cardano-node-legacy_-start[863]: [cardano-sl.node:Debug:ThreadId 139] [2001-09-09 01:48:02.28 UTC] Waiting on the block queue or recovery header var7: <- socketToEndpoint 127.1.0.1:3001:0
machine# [ 93.455783] v74fi7n4m5ykqvim9k9a8drmys4y3bh7-unit-script-cardano-node-legacy_-start[863]: 6: <- recvWord32 "127.1.0.4":"3004"
machine# [ 93.457690] v74fi7n4m5ykqvim9k9a8drmys4y3bh7-unit-script-cardano-node-legacy_-start[863]: [cardano-sl.diffusion.outboundqueue.c-a-2:Debug:ThreadId 928] [2001-09-09 01:48:02.28 UTC] sending MsgAnnounceBlockHeader OriginSender to NodeId 127.1.0.4:3004:07: <- socketToEndpoint 127.1.0.4:3004:0
machine# [ 107.469733] v74fi7n4m5ykqvim9k9a8drmys4y3bh7-unit-script-cardano-node-legacy_-start[863]: [cardano-sl.diffusion.outboundqueue.c-a-2:Warning:ThreadId 932] [2001-09-09 01:48:17.28 UTC] sending MsgAnnounceBlockHeader OriginSender to NodeId 127.1.0.9:5555:0 failed with TransportError ConnectTimeout "findRemoteEndPoint: timed out" :: SomeException
(The above was obtained using `scripts/cluster-log.sh --limit 40 --ip 127.1.0.2 cluster.log cat | grep 'connect\|socketToEndpoint\|recvWord32\|timed out' | less`.)

So that log seems insufficient; we need more.
Reproduction details:
- cardano-node at commit 0714765c31b506956feb0ebff2598cdef36b29cb
- cardano-sl checked out beside cardano-node, at commit d5911d21eb2ce0d1e619c0321e7de630d56258b6 (tip of https://github.com/input-output-hk/cardano-sl/tree/serge/trace-network-transport-tcp)
- cardano-byron-proxy checked out beside cardano-node, at commit a25757473612b0d4cb8fd23077ef51bd3b4a8215 (tip of https://github.com/input-output-hk/cardano-byron-proxy/tree/avieth/protocol_version)
- run from the cardano-node checkout with `./scripts/cluster-test.sh --cls` -- which puts the logs in a new log file for each run (symlinked to `cluster.log` for the latest run)

The topology used on the Byron side is https://github.com/input-output-hk/cardano-node/blob/serge/mainnet-ci/configuration/mainnet-ci/topology-mainnet-ci.yaml -- it's set in https://github.com/input-output-hk/cardano-node/blob/serge/mainnet-ci/nix/nixos/cardano-cluster-service.nix#L37
With that, we have the following block propagation -- and you can see from the log in the previous message that the timeout exception happens at the 107th second:
deepfire@andromedae:~/cardano-node$ scripts/cluster-log.sh --limit 10 --cluster 10.1.0.%s 1 9 cluster.log headers-seen
ip\slot 0 1 2 3 4 5 6 7 8 9 10
10.1.0.1 26.0 32.7 52.2 72.7 92.6 --- 132.3 152.7 172.5 --- --- 232.5 --- 272.3
10.1.0.2 25.7 32.7 51.8 72.1 92.5 111.8 132.5 152.3 172.5 191.8
10.1.0.3 25.0 32.9 52.1 72.9 92.3 --- 132.6 152.5 172.6 --- --- 232.3 --- 272.7
10.1.0.4 24.4 32.5 52.3 72.6 92.5 --- 132.5 152.5 172.3 --- --- 232.6 --- 272.5
10.1.0.5 24.5 33.6 50.6 70.8 273.9 110.7 273.9 273.9 273.9 190.7
10.1.0.6 23.9 33.7 50.5 70.4 273.8 110.7 273.8 273.8 273.8 190.6
10.1.0.7 23.9 33.6 50.6 70.6 273.9 110.6 273.9 273.9 273.9 190.6
10.1.0.9 23.5 34.4 51.7 71.9 --- 111.7 --- --- --- 191.7 212.7 --- 251.6 276.0 295.3
I believe I've figured out why this is happening. How best to fix it is up for debate.
Spoiler: it's caused by the fact that multiple endpoint addresses can identify the same endpoint (because of DNS names). One way to fix it is to remove the `--address` CLI argument from cardano-byron-proxy.
The SL node `c-a-2` and the Byron proxy `proxy` form the single link between the Shelley and Byron clusters, i.e. the only Byron node which communicates with `proxy` is `c-a-2`. We noticed that a network-transport-tcp timeout would occur sometimes. On closer inspection, it would occur only at `c-a-2` (the initiator) when contacting `proxy`. And further, it would occur only when `proxy` was also contacting `c-a-2` around that same time.
Two notes about the design of network-transport-tcp: [...] (in `ByteString` order) should send a special "crossed" message and then close. The other end will wait for that message to come.

When `c-a-2` and `proxy` each get a new block around the same time, they will announce its header to one another. Since there are no long-running network-transport-tcp lightweight connections (like a subscription) between these two, this brings up new heavyweight connections (TCP sockets), which can and sometimes do become crossed. In normal network-transport-tcp operation this is no problem: the protocol will work out which socket should stay and which should go.
However, in this case we have `proxy` claiming `proxy.cardano:5555:0` as its endpoint address, `c-a-2` claiming `127.1.0.2:3002`, and `c-a-2` connecting not to `proxy.cardano:5555:0` but to the numeric address it names, `127.1.0.9:5555:0`!
`proxy` will make the connection no problem. It will create its own socket, and when it sees the crossed connection from `127.1.0.2:3002:0`, it will check that indeed `"proxy.cardano:5555:0" > "127.1.0.2:3002:0"` and therefore will not respond on that socket (it will use the one that it created earlier).
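That check is just a byte-wise comparison of the two endpoint-address strings. A toy illustration (the function name is mine; the real decision logic in the library is more involved, but the ordering test is the core of it):

```python
def keeps_own_socket(local_addr, remote_addr):
    """On a crossed connection, model the endpoint as keeping the
    socket it initiated iff its own address compares greater
    (byte-wise string order) than the remote's -- mirroring the
    check described above."""
    return local_addr > remote_addr

# proxy's view of the crossing: "proxy.cardano..." sorts above
# "127.1.0.2..." (byte 'p' > byte '1'), so proxy ignores the incoming
# socket and keeps the one it opened itself.
print(keeps_own_socket("proxy.cardano:5555:0", "127.1.0.2:3002:0"))  # True
```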
On the other hand, `c-a-2` does not connect to `proxy.cardano:5555:0`, but rather to its true identity, `127.1.0.9:5555:0`! Crucially, it does not judge this to be a crossed connection, because it doesn't know that these strings identify the same endpoint. And so it times out waiting for a response, because `proxy` throws the socket away by the crossed-connection rule.
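The failure thus reduces to string identity: crossing can only be detected by comparing endpoint-address strings, and two different names for the same host defeat the check. A one-line illustration (illustrative name, not the library's API):

```python
def looks_crossed(dialed_addr, incoming_addr):
    """A crossed connection is only recognizable when the address we
    dialed is string-equal to the address on the incoming connection;
    a DNS name and its numeric equivalent compare unequal."""
    return dialed_addr == incoming_addr

# c-a-2 dialed the numeric address, but the incoming connection
# carries the proxy's advertised DNS-based address -- no match, so
# no crossing is detected and c-a-2 waits until it times out.
print(looks_crossed("127.1.0.9:5555:0", "proxy.cardano:5555:0"))  # False
```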
@deepfire I recommend never disabling the peer host check. Instead, use this revision of network-transport-tcp, which binds connecting sockets to `<IPv4host>:0`: https://github.com/avieth/network-transport-tcp/commit/46f2942423d9ad3c81040565a9e9d5b9e08ddcc4
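The idea of the linked fix, as I understand it, is to make the connecting socket's source address predictable by binding it before connecting; port 0 lets the kernel pick an ephemeral port while pinning the source IP. A minimal Python sketch of that bind-to-`<host>:0` pattern (illustrative, not the Haskell code from the commit):

```python
import socket

def connect_from(local_host, remote):
    """Open a TCP connection whose source IP is pinned to local_host.
    Binding to port 0 asks the kernel for any free ephemeral port,
    so only the source address -- not the port -- is fixed."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((local_host, 0))   # <IPv4host>:0, as in the linked commit
    sock.connect(remote)
    return sock

# Usage sketch: connect to the proxy with the advertised source IP,
# e.g. sock = connect_from("127.1.0.2", ("127.1.0.9", 5555))
```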
The issue was happily resolved.
In a mixed cluster setup, cardano-sl nodes periodically time out while connecting to the proxy. This leads to loss of block propagation in the Legacy -> Rewrite direction, as can be seen from the block propagation chart above.