When connecting to a NATS server from Windows, if the connection to
the NATS server unceremoniously disappears (e.g., somebody yanks the
network cable out of the machine), rants would go into an infinite
loop, repeatedly logging the following, while maxing out the CPU:

    TCP socket error, err: An existing connection was forcibly closed
    by the remote host. (os error 10054)

When processing the call to reader.next() in the main message-handling
loop, an error of this kind would be treated as recoverable, and we
would continue into the loop again. reader.next() would immediately
return the same error again, leading to an infinite loop.

Interestingly, this does not appear to affect Linux clients;
monitoring the socket with netstat there shows the connection still
listed as ESTABLISHED. On Windows, on the other hand, the socket
quickly disappears from netstat's output once the server's network
connection drops.

The core of this fix is changing that continue to a break, so that
such an error is treated as a disconnection instead of being retried.
With this change, we avoid the infinite loop.
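
As a rough illustration of why that one-word change matters, here is a
minimal sketch in plain Rust. It does not use the real rants or tokio
types: `StuckReader`, `ErrorPolicy`, `run_loop`, and the `max_polls` cap
are hypothetical names made up for this example, and the cap exists only
so that the old behavior terminates in a demo.

```rust
use std::io;

/// Hypothetical stand-in for the decoder stream after the peer has vanished:
/// every call to `next()` yields the same connection-reset error.
struct StuckReader;

impl Iterator for StuckReader {
    type Item = Result<String, io::Error>;

    fn next(&mut self) -> Option<Self::Item> {
        Some(Err(io::Error::new(
            io::ErrorKind::ConnectionReset,
            "An existing connection was forcibly closed by the remote host.",
        )))
    }
}

/// What the loop does when `next()` returns an error.
#[derive(Clone, Copy, Debug, PartialEq)]
enum ErrorPolicy {
    /// Old behavior: treat the error as recoverable and `continue`.
    Retry,
    /// New behavior: treat the error as a disconnection and `break`.
    Disconnect,
}

/// Runs a message loop and returns how many times `next()` was polled.
/// `max_polls` is a demo-only safety cap (an assumption of this sketch).
fn run_loop<R>(mut reader: R, policy: ErrorPolicy, max_polls: usize) -> usize
where
    R: Iterator<Item = Result<String, io::Error>>,
{
    let mut polls = 0;
    while polls < max_polls {
        polls += 1;
        match reader.next() {
            Some(Ok(_message)) => { /* handle the message here */ }
            Some(Err(_e)) => match policy {
                // Goes straight back to `next()`, which fails the same way.
                ErrorPolicy::Retry => continue,
                // Leaves the loop so the caller can handle the disconnect.
                ErrorPolicy::Disconnect => break,
            },
            None => break, // the stream ended normally
        }
    }
    polls
}
```

With a reader that keeps returning the same error,
`run_loop(StuckReader, ErrorPolicy::Retry, 1000)` uses up all 1000
polls, while `ErrorPolicy::Disconnect` stops after the first one.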

Since the value returned from the decoder stream is a deeply nested
type with multiple Result layers, and since the destructuring of that
type was repeated in multiple places, I encoded the logic in a
separate "disposition" enum, which should hopefully make reasoning
about the various cases a bit easier.
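
To give a feel for the idea, the following is a sketch of what such a
"disposition" enum could look like. It is not the code from this change.
The nested item shape (`StreamItem`), the `Message` and `ProtocolError`
types, the variant names, and the choice to skip frames that fail to
decode are all assumptions made for illustration.

```rust
use std::io;

/// Hypothetical message and protocol-error types, for illustration only.
#[derive(Debug, PartialEq)]
struct Message(String);

#[derive(Debug, PartialEq)]
struct ProtocolError(String);

/// Assumed shape of one item from the decoder stream: the stream may end
/// (`None`), the transport may fail (outer `Err`), or a frame may fail to
/// decode (inner `Err`).
type StreamItem = Option<Result<Result<Message, ProtocolError>, io::Error>>;

/// What the message loop should do with one stream item.
#[derive(Debug, PartialEq)]
enum Disposition {
    /// A well-formed message: process it and keep looping.
    Handle(Message),
    /// A frame that could not be decoded: log it and keep looping
    /// (an assumption of this sketch).
    Skip(String),
    /// The connection is gone: leave the loop.
    Disconnect(String),
}

impl Disposition {
    /// Collapses the nested type into a single decision, in one place.
    fn from_item(item: StreamItem) -> Self {
        match item {
            Some(Ok(Ok(message))) => Disposition::Handle(message),
            Some(Ok(Err(e))) => Disposition::Skip(e.0),
            Some(Err(e)) => Disposition::Disconnect(e.to_string()),
            None => Disposition::Disconnect("stream closed".to_string()),
        }
    }

    /// True when the loop should `break` rather than `continue`.
    fn ends_connection(&self) -> bool {
        matches!(self, Disposition::Disconnect(_))
    }
}
```

The point of the design is that each call site matches on a flat enum
with named cases instead of repeating the nested `Option`/`Result`
destructuring, so a transport error such as os error 10054 is mapped to
a disconnect in exactly one place.
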
@christophermaier Will the supervisor still be able to reconnect after the connection to automate is restored? I'm assuming it'll still attempt to connect for each health-check.

Signed-off-by: Christopher Maier <christopher.maier@gmail.com>