Connection events are fired before the connection is added

TimonPost / laminar

A simple semi-reliable UDP protocol for multiplayer games

827 stars 66 forks source link

Connection events are fired before the connection is added #277

Closed ncallaway closed 4 years ago

ncallaway commented 4 years ago

Heartbeat messages do not begin sending if a Packet is received, they will only begin sending after an outbound Packet is sent. This causes two issues:

1) The opposite side of the connection will perceive this as a Timeout event and terminate the connection (since it doesn't receive any heartbeat messages).

2) This side of the connection will see each received heartbeat as a new connection event, which leads to repeated Connection events.

An example of this can be found here: https://github.com/ncallaway/laminar-heartbeat-demo

This creates a server (:12350) and a client (:12355). Each are configured to send heartbeats every 2 seconds and timeout after 5 seconds.

idle_connection_timeout: Duration::from_secs(5),
heartbeat_interval: Some(Duration::from_secs(2)),

The client, on startup will send a Ping to the server. The server can optionally respond with a pong message.

What we see is that if the server responds with the pong message, heartbeats flow in both direction and the connection is maintained on both ends.

If the server does not respond with a pong message, heartbeats only flow from the client to the server, until the client times out. The server sends a new Connection event for each heartbeat that it receives, and never reports a timeout event after the client stops sending heartbeat messages.

Ping

Wireshark Capture

ping-capture

Server

$ cargo run -- -s
[164.4µs] Bound to 127.0.0.1:12350
        [2.43s] Connection from: 127.0.0.1:12355
[2.43s] Received "Ping" from V4(127.0.0.1)
        [4.42s] Connection from: 127.0.0.1:12355
        [6.42s] Connection from: 127.0.0.1:12355

Client

$ cargo run -- -c
[147.0µs] Bound to 127.0.0.1:12355
        [269.30µs] Sent: Ping
        [5.00s] Timeout from: 127.0.0.1:12350

Ping w/ Server Pong

Wireshark Capture

ping-pong-capture-1

Server

$ cargo run -- -s -p
[116.8µs] Bound to 127.0.0.1:12350
        [1.32s] Connection from: 127.0.0.1:12355
[1.32s] Received "Ping" from V4(127.0.0.1)
        [1.32s] Sent: pong
        [64.33s] Timeout from: 127.0.0.1:12355

Client

$ cargo run -- -c
[154.3µs] Bound to 127.0.0.1:12355
        [273.42µs] Sent: Ping
[24.87ms] Received "pong" from V4(127.0.0.1)

# client interrupted

ncallaway commented 4 years ago

I've found a fix that appears to resolve that issue:

In the connection manager handling inbound packets:

https://github.com/amethyst/laminar/blob/07eef710787aef8a784c5dc268c153a89a3b92ce/src/net/connection_manager.rs#L107-L114

If we add disregard the comment and add the connections to the list, this behavior is largely resolved. We dispatch a Connection event on receiving an inbound message, and start sending outbound heartbeats because the connection is added to the list.

  conn.process_packet(messenger, payload, time);
  + self.connections.insert(address, conn);
}

My main concern is that doing this explicitly contradicts the comment in that section. Before I make a PR I want to do some git history digging and exploration and try to find where that came from, but it seems very intentional that we weren't adding the connection to the list.

If @TimonPost or anyone else knows about the history of that off the top of your heads, I'd be interested in hearing it. It makes sense to me to start a connection on either receiving an inbound packet or sending an outbound packet, but if there are reasons to the contrary they should be considered.

ncallaway commented 4 years ago

It looks like this was introduced in this PR: https://github.com/amethyst/laminar/pull/258. Given that was a large refactor, I'm somewhat inclined to think the decision not to include the connections on receive was a minor mistake as part of what looks like a really awesome refactor.

@fraillt I know it's been almost a year since that refactor so you might not remember, but is there any chance you remember why you decided to create the connection but not yet add it to the connections list?

fraillt commented 4 years ago

Hello, I don't remember exact events that lead to this, but it's certainly intensional. I remember having a discussion on discord and also created issue #242. The main idea behind this was that connection should only be accepted when a server actually responds to it. If you noticed it, the only place where a connection is added is when handling events from server. Later, followed a discussion, that it would be nice to have proper connection management. @TimonPost was actually working on DTLS (don't know what state it's in now), and I started working on https://gafferongames.com/post/client_server_connection/, as it was perceived a nice to have addition while DTLS is complete. Although I had almost completed it, I wasn't happy with it as it looked that it could be simplified a lot, I just didn't know how :) then, I had some personal issues and I never came back to it...

TLDR; It is intensional and it was always like that, even before refactoring. The main idea behind this was that connection should only be accepted when a server actually responds to it.

TimonPost commented 4 years ago

@fraillt is correct. There was an issue that noted a vulnerability for a vector attack that floods the memory in the protocol. In order to fix this, we have implemented a temporary workaround to only add the connection if the server responds with a message to the client. This behavior allows for simple connection management.

I want to remove this behavior and implement a more mature way of connection management. But while doing so I came to the conclusion the ecosystem isn't far enough yet and it would have cost me too much effort to implement handshaking my self. I wanted to use DTLS of this and that is why I wrote https://github.com/TimonPost/udp-dtls. However, this only supports one socket to one socket and not one socket with many sockets which is needed for server -> client connection management.

There are ways to implement this with openssl however those edits were needed: https://github.com/TimonPost/udp-dtls/issues/2. As the issue described lots of edits were needed to get this working and we would be stuck with a C dependency to opensll. And the only rust alternative would be Rustls, but this library doesn't support DTLS and is waiting for the draft of DTLS 1.3 to be completed.

So I would be left with the option or implement some kind of handshaking my self or implement DTLS 1.3 for Rustls. Because of my new year resolutions and lack of knowledge in this security aspect, I dropped the project.

ncallaway commented 4 years ago

That makes sense. I'll close the PR. I agree that implementing an proper handshake is a better resolution to this issue.

I'll start researching how I can help on that front, or if we just should cool our heels for the DTLS 1.3 draft.

In the meantime, I think this is still a pretty significant issue for laminar users. This is mentioned on the important notices page, but I think some users are still missing this fact.

I think there are two or three things we could do to help new laminar users have the right expectations around this behavior:

Normalize SocketEvent::Connection events to only occur when the connection is actually added to the list. This will both prevent the repeated connections behavior described here, but would also force users to discover this behavior earlier (since most users probably interact with the connection event early). This is the main thing that tripped me up, because even though I had read the notice about connections and knew I needed to pong from the server, I interpreted the connection event as meaning heartbeats would be flowing.
Add documentation of this behavior in key places (like SocketEvent::Connection). This way, if a user expects a connection to happen on just a packet receipt will be confused about why a connection event isn't firing and hopefully check the Connection documentation.
I feel less certain about this one, but we might also change the requirement to establish a connection to both the send and receipt of a packet. This change would be more to simplify the mental model for a user of a connection: A connection is an endpoint to which you have both sent and received a Packet from within the timeout window. It makes the concept of a connection feel more symmetric between client and server.

I'll start exploring the first two as a resolution to this issue, and we can have a conversation around the third item.

TimonPost commented 4 years ago

Awesome! Thanks for your contributions already. It would be amazing to finish this protocol for it to be working fluently.

In the meantime, I think this is still a pretty significant issue for laminar users. This is mentioned on the important notices page, but I think some users are still missing this fact.

I fully agree with you. I would maybe even say let's remove this whole 'pong' idea and go for 3. I wonder whether the risk (as described above) is greater then the struggles it gives us. It is a quick fix that has been here for 1 year and has caused more confusion than that it solved. What do you think?

Re: implementing DTLS, I tried it, but it requires much knowledge about TLS, SSL, handshaking, and security. For me personally, I don't have that knowledge that's why I took my hands back such that I would not burn them. But if you feel comfortable in those fields and can help RustLS that would be amazing. There was an existing pull request for DTLS 1.2. I even rewrote that one to be compatible with the newest Rustls version. Because that pull request dates back 2 years ago. Lots of things have changed since that time. My pull request would be a good start.

ncallaway commented 4 years ago

I definitely don't think I'm to to implementing DTLS on my own without guidance. I think on that front, we should probably wait for DTLS 1.3 and rustls to work on it. I think I'd be comfortable helping implement DTLS 1.3 with guidance from the author of rustls, but that'll need to wait until he's ready to work on it.

What I was thinking in the meantime is that laminar could implement a custom "laminar handshake" that might be something like a very-simplified toy version of DTLS, with the plan of migrating the "laminar handshake" to DTLS 1.3 when rustls makes it available. The custom handshake could include a basic mitigation against flooding attacks.

I think tackling something like the custom handshake (with some advice on how to integrate it into the existing laminar structure) would be pretty feasible for me, but might take a bit of planning.

I fully agree with you. I would maybe even say let's remove this whole 'pong' idea and go for 3. I wonder whether the risk (as described above) is greater then the struggles it gives us. It is a quick fix that has been here for 1 year and has caused more confusion than that it solved. What do you think?

Alternatively, if others agree that we're willing to sacrifice some flooding safety here for this, we could come back to this pull request.

We could add a little bit more safety to this PR by doing something like counting the number of unacknowledged connections (where the calling code hasn't ever sent an outbound packet), and limit those connections to some fixed size (say 15 or something). If we are already maxed out on unacknowledged connections, then incoming packets would not be added to the connections map (the same behavior as now).

This would require adding one new bool field to the VirtualConnections, and a new u8 field to the ConnectionManager.

I think this would give us back some basic flooding safety, by putting a hard cap on the amount of memory UDP flooding could consume. This second approach would be pretty quick for me to implement.

ncallaway commented 4 years ago

I keep throwing out options, but you also asked my opinion so I'll give that too.

I think the easiest path forward (outlined more specifically below) would put us on a good path to implementing a stub-DTLS-like-handshake, which would put us on a good path to implementing DTLS 1.3 whenever it's available.

So I'd recommend the following path:

Now: Simple Connection changes
Next: Add stub-DTLS-like-handshake
Eventually: Replace the stub-DTLS-like-handshake with DTLS 1.3 from rustls

My only concern with this approach is the amount of churn it might add to the laminar interface and protocol. I know we're pre-1.0, but I'm not sure how many times we want to change behavior around Connections. If we want to minimize churn, then I think we would rather just go straight to the stub-DTLS-like-handshake.

Simple Connection changes

We define a Connection (as far as the events / interface / documentation is concerned) as both a send and a receive of a user Packet
Internally, create and store VirtualConnection on either a send or receive.
Add a flag to each VirtualConnection to track whether it has ever had an inbound Packet and an outbound Packet
Limit the number of inbound, unacknowledged VirtualConnections to some fixed (configurable) size.
Heartbeats are only sent for a VirtualConnection that has both sent and received a packet.

This behavior would still require a laminar user to respond to each inbound message in order for a Connection event to be triggered. The SocketEvent::Connect events would be symmetric between the client and the server (e.g. they both dispatch a Connect event once the server respond's with a message. Additionally, heartbeats would match the SocketEvent::Connect dispatch.

TimonPost commented 4 years ago

I like your proposal for preventing flooding attacks while removing the current behavior that causes problems. It seems to me that this change is a nice intermediate solution between removing and keeping the functionality. That said you mention " but I'm not sure how many times we want to change behavior around Connections", With your proposal, we would not break any user code because their handshake would maybe be unnecessarily but still useful to mark the connection as 'trusted'.

I would suggest not to go for a stub-DTLS-like-handshake protocol. For the following reasons:

The work will be for nothing when DTLS 1.3 is released.
Simple connection management would fulfill the needs for laminar now, which are connection events and prevention of flooding.

ncallaway commented 4 years ago

Great. I'll start on that over the weekend. Shouldn't take long

TimonPost commented 4 years ago

This should be fixed by now