TheThingsNetwork / lorawan-stack

The Things Stack, an Open Source LoRaWAN Network Server
https://www.thethingsindustries.com/stack/
Apache License 2.0

Documentation: Gateway Packet Filtered after restarting stack while UDP gateways are still connected #2075

Open jpmeijers opened 4 years ago

jpmeijers commented 4 years ago

Summary

Run The Things Stack using Docker Compose. Connect a gateway that uses the UDP packet forwarder. Update and restart the stack using docker-compose pull and then docker-compose up --force-recreate --build -d.

After the stack is restarted all gateway uplink traffic is dropped with the following error message:

stack_1      | 2020-02-28T06:46:35.089035340Z   WARN Packet filtered                          connected_ip=192.168.16.1 connecting_ip=169.0.x.x error=error:pkg/gatewayserver/io/udp:already_connected (gateway is already connected) gateway_eui=B827EBFFFExxxxxx namespace=gatewayserver/io/udp

To fix this, one can restart the gateway. This is, however, not feasible in a production setup.

@htdvisser provided a couple of workarounds on Slack

This can indeed happen on some systems. Has something to do with Docker and UDP connection tracking in the kernel. Can be resolved by:

  • Waiting for the packet filter of The Things Stack to expire (but that will result in incorrect IPs)
  • Reconnecting the gateway so that it uses a different source port and the kernel sees it as a new connection
  • Running conntrack -D -p udp to flush the states of UDP connections on your host OS

Why do we need this?

Document the process to follow to get UDP gateways connected directly after updating and restarting the stack. This is necessary in a production environment.

Can you do this yourself and submit a Pull Request?

Probably

Environment

The Things Stack v3.6.0

docker-info.txt docker-inspect.txt

No /etc/docker/daemon.json on my system, as no custom configuration has been done.

htdvisser commented 4 years ago

This issue is missing some details about your environment:

As always, you may redact sensitive information.

htdvisser commented 4 years ago

This is an issue with Docker itself and is unrelated to The Things Stack, so we won't be able to fix it here. We could, however, look into how we can document it so that users are aware of the issue.

Here's a Gist with details on how to reproduce the problem using netcat: https://gist.github.com/htdvisser/d3a7aff5c05d417964ee33149f96e523
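Separately from the gist (which uses netcat), a small listener like the one below can help confirm which source address the kernel actually presents inside the container, i.e. whether packets from a long-connected gateway arrive with the gateway's public IP or with the Docker bridge IP (such as the 192.168.16.1 in the log above). The port 1700 and buffer size are assumptions; adjust to your deployment.

```go
// udp-peek: prints the source address of every UDP datagram it receives.
// Run it inside the container on the packet-forwarder port to see which
// source address the kernel presents after a restart.
package main

import (
	"log"
	"net"
)

func main() {
	// Port 1700 is the conventional Semtech UDP packet-forwarder port.
	conn, err := net.ListenUDP("udp", &net.UDPAddr{Port: 1700})
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	buf := make([]byte, 65507) // maximum UDP payload size
	for {
		n, addr, err := conn.ReadFromUDP(buf)
		if err != nil {
			log.Fatal(err)
		}
		log.Printf("received %d bytes from %s", n, addr)
	}
}
```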

htdvisser commented 4 years ago

It looks like the only way to work around this issue is to set network_mode: host for the "stack" service, but I think that would then break routing to the databases.

johanstokking commented 4 years ago

This all assumes that our UDP packet filtering is a sealed solution, but I would argue that the current implementation is pretty rudimentary and there's enough room for improvement.

For example, a very simple next iteration is observing some PULL_DATA messages, measuring the interval between them and taking that as the heartbeat. Only when a new address is seen from the same gateway and the heartbeat of the old address is still active should we filter the traffic from the new address. That can go quite fast, i.e. within 10-15 seconds.
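A rough sketch of what such a heartbeat-based filter could look like (all names, the structure and the 15-second fallback are illustrative, not the current lorawan-stack implementation):

```go
// Sketch of the proposed heartbeat-based address filter. All names and
// intervals are illustrative; this is not the lorawan-stack implementation.
package udpfilter

import (
	"net"
	"sync"
	"time"
)

type gatewayState struct {
	addr         *net.UDPAddr  // address the gateway is currently mapped to
	lastPullData time.Time     // when the last PULL_DATA was seen from addr
	pullInterval time.Duration // observed interval between PULL_DATA messages
}

type Filter struct {
	mu       sync.Mutex
	gateways map[string]*gatewayState // keyed by gateway EUI
}

func NewFilter() *Filter {
	return &Filter{gateways: make(map[string]*gatewayState)}
}

// HandlePullData is called for every PULL_DATA. It returns false when the
// packet should be filtered because the gateway's previous address still
// has an active heartbeat.
func (f *Filter) HandlePullData(eui string, addr *net.UDPAddr, now time.Time) bool {
	f.mu.Lock()
	defer f.mu.Unlock()

	st, ok := f.gateways[eui]
	if !ok {
		f.gateways[eui] = &gatewayState{addr: addr, lastPullData: now}
		return true
	}
	if st.addr.String() == addr.String() {
		// Same address: update the observed heartbeat interval.
		st.pullInterval = now.Sub(st.lastPullData)
		st.lastPullData = now
		return true
	}
	// New address for a known gateway: only accept it once the old address
	// has missed a couple of heartbeats (the grace period is a guess).
	grace := 2 * st.pullInterval
	if grace == 0 {
		grace = 15 * time.Second
	}
	if now.Sub(st.lastPullData) < grace {
		return false // old address still alive, filter the new one
	}
	f.gateways[eui] = &gatewayState{addr: addr, lastPullData: now}
	return true
}
```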

htdvisser commented 4 years ago

@johanstokking please update the issue with what you think needs to happen.

johanstokking commented 4 years ago

I don't think something should happen here for the time being. Leaving this in Backlog until we gather more ideas, thumbs ups from other community members and progressive insight.

Bwooce commented 3 years ago

I don't understand why the existing config doesn't cause these to drop. I'll have to set up a test env.

For example, we have a connection_expires setting, defaulting to 1 minute.

That alone should be evicting stale gateways. What am I missing about the connection staying alive?

Even in the case that the configured time period has been extended, the particular line of code blocking this situation should be adding blockTime (default 1m) onto the lastSeen timestamp. How can the lastSeen timestamp be recent?
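For illustration, a sketch of the expiry/blocking logic being discussed (the names connectionExpires, blockTime and lastSeen follow this discussion, not the actual lorawan-stack source):

```go
// Illustration of the expiry/blocking behaviour under discussion; not the
// actual lorawan-stack code.
package udpfilter

import (
	"net"
	"time"
)

type connection struct {
	addr     *net.UDPAddr
	lastSeen time.Time
}

// shouldFilter reports whether a packet from newAddr must be dropped as
// "already connected". The crux of the question: if PULL_DATAs keep arriving
// from the old (Docker-internal) address, lastSeen stays recent and the
// block on the new address never expires.
func shouldFilter(c *connection, newAddr *net.UDPAddr, now time.Time, connectionExpires, blockTime time.Duration) bool {
	if now.Sub(c.lastSeen) > connectionExpires {
		return false // stale connection: let the new address take over
	}
	if c.addr.String() == newAddr.String() {
		c.lastSeen = now // same address: refresh and accept
		return false
	}
	// Different address while the old one was seen within blockTime:
	// this is the "already_connected" / "Packet filtered" case.
	return now.Sub(c.lastSeen) < blockTime
}
```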

Bwooce commented 3 years ago

It is somehow interrelated with the Docker issue, but even then the behaviour observed so far doesn't match what we're seeing, as that issue describes packets being sent to a "no longer existing" container until the conntrack entries expire. The default expiry on Linux is 30s, but even 5m is shorter than what we're seeing in the real world -- the TTN servers don't have any kernel options altered to make this super long, @htdvisser? e.g. net.netfilter.nf_conntrack_udp_timeout and/or net.netfilter.nf_conntrack_udp_timeout_stream

If the packets are routing to the new container, so conntrack is fine, then the TTN UDP firewall could drop them with that message, except that the new container wouldn't have the gateway connection state and so should see all these connections as new (and not error): all the firewall state is in memory and therefore ephemeral to the container. So I can't explain @jpmeijers's observation.

Bwooce commented 3 years ago

Ah, from @htdvisser on Slack, on #ops a while ago (4 months? Luckily I can still read it) "The default timeout is 1 minute, but that doesn't help if the gateway is actually connected. UDP gateways typically open 2 UDP streams. One for uplink and another for downlink. If Docker somehow shows the correct IP for the downlink stream, but the incorrect IP for the uplink stream, The Things Stack will not expire the IP mapping"
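For reference, a minimal sketch of why those two streams exist: a Semtech UDP packet forwarder typically opens separate sockets (and therefore separate source ports, and separate conntrack entries) for uplink PUSH_DATA and downlink PULL_DATA, which is how the two can end up translated to different source addresses. This is a toy illustration, not a real packet forwarder; the EUI and server address are placeholders.

```go
// Toy illustration of a gateway's two UDP streams: one socket for uplink
// (PUSH_DATA) and another for downlink keep-alives (PULL_DATA). Each socket
// gets its own source port, hence two conntrack entries.
package main

import (
	"crypto/rand"
	"log"
	"net"
	"time"
)

const (
	pushData = 0x00 // uplink / status
	pullData = 0x02 // downlink keep-alive
)

// header builds the 12-byte Semtech UDP header: protocol version, random
// token, packet type and the 8-byte gateway EUI.
func header(pktType byte, eui [8]byte) []byte {
	h := make([]byte, 12)
	h[0] = 0x02 // protocol version 2
	rand.Read(h[1:3])
	h[3] = pktType
	copy(h[4:], eui[:])
	return h
}

func main() {
	eui := [8]byte{0xB8, 0x27, 0xEB, 0xFF, 0xFE, 0x00, 0x00, 0x00} // placeholder EUI
	server := "localhost:1700"                                     // adjust to your Gateway Server

	up, err := net.Dial("udp", server) // uplink stream
	if err != nil {
		log.Fatal(err)
	}
	down, err := net.Dial("udp", server) // downlink stream, different source port
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("uplink from %s, downlink from %s", up.LocalAddr(), down.LocalAddr())

	for {
		// Status uplink on the first socket, keep-alive on the second.
		up.Write(append(header(pushData, eui), []byte(`{"stat":{}}`)...))
		down.Write(header(pullData, eui))
		time.Sleep(30 * time.Second)
	}
}
```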

So in timeline terms:

  1. Server restarts, zero UDP GWs known.
     1a. If a GW sends a PULL_DATA keep-alive before the next status PUSH_DATA, then the server sees the PULL_DATA with the internal IP. It can't get a reply to the GW (internal IP), but with this protocol that doesn't matter terribly; the GW will just keep sending PULL_DATAs in the hope it'll get an ACK/RESP. Any PUSH_DATA comes with a different (correct) IP, which generates the error. OR
     1b. If a GW sends a PUSH_DATA uplink/status message before the PULL_DATA, then the server sees the PUSH_DATA and logs/uses that IP. Possibly this half-works (arguably worse), or more likely, as the PULL_DATAs are now being rejected due to a different IP, the GW treats the lack of a PULL_DATA ACK as a reason to tear down both connections and then we're back to [1a].
  2. Because the GW keeps sending PULL_DATAs that come through with the internal IPs, the lastSeen is valid and the conntrack entry keeps being refreshed. This situation can persist indefinitely (I had it > 30 days for one GW).

Just clearing it from the connected GW list in the server isn't going to solve it, because the internal IP continues to be presented and the conntrack entry keeps being refreshed -- it'd be recreated almost immediately with the 172.x address. App-level filtering of internal IPv4 addresses won't help either, as the packets will keep on coming from the GW because it never gets its PULL_DATA ACKs; the conntrack/Docker weirdness is the issue.

I'm seeing where you all came from now. I've not seen this level of subtle brokenness described on the Docker bug though, and @htdvisser has a repro for it (nice!), so that'll be possible to get fixed there. edit: Found a bug for it, and the exact scenario description fits too. edit2: This workaround looks viable. The code fix looks hard as it's a consequence of Netfilter being protocol-agnostic; a patch on that doesn't sound likely to be accepted. Getting this conntrack behaviour done on container restart is maybe more likely though.
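One possible way to automate that conntrack flush on the host when recreating the stack, purely as a sketch (it assumes root, docker-compose and conntrack-tools on the host, and is not an officially supported procedure):

```go
// Sketch: recreate the stack and then flush UDP conntrack entries so that
// traffic from already-connected gateways is re-evaluated with the correct
// source address. Run as root on the host.
package main

import (
	"log"
	"os/exec"
)

func main() {
	// Recreate the stack (same command as in the issue description).
	if out, err := exec.Command("docker-compose", "up", "--force-recreate", "-d").CombinedOutput(); err != nil {
		log.Fatalf("docker-compose failed: %v\n%s", err, out)
	}
	// Flush UDP conntrack state (the workaround from earlier in this issue).
	// Failures are only logged: conntrack may report an error when there is
	// nothing to delete.
	if out, err := exec.Command("conntrack", "-D", "-p", "udp").CombinedOutput(); err != nil {
		log.Printf("conntrack flush: %v\n%s", err, out)
	}
}
```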

johanstokking commented 3 years ago

@ama9910 @KrishnaIyer this is exactly what happened recently in your case.

virtualguy commented 3 years ago

We are also suffering from this; most likely I'll just move to lora-basic-station.

KrishnaIyer commented 3 years ago

This is indeed an unfortunate case of the UDP protocol implementation coupled with Docker. We highly recommend switching to LoRa Basics Station.

As a short-term hack, you can try to clear the UDP conntrack state via:

$ sudo conntrack -D -p udp
johanstokking commented 3 years ago

We are also suffering from this; most likely I'll just move to lora-basic-station.

That's a great idea.