jpmeijers opened this issue 4 years ago
This issue is missing some details about your environment:

- `docker info`
- `/etc/docker/daemon.json`
- `docker inspect [container]` (where `[container]` is your "stack" container)

As always, you may redact sensitive information.
This is an issue with Docker itself and is unrelated to The Things Stack, so we won't be able to fix it here. We could, however, look at how to document it so that users are aware of the issue.
Here's a Gist with details on how to reproduce the problem using netcat: https://gist.github.com/htdvisser/d3a7aff5c05d417964ee33149f96e523
It looks like the only way to work around this issue is to set `network_mode: host` for the "stack" service, but I think that would then break routing to the databases.
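For reference, the workaround would look roughly like this in the Compose file (a sketch only; the service name `stack` is an assumption about your setup):

```yaml
services:
  stack:
    # Host networking keeps the real UDP source address, sidestepping the
    # conntrack/NAT problem entirely...
    network_mode: host
    # ...but Compose service names (e.g. the database containers) are no
    # longer resolvable from this container, which is the routing concern
    # mentioned above. "ports:" mappings are also discarded in this mode.
```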
This all assumes that our UDP packet filtering is a sealed solution, but I would argue that the current implementation is pretty rudimentary and there's enough room for improvement.
For example, a very simple next iteration is observing some PULL_DATA messages, measuring the interval between them, and taking that as a heartbeat. Only when a new address is seen for the same gateway while the heartbeat of the old address is still active should we filter the traffic from the new address. That can go quite fast, i.e. within 10-15 seconds; see the sketch below.
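To make the idea concrete, here's a minimal sketch of such a heartbeat-based filter (hypothetical code, not the actual Gateway Server firewall; the names and the 2x-interval threshold are illustrative assumptions). With a typical PULL_DATA interval of 5-10 seconds, a relocated gateway would be accepted within roughly the 10-15 seconds mentioned above:

```go
// Package udpfirewall sketches a heartbeat-based UDP source filter:
// learn each gateway's PULL_DATA interval and only block a new source
// address while the old address' heartbeat is still live.
package udpfirewall

import (
	"net"
	"sync"
	"time"
)

type gatewayState struct {
	addr      *net.UDPAddr  // last accepted source address
	lastPull  time.Time     // time of the last PULL_DATA from that address
	heartbeat time.Duration // measured PULL_DATA interval
}

// Firewall tracks per-gateway heartbeats, keyed by the 8-byte gateway EUI.
type Firewall struct {
	mu       sync.Mutex
	gateways map[[8]byte]*gatewayState
}

func New() *Firewall {
	return &Firewall{gateways: make(map[[8]byte]*gatewayState)}
}

// Filter reports whether a packet from addr for the given EUI should be dropped.
func (f *Firewall) Filter(eui [8]byte, addr *net.UDPAddr, isPullData bool, now time.Time) bool {
	f.mu.Lock()
	defer f.mu.Unlock()

	gs, ok := f.gateways[eui]
	if !ok { // first packet from this gateway: accept and start tracking
		gs = &gatewayState{addr: addr}
		if isPullData {
			gs.lastPull = now
		}
		f.gateways[eui] = gs
		return false
	}

	if gs.addr.IP.Equal(addr.IP) && gs.addr.Port == addr.Port {
		if isPullData { // same address: refresh the measured heartbeat
			if !gs.lastPull.IsZero() {
				gs.heartbeat = now.Sub(gs.lastPull)
			}
			gs.lastPull = now
		}
		return false
	}

	// New address: drop it only while the old address still looks alive,
	// i.e. within twice its measured PULL_DATA interval.
	if gs.heartbeat > 0 && now.Sub(gs.lastPull) < 2*gs.heartbeat {
		return true
	}

	// Old address looks dead: adopt the new one right away.
	gs.addr, gs.heartbeat, gs.lastPull = addr, 0, time.Time{}
	if isPullData {
		gs.lastPull = now
	}
	return false
}
```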
@johanstokking please update the issue with what you think needs to happen.
I don't think anything should happen here for the time being. Leaving this in the Backlog until we gather more ideas, thumbs-ups from other community members, and progressive insight.
I don't understand why the existing config doesn't cause these to drop. I'll have to set up a test env.
For example, we have a `connection_expires` setting, defaulting to 1 minute. That alone should be evicting stale gateways. What am I missing about the connection staying alive?
Even if the configured time period has been extended, the particular line of code blocking this situation should be adding `blockTime` (default 1m) to the `lastSeen` timestamp. How can the `lastSeen` timestamp be recent?
It is somehow interrelated with the Docker issue, but even then the observed behaviour so far doesn't match what we're seeing, as that issue describes packets being sent to a no-longer-existing container until the conntrack entries expire. Default expiry on Linux is 30s, but even 5m is shorter than what we're seeing in the real world. The TTN servers don't have any kernel options altered to make this super long, @htdvisser? E.g. `net.netfilter.nf_conntrack_udp_timeout` and/or `net.netfilter.nf_conntrack_udp_timeout_stream`.
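The effective timeouts can be checked on the host; the documented kernel defaults are 30s for plain UDP entries (`nf_conntrack_udp_timeout`) and 120s for assured streams (`nf_conntrack_udp_timeout_stream`):

```sh
sysctl net.netfilter.nf_conntrack_udp_timeout
sysctl net.netfilter.nf_conntrack_udp_timeout_stream
```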
If the packets are routing to the new container, so conntrack is fine, then the TTN UDP firewall could drop them with that message, except that the new container wouldn't have the gateway connection state and so should see all these connections as new (and not error): all the firewall state is in-memory, so it's ephemeral to the container. So I can't explain @jpmeijers's observation.
Ah, from @htdvisser on Slack, in #ops a while ago (4 months? Luckily I can still read it):

> The default timeout is 1 minute, but that doesn't help if the gateway is actually connected. UDP gateways typically open 2 UDP streams. One for uplink and another for downlink. If Docker somehow shows the correct IP for the downlink stream, but the incorrect IP for the uplink stream, The Things Stack will not expire the IP mapping.
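For what it's worth, both streams should be visible in the host's conntrack table, each with its own source port (this assumes the forwarder's default port 1700, and that your conntrack version supports the `--dport` alias):

```sh
sudo conntrack -L -p udp --dport 1700
```

If the two entries show different translated destination addresses, that matches the scenario quoted above.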
So in timeline terms:
Just clearing it from the connected GW list in the server isn't going to solve it, because the internal IP continues to be presented and the conntrack entry keeps being refreshed; it'd be recreated almost immediately with the 172.x address. App-level filtering of internal IPv4 addresses won't help either, as the packets will keep on coming from the GW because it never gets its PULL_DATA ACKs. The conntrack/Docker weirdness is the issue.
I'm seeing where you all came from now. I've not seen this level of subtle brokenness described on the Docker bug though, and @htdvisser has a repro for it (nice!), so that'll make it possible to get it fixed there.
Edit: found a bug for it, and the exact scenario description fits too.
Edit 2: this workaround looks viable. The code fix looks hard, as it's a consequence of Netfilter being protocol-agnostic; a patch on that doesn't sound likely to be accepted. Getting this conntrack behaviour done on container restart may be more likely, though.
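A possible shape for that (an untested sketch; the container name `stack` is assumed) would be to watch Docker events and flush UDP conntrack state whenever the container is started:

```sh
# Whenever the "stack" container starts, flush UDP conntrack entries so
# stale NAT mappings can't outlive the old container.
docker events --filter 'container=stack' --filter 'event=start' \
  --format '{{.Status}}' |
while read -r _; do
  sudo conntrack -D -p udp
done
```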
@ama9910 @KrishnaIyer this is exactly what happened recently in your case.
We are also suffering from this; most likely I'll just move to lora-basic-station.
This is indeed an unfortunate case of the UDP protocol implementation coupled with Docker. We do highly recommend switching to LoRa Basics Station.
As a short-term hack, you can try to clear the UDP conntrack state via:
```sh
sudo conntrack -D -p udp
```
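If flushing all UDP state is too aggressive, the deletion can presumably be narrowed to the packet forwarder's port (1700 by default; `--dport` support is an assumption about your conntrack version):

```sh
sudo conntrack -D -p udp --dport 1700
```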
> We are also suffering from this; most likely I'll just move to lora-basic-station.

That's a great idea.
Summary
Run The Things Stack using Docker Compose. Connect a gateway that uses the UDP packet forwarder. Update and restart the stack using `docker-compose pull` and then `docker-compose up --force-recreate --build -d`. After the stack is restarted, all gateway uplink traffic is dropped with the following error message:
To fix this, one can restart the gateway. This is however not possible in a production setup.
@htdvisser provided a couple of workarounds on Slack.
Why do we need this?
Document the process to follow to get UDP gateways connected directly after updating and restarting the stack. This is necessary in a production environment.
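A sketch of what that documented process could look like, based on the conntrack workaround above (assumes shell access to the Docker host and the `conntrack` tool installed):

```sh
docker-compose pull
docker-compose up --force-recreate --build -d
# Flush stale UDP conntrack entries so gateway traffic reaches the new container:
sudo conntrack -D -p udp
```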
Can you do this yourself and submit a Pull Request?
Probably
Environment
The Things Stack v3.6.0
docker-info.txt docker-inspect.txt
No `/etc/docker/daemon.json` on my system, as no custom configuration has been done.