chirpstack / chirpstack-gateway-bridge

ChirpStack Gateway Bridge abstracts Packet Forwarder protocols into Protobuf or JSON over MQTT.
https://www.chirpstack.io
MIT License

class C downlink issues #236

Open ederuiter opened 4 months ago

ederuiter commented 4 months ago

What happened?

Sometimes downlinks for class C devices are not sent immediately; they are delivered minutes later, after we receive another uplink.

What did you expect?

Since the idea of class C devices is that downlinks can be sent at any time, I would expect not to have to wait until we receive an uplink.

Details

We use ChirpStack to connect to the Helium network. This means gateways don't maintain an active connection to ChirpStack; instead, we only get a packet when a device sends an uplink. This leads to issues because chirpstack-gateway-bridge currently clears the gateway address information after 1 minute of inactivity ( https://github.com/chirpstack/chirpstack-gateway-bridge/blob/master/internal/backend/semtechudp/registry.go#L21 ). So when we want to send a downlink to a device, we need to do that within 1-2 minutes of the last uplink received via that gateway; otherwise the gateway's address has been cleared and the downlink cannot be sent. This is not an issue for class A/B devices, as their downlinks have to be sent within that timeframe anyway, but for class C devices it leads to issues.

For now we have deployed a local fix that increases gatewayCleanupDuration to 24h to mitigate this issue, but this is not a permanent fix.

Ideally the gateway address information would also be persisted and shared among all gateway-bridge instances for this region, as otherwise a reboot or load balancing could lead to the same issues.

Could you share your log output?

chirpstack-deployment-6bdb77dc77-kvxwf chirpstack-deployment 2024-05-24T12:50:48.311247817Z 2024-05-24T12:50:48.311062Z  INFO gRPC{uri=/api.DeviceService/Enqueue}: chirpstack::storage::device_queue: Device queue-item enqueued id=603965cc-a91b-418f-9517-11d95850c3b8 dev_eui=8c1f64a870000025
chirpstack-deployment-6bdb77dc77-kvxwf chirpstack-deployment 2024-05-24T12:50:48.655241160Z 2024-05-24T12:50:48.655018Z  INFO schedule{dev_eui=8c1f64a870000025 downlink_id=2394125123}: chirpstack::storage::device_queue: Device queue-item updated id=603965cc-a91b-418f-9517-11d95850c3b8 dev_eui=8c1f64a870000025
chirpstack-deployment-6bdb77dc77-kvxwf chirpstack-deployment 2024-05-24T12:50:53.679587870Z 2024-05-24T12:50:53.679184Z  INFO schedule{dev_eui=8c1f64a870000025 downlink_id=4261746332}: chirpstack::storage::device_queue: Device queue-item updated id=603965cc-a91b-418f-9517-11d95850c3b8 dev_eui=8c1f64a870000025
.. continues every 5 seconds
chirpstack-deployment-6bdb77dc77-kvxwf chirpstack-deployment 2024-05-24T12:56:47.143187955Z 2024-05-24T12:56:47.142963Z  INFO up{deduplication_id=07671fe3-efc1-4182-86d1-9593a60fefe0}:data_up{dev_eui="8c1f64a870000025"}:data_down{downlink_id=4007633762}: chirpstack::storage::device_queue: Device queue-item updated id=603965cc-a91b-418f-9517-11d95850c3b8 dev_eui=8c1f64a870000025
chirpstack-deployment-6bdb77dc77-kvxwf chirpstack-deployment 2024-05-24T12:56:47.213894563Z 2024-05-24T12:56:47.213751Z  INFO tx_ack{downlink_id=4007633762}: chirpstack::storage::device_queue: Device queue-item deleted id=603965cc-a91b-418f-9517-11d95850c3b8

Your Environment

Component Version
Chirpstack v4.7.1
Gateway Bridge 4.0.11

PS: Happy to help put together a PR, but let's first figure out how to approach this.

brocaar commented 4 months ago

I think the real issue is that the UDP protocol might be used in a different way from how it is intended to be used.

Each gateway must periodically send a PULL_DATA packet to keep the UDP route open (NAT and / or firewall). The normal PULL_DATA interval is 10 seconds. See also: https://github.com/Lora-net/packet_forwarder/blob/master/PROTOCOL.TXT#L289

This PULL_DATA is what is updating the state of the connection within the ChirpStack Gateway Bridge: https://github.com/chirpstack/chirpstack-gateway-bridge/blob/master/internal/backend/semtechudp/backend.go#L331

If Helium is not sending PULL_DATA packets, then I think it is the correct behavior that the ChirpStack Gateway Bridge invalidates the UDP connection. Due to the nature of UDP, you probably do not want to set the gatewayCleanupDuration to a very high value. A quick search on "udp nat timeout" and "udp firewall timeout" gives me numbers like 30 - 60 seconds.

ederuiter commented 4 months ago

Hmm, yeah, the Helium packet routers basically impersonate each gateway. I think they use a unique port number for each gateway so they can identify which gateway the packets are destined for. I agree it is a bit of a misuse of the protocol, but I can also see that it would not be feasible for the Helium packet routers to maintain active UDP connections to each LNS for every gateway that has received a packet for one of their devices (with ~400,000 active gateways).

In this specific case (Helium) we know that the ip:port of the gateways is (mostly) stable, as it doesn't refer to the gateway itself but to the Helium packet routers, which have public IPs and are not behind NAT/firewall etc. We could still have some timeout issues with load balancing on our side, but that is something we can (hopefully) tweak.

Would you accept a PR to make this setting configurable? That would allow us to easily work around this without having to recompile the gateway-bridge. I have also brought up this issue on the Helium side, and we are discussing what they can do on their side.

NB: yes, there are other options than Semtech UDP to connect ChirpStack LNSs to Helium:
1) via LoRaWAN roaming => unfortunately ChirpStack only supports roaming V1.0, and Helium uses V1.1
2) via the packet router (Helium-specific protocol) => this is probably the best option, but I am unsure how well this is supported on the Helium side; currently looking into this

brocaar commented 4 months ago

Happy to accept a PR to make this option configurable (with as default the current value), so that this can be adjusted.

With regard to a proper solution, I agree that 2) is probably the best option. E.g. we could create a chirpstack-helium-bridge that integrates with the Helium API and transforms the data into the ChirpStack MQTT format. I like 2) over 1) because it makes the architecture simpler and easier to debug. The roaming API makes things a lot more complex to debug, plus there are no inbound connections required from Helium > ChirpStack.

ederuiter commented 4 months ago

:+1: :100: agree

You can expect a PR from me/a colleague of mine to add gatewayCleanupDuration to the configuration in the next couple of days.

For the chirpstack-helium-bridge: Helium already has https://github.com/helium/helium-packet-router-ingest , which converts from packet router to GWMP/HTTP roaming. Should be easy enough :tm: to use this as a basis for it.

This would simplify deployments of Helium LNSs with ChirpStack a lot; I will talk to Helium about this and see how we can expedite it.