cloudflare / cloudflared

Cloudflare Tunnel client (formerly Argo Tunnel)
https://developers.cloudflare.com/cloudflare-one/connections/connect-apps/install-and-setup/tunnel-guide
Apache License 2.0
9.4k stars 837 forks

💡Reduce constant minimum Tunnel connection bandwidth (follow-on issue) #949

Open iicdev opened 1 year ago

iicdev commented 1 year ago

Describe the feature you'd like

This is a specific follow-up to a closed issue: "Significantly reduce basic Tunnel bandwidth for cell modem #935". The response there partially resolved the core issue (see below). We use cell modems at some customer sites to connect to our AI-based BMS (building management system), which runs on Windows and controls tens of IoT control devices. We need to minimize expensive cell modem bandwidth, and we still need each CONNECTION to consume less of it. See the attached screenshots, which show that each connection uses 0.8-1.1 MB/hour simply to maintain itself, not counting application data exchanged. We accept lower reliability and slow reconnection of a lost connection. We would expect the solution to be sending metrics & keep-alive packets much less frequently.

Describe alternatives you've considered

--metrics-update-freq 30 (default is 5): in early testing, we didn't see any improvement.

--ha-connections 1: per the above-referenced issue #935, this DID reduce the number of connections from 4 to 1, which did indeed reduce bandwidth by 75%. Thank you!

Additional context

Screenshot of cell modem traffic in MB attached. Hourly periods under 0.25 MB are with no Tunnel running. Those near or above 1 MB are with the Tunnel running with only ONE connection (--ha-connections 1). That one connection typically consumes 1.1 MB (Tunnel running) - 0.25 MB (Tunnel not running) = 0.85 MB/hour. Ignore the hours with larger MB figures; those are application or Windows traffic.

[Image: CellModemTraffic-TunnelRunning-NotRunning]
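To put the per-hour figure in context, here is a minimal sketch of the arithmetic described above, extended to a day and to a 30-day billing period. The two hourly figures come from the post; the 30-day period and the function name are assumptions for illustration only.

```python
# Hourly traffic figures quoted in the post (MB per hour).
WITH_TUNNEL_MB = 1.1      # Tunnel running, one connection (--ha-connections 1)
WITHOUT_TUNNEL_MB = 0.25  # Tunnel not running


def tunnel_overhead_mb_per_hour(with_tunnel: float, without_tunnel: float) -> float:
    """Traffic attributable to keeping the single Tunnel connection alive."""
    return with_tunnel - without_tunnel


overhead = tunnel_overhead_mb_per_hour(WITH_TUNNEL_MB, WITHOUT_TUNNEL_MB)
per_day = overhead * 24
per_30_days = per_day * 30  # assumption: a 30-day billing period

print(f"{overhead:.2f} MB/hour, {per_day:.1f} MB/day, {per_30_days:.0f} MB per 30 days")
```

With the quoted figures this works out to 0.85 MB/hour, 20.4 MB/day, and about 612 MB per 30 days of idle overhead for one connection.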

obezuk commented 1 year ago

Glad the previous suggestion has helped reduce bandwidth consumption. metrics-update-freq affects how often the local metrics endpoint gets updated. This data isn't pushed to Cloudflare, so it has no effect on bandwidth.

The system is designed to prioritize resilience in this case and we don't have any client-side levers that could be used to reduce keep-alive bandwidth.

What would good look like here? How long could you tolerate a tunnel being offline before it attempted to restore a connection?

If you're connected to an account team at Cloudflare I think it would be best to follow up with your account manager to discuss in more depth.

iicdev commented 1 year ago

Tim, thanks for the helpful response.

We would like to further reduce bandwidth, typically by a factor of 2 to 4, for each connection during normal access periods.

Yes, we clearly understand the tradeoff, and that your current use case is high reliability & uninterruptibility. We can accept a delay of 30-60 seconds in connection restoration during normal access periods. We currently shut the Tunnel down completely during periods when connections are unlikely or low priority (using taskkill & cron); we would clearly like to be able to leave the Tunnel running at even lower reliability (e.g. 5-10 minutes recovery) during those periods (we would expect to kill the running Tunnel & restart it with different options).

Incidentally, do you realize why we have this requirement? Cell providers do not allow inbound IP connections of any type to the cellular modem. Outbound connections are open. (Yes, we're looking at some defensive additions to our own outbound system heartbeat, but would rather be able to leave it all under Tunnel control.)

We are happy to provide continuing data on this use case if it's helpful to you - there are certainly others with this need, & IoT requirements barrelling toward you. Thanks again for the effective communication.

morpig commented 1 year ago

@iicdev you have an interesting use case. have you tried switching between the quic & http2 protocols to see the difference?

also might be off-topic, but have you considered other solutions too (e.g. tailscale, mesh vpns), especially in a bandwidth-constrained env?

iicdev commented 1 year ago

@morpig Thanks so much for the idea - we will carefully test http2 protocol. Will post results for community reference.
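For anyone who wants to run the same comparison, here is a minimal sketch that builds the two command lines to test, one per protocol, with a single connection. It only prints the commands and does not start cloudflared. The tunnel name `my-tunnel` and the helper function are hypothetical, and placing the flags before `run` is an assumption to check against your cloudflared version.

```python
import shlex

# Protocols to compare, as suggested in the thread.
PROTOCOLS = ("quic", "http2")


def cloudflared_command(tunnel_name: str, protocol: str, ha_connections: int = 1) -> list[str]:
    """Build the argument list for one test run (hypothetical helper).

    Assumption: tunnel-level flags are placed before the `run` subcommand.
    """
    if protocol not in PROTOCOLS:
        raise ValueError(f"unsupported protocol for this test: {protocol}")
    return [
        "cloudflared", "tunnel",
        "--protocol", protocol,
        "--ha-connections", str(ha_connections),
        "run", tunnel_name,
    ]


# "my-tunnel" is a placeholder name, not a real tunnel.
for protocol in PROTOCOLS:
    print(shlex.join(cloudflared_command("my-tunnel", protocol)))
```

Running each variant for the same number of idle hours and reading the modem's hourly counters, as in the original screenshot, would give a like-for-like comparison.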

Note that NO inbound connections are allowed on a cellular modem, so any other solution would still need a continuous outbound tunnel - e.g. tailscale would need DERP servers anyhow. So no conceptual win, except of course possible lighter-weight tunnelling. Yes, interesting.

ashneilson commented 11 months ago

@iicdev I'm curious how your evaluation of http2 instead of quic is going. Have you managed to reduce data usage to your ideal level?

iicdev commented 10 months ago

In the end, setting ha-connections = 1 reduced it sufficiently that we didn't take time to test http2. (Even further would be good!) If we do so in the future, we will post the results.

ashneilson commented 10 months ago

Thanks for the update @iicdev.

Do you also find the status of any tunnels with ha-connections = 1 shows as Degraded?

[image]

2788west commented 6 months ago

I have been working on a number of IoT services for different clients over the years (all using cellular modems), so I second this feature request. Cloudflared is better than many other solutions we've tested in terms of the device client, security, and ease of use, but the keep-alive communication of cloudflared is currently a significant cost driver.

As suggested previously, it would be great to have a configurable communication cycle time that would allow e.g. a 5 minute cycle time, which would be completely acceptable for field applications that only require occasional remote management. A reduced communication cycle should also significantly reduce data usage as far as I understand.

ashneilson commented 6 months ago

@obezuk What are your thoughts on how likely we are to see a solution for this?

Is it too niche / not a use case Cloudflare is focused on?

iicdev commented 6 months ago

@obezuk @ashneilson @2788west I fear they are being narrow-minded in giving this IoT market low priority. There are only 1 to 3 seemingly simple parameters which would handle this: retry count, time delay and timeout. We looked at the code, and with a few additional lines at each of those locations, plus the simple mapping of command-line parameters to internal variables, it's done. As I remember, the key locations for modifications are in a 3rd party library (probably for quic), not Cloudflared code (except the calls to those libraries & the parameter setting), so you're dealing with 2 levels of open source contribution. Since we are not ongoing contributors to either code base, we are sadly not able to confidently submit changes like this.