cloudflare / cloudflared

Cloudflare Tunnel client (formerly Argo Tunnel)
https://developers.cloudflare.com/cloudflare-one/connections/connect-apps/install-and-setup/tunnel-guide
Apache License 2.0

🐛"stream #### canceled by remote with error code 0" connIndex=0 event=1 ingressRule=0 originService=" Started roughly 7/19 #1300

Open dledfordcf opened 1 month ago

dledfordcf commented 1 month ago

Describe the bug Opening this as a centralized place for this issue.

To Reproduce Was unable to directly reproduce on my own tunnel, but from what I have gathered from others: the issue happens at random, and rebooting the tunnel resolves it temporarily, but it will resurface.

Unifying fact seems to be version 2024.6.1 and issues starting around 7/19

If it's an issue with Cloudflare Tunnel:

  1. Tunnel ID : Multiple
  2. cloudflared config:

Expected behavior Tunnel would connect to edge and work

Environment and versions

Logs and errors I don't have any of my own logs for this

Additional context No tunnel updates were released when this started as 2024.6.1 has been out around a month, but there is a large grouping of people with this starting around the same time of 7/19/2024

If anyone from the Cloudflare team checks this bug as well, feel free to hit me up internally.

dledfordcf commented 1 month ago

Please feel free to comment with the rough time the issue started, the cloudflared version used, and the way the tunnel is deployed (Docker, VM, bare metal, etc.). I admittedly am not able to do one-on-one troubleshooting for everyone, but I've let people know internally that multiple people started reporting this error recently and linked this GitHub bug, so it will be helpful for gathering information.

14wkinnersley commented 1 month ago

Issue started for me around 20JUL24 00:21 MT, which is when one of my monitors first detected an issue. I have Cloudflare Tunnels deployed using Docker image cloudflare/cloudflared:latest. Host: Ubuntu 22.04.3. All of my instances are on the latest version, 2024.6.1, and have been since that update was released. I didn't experience issues until recently. I have multiple instances of Cloudflare tunnels that have been running for a few years now with no issues until recently. I attempted to add the flag "--protocol http2" to my Docker setup to rule out quic issues, but that did not fix anything.

Logs:

2024-07-23T22:09:50Z INF Unregistered tunnel connection connIndex=1 event=0 ip=198.41.200.113
2024-07-23T22:09:50Z WRN Failed to serve quic connection error="timeout: no recent network activity" connIndex=1 event=0 ip=198.41.200.113
2024-07-23T22:09:50Z WRN Serve tunnel error error="timeout: no recent network activity" connIndex=1 event=0 ip=198.41.200.113
2024-07-23T22:09:50Z INF Retrying connection in up to 1s connIndex=1 event=0 ip=198.41.200.113
2024-07-23T22:09:52Z WRN Connection terminated error="timeout: no recent network activity" connIndex=1
2024-07-23T22:09:55Z INF Registered tunnel connection connIndex=1 connection=061f0765-2e24-421a-98df-dd49bda31254 event=0 ip=198.41.200.63 location=slc01 protocol=quic
2024-07-23T22:21:44Z INF Unregistered tunnel connection connIndex=1 event=0 ip=198.41.200.63
2024-07-23T22:21:44Z WRN Failed to serve quic connection error="timeout: no recent network activity" connIndex=1 event=0 ip=198.41.200.63
2024-07-23T22:21:44Z WRN Serve tunnel error error="timeout: no recent network activity" connIndex=1 event=0 ip=198.41.200.63
2024-07-23T22:21:44Z INF Retrying connection in up to 1s connIndex=1 event=0 ip=198.41.200.63
2024-07-23T22:21:46Z WRN Connection terminated error="timeout: no recent network activity" connIndex=1
2024-07-23T22:21:54Z INF Unregistered tunnel connection connIndex=2 event=0 ip=198.41.200.73
2024-07-23T22:21:54Z WRN Failed to serve quic connection error="timeout: no recent network activity" connIndex=2 event=0 ip=198.41.200.73
2024-07-23T22:21:54Z WRN Serve tunnel error error="timeout: no recent network activity" connIndex=2 event=0 ip=198.41.200.73
2024-07-23T22:21:54Z INF Retrying connection in up to 1s connIndex=2 event=0 ip=198.41.200.73
2024-07-23T22:21:56Z WRN Connection terminated error="timeout: no recent network activity" connIndex=2
2024-07-23T22:22:02Z INF Registered tunnel connection connIndex=1 connection=53437d5d-008b-46ea-99ef-c9aa8ff8ca6c event=0 ip=198.41.200.23 location=slc01 protocol=quic
2024-07-23T22:22:02Z INF Registered tunnel connection connIndex=2 connection=a74ddd5c-0401-4d47-af87-4751d9cc5001 event=0 ip=198.41.200.43 location=slc01 protocol=quic
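
To gauge how long each connection stays down in logs like the ones above, one option is to pair every "Unregistered tunnel connection" line with the next "Registered tunnel connection" line for the same connIndex. The sketch below is only an illustration that assumes the log layout shown in this comment; the function name `reconnect_gaps` and the regular expression are hypothetical helpers, not part of cloudflared.

```python
import re
from datetime import datetime

# Matches the INF lines that mark a connection going away or coming back.
# Assumes the timestamp and field layout seen in the excerpt above.
EVENT_RE = re.compile(
    r"^(?P<ts>\S+)\s+INF\s+(?P<event>Registered|Unregistered) "
    r"tunnel connection\b.*?\bconnIndex=(?P<idx>\d+)"
)


def reconnect_gaps(lines):
    """Return (connIndex, seconds_down) pairs in the order connections recover."""
    pending = {}  # connIndex -> time the connection was unregistered
    gaps = []
    for line in lines:
        match = EVENT_RE.match(line)
        if not match:
            continue  # WRN/ERR lines and unrelated messages are skipped
        ts = datetime.strptime(match["ts"], "%Y-%m-%dT%H:%M:%SZ")
        idx = int(match["idx"])
        if match["event"] == "Unregistered":
            # Keep the earliest unregister time if several arrive in a row.
            pending.setdefault(idx, ts)
        elif idx in pending:
            gaps.append((idx, (ts - pending.pop(idx)).total_seconds()))
    return gaps
```

Feeding it the lines of a saved log file (for example `reconnect_gaps(open(path))`, where `path` is wherever the container logs were exported) gives a quick view of whether the reconnects are getting slower over time.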

clickbg commented 1 month ago

Issue started for us ~4 days ago - 20.07.24. We are using multiple Docker-based tunnels distributed across various OSes, from Ubuntu to RPi OS. Most of them are on 2024.6.1, but some systems are on 2024.6.0. The issue is that after 1-2 days of uptime the tunnel starts disconnecting intermittently: some calls to the backend work and some fail with error code 524 (Timeout). We have noticed that when this happens, roughly 3 out of 10 HTTP calls fail. The workaround is to restart the tunnel. We have confirmed that it's not a network issue on our side. It also happens to multiple independent systems that are located in different datacenters. We have observed the issue happening at Hetzner Germany, OVH France and our own DC in Bulgaria on the same day, though not at the same time. Those providers have direct peering with Cloudflare and reported no network outages during that time.

Logs:

2024-07-22T19:31:40Z ERR  error="stream 301 canceled by remote with error code 0" connIndex=1 event=1 ingressRule=0 originService=https://REMOVED-nginx
2024-07-22T19:31:40Z ERR Request failed error="stream 301 canceled by remote with error code 0" connIndex=1 dest=https://REMOVED/apps/theming/favicon?v=0ade7c2c event=0 ip=198.41.200.33 type=http
2024-07-22T19:38:45Z ERR  error="stream 57 canceled by remote with error code 0" connIndex=3 event=1 ingressRule=0 originService=https://REMOVED-nginx
2024-07-22T19:38:45Z ERR Request failed error="stream 57 canceled by remote with error code 0" connIndex=3 dest=https://REMOVED/apps/theming/favicon?v=0ade7c2c event=0 ip=198.41.200.193 type=http
2024-07-22T19:38:45Z ERR  error="stream 313 canceled by remote with error code 0" connIndex=1 event=1 ingressRule=0 originService=https://REMOVED-nginx
2024-07-22T19:38:45Z ERR Request failed error="stream 313 canceled by remote with error code 0" connIndex=1 dest=https://REMOVED/apps/theming/icon?v=0ade7c2c event=0 ip=198.41.200.33 type=http
2024-07-22T19:38:55Z ERR  error="stream 61 canceled by remote with error code 0" connIndex=3 event=1 ingressRule=0 originService=https://REMOVED-nginx
2024-07-22T19:38:55Z ERR Request failed error="stream 61 canceled by remote with error code 0" connIndex=3 dest=https://REMOVED/apps/theming/favicon?v=0ade7c2c event=0 ip=198.41.200.193 type=http
2024-07-22T21:52:18Z ERR  error="stream 89 canceled by remote with error code 0" connIndex=3 event=1 ingressRule=0 originService=https://REMOVED-nginx
2024-07-22T21:52:18Z ERR Request failed error="stream 89 canceled by remote with error code 0" connIndex=3 dest=https://REMOVED/apps/theming/icon?v=0ade7c2c event=0 ip=198.41.200.193 type=http
2024-07-22T21:52:18Z ERR  error="stream 461 canceled by remote with error code 0" connIndex=1 event=1 ingressRule=0 originService=https://REMOVED-nginx
2024-07-22T21:52:18Z ERR Request failed error="stream 461 canceled by remote with error code 0" connIndex=1 dest=https://REMOVED/apps/theming/favicon?v=0ade7c2c event=0 ip=198.41.200.33 type=http
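
Each cancelled stream in the excerpt above appears twice (once with originService and once as "Request failed"), so a raw line count overstates the number of failed requests. A minimal sketch, assuming the key=value layout shown here, that parses the fields and counts distinct cancelled streams per connIndex; `parse_fields` and `cancelled_streams_per_conn` are hypothetical helper names, not cloudflared tooling.

```python
import re
from collections import Counter

# key=value pairs, where the value is either a double-quoted string or a
# run of non-space characters (assumed from the log lines in this thread).
FIELD_RE = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')
CANCEL_RE = re.compile(r"stream (\d+) canceled by remote with error code (\d+)")


def parse_fields(line):
    """Split one log line into a dict of its key=value fields."""
    return {key: value.strip('"') for key, value in FIELD_RE.findall(line)}


def cancelled_streams_per_conn(lines):
    """Count distinct cancelled stream IDs for each connIndex."""
    seen = set()
    for line in lines:
        fields = parse_fields(line)
        match = CANCEL_RE.search(fields.get("error", ""))
        if match and "connIndex" in fields:
            # The same (connIndex, stream ID) pair is logged twice per failure.
            seen.add((int(fields["connIndex"]), int(match.group(1))))
    return Counter(conn for conn, _stream in seen)
```

Deduplicating on the (connIndex, stream ID) pair is a simplification: it assumes stream IDs are not reused on the same connection index within the log being analysed, which may not hold across reconnects.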

We are also seeing:

2024-07-22T15:22:31Z WRN Failed to serve quic connection error="failed to accept QUIC stream: timeout: no recent network activity" connIndex=1 event=0 ip=198.41.200.23
2024-07-22T15:22:31Z WRN Serve tunnel error error="failed to accept QUIC stream: timeout: no recent network activity" connIndex=1 event=0 ip=198.41.200.23
2024-07-22T15:22:31Z INF Retrying connection in up to 1s connIndex=1 event=0 ip=198.41.200.23
2024-07-22T15:22:31Z WRN Connection terminated error="failed to accept QUIC stream: timeout: no recent network activity" connIndex=1
2024-07-22T15:22:33Z INF Unregistered tunnel connection connIndex=3 event=0 ip=198.41.200.13
2024-07-22T15:22:33Z WRN Failed to serve quic connection error="timeout: no recent network activity" connIndex=3 event=0 ip=198.41.200.13
2024-07-22T15:22:33Z WRN Serve tunnel error error="timeout: no recent network activity" connIndex=3 event=0 ip=198.41.200.13
2024-07-22T15:22:33Z INF Retrying connection in up to 1s connIndex=3 event=0 ip=198.41.200.13
2024-07-22T15:22:34Z WRN Connection terminated error="timeout: no recent network activity" connIndex=3

firecow commented 1 month ago

Our production and staging environments went down across two K8S clusters and eight Docker Swarm clusters in three different physical locations on the 19th of July and again this morning (24th of July).

Restarting our cloudflared system services and cloudflared containers helped.

The tunnel metrics /ready endpoint was exiting with code 0, indicating that there were no problems.

@DevinCarr @jcsf This needs to be addressed immediately.

I have also opened an enterprise support ticket, to make sure this gets some traction.

I don't think this is a problem with the cloudflared binary, since we see it across a wide array of cloudflared versions.

jfarre20 commented 1 month ago

same here across multiple tunnels.

OcifferAction commented 1 month ago

I utilize a Cloudflare tunnel on my home lab and I initially ran into this issue while on vacation last week. Had to VPN into my home network to restart the tunnel. I'm running a tunnel on Unraid utilizing this Docker repo: https://github.com/AriaGomes/Unraid-Cloudflared-Tunnel. I'm receiving the same error messages as everyone else.

jcsf commented 1 month ago

We are investigating on our side and will let you know once we have more information. Sorry for not having more information to provide right now.

leet4tari commented 1 month ago

Started Monday for us using multiple docker tunnels on version 2024.06.[0/1]

joakimlemb commented 1 month ago

Same issue here, running cloudflared in Docker with the following config:
Debian 12.6/Proxmox with kernel: 6.8.8-2-pve
Docker version 27.0.3, build 7d4bcd8

  cloudflare-tunnel:
    image: cloudflare/cloudflared
    hostname: cloudflare-tunnel
    restart: unless-stopped
    mem_limit: 1g
    network_mode: "host"
    user: 1000:1000
    command: tunnel run --protocol http2
    environment:
      - "TUNNEL_TOKEN=REDACTED"
    logging:
      options:
        max-size: "5m"

tamisoft commented 1 month ago

Same here. It started when the larger Cloudflare update was rolling out to different datacenters earlier this month. The phenomenon looks like this: the tunnel connects to the nearest datacenters (hel01, tll01), then randomly there will be: WRN Failed to serve quic connection error="timeout: no recent network activity" connIndex=1 event=0 ip=198.41.200.63 for all connected indices. The client then frees up the unused connections and retries connecting. BUT this time it never connects back to the nearest datacenters. Lately it'll connect to dme01 and dme06, then that fails the same way after a while, and then I normally land on rix01 for both. At this point the data moves painfully slowly, if at all. It is as if the client ignores the nearest, previously used datacenters and moves away from the physical location. I can imagine that if all the clients in the Nordic region act the same, rix01 would be pretty unhappy getting all the traffic. That connection has no second datacenter handle either, so all connection indices will be connected to rix01. I hope this is helpful @jcsf

cageyv commented 1 month ago

Same here. Works perfectly when connected to the fra datacenters, but super slow when connected to ham01 or arn02. Tunnel versions: from 2023.12 to 2024.3. A support ticket was opened.

cageyv commented 1 month ago

Looks better now. Tunnels were rerouted to other datacenters. No issues since. Support helped. If you have issues, I recommend opening a ticket.

daegalus commented 1 month ago

I have cloudflared running directly on a VM handling traffic to docker containers and services on other machines in the network. Over the last few days, I have been bombarded with alerts from my Uptime Kuma of 524 errors and timeouts on my external checkers.

When I check the logs, I just see a sea of stream closed with error 0 messages.

A reboot fixes it for a few hours, then it starts up again.

825i commented 1 month ago

I have cloudflared running directly on a VM handling traffic to docker containers and services on other machines in the network. Over the last few days, I have been bombarded with alerts from my Uptime Kuma of 524 errors and timeouts on my external checkers.

When I check the logs, I just see a sea of stream closed with error 0 messages.

A reboot fixes it for a few hours, then it starts up again.

Exact same issue. Seems this is hitting a LOT of people suddenly and probably thousands more who don't even know it's happening. This ticket should be changed to Priority "HIGH".

danparisd commented 1 month ago

We're seeing a similar issue. 2024.6.1 for all clients for us. Started 7/22/24 for us.

Mika- commented 1 month ago

My tunnel first went down on the 12th, and after that it's been working a couple of days at a time. Yesterday it only worked a couple of hours after restarting, so I tested reverting to an older version. Now, after almost a day on version 2024.4.1, I haven't (yet) seen any issues. Before these issues, the tunnel had been working without any problems for well over a year.

danparisd commented 1 month ago

Our connectors are set to use http2, anyone having this issue using quic?

tholland15 commented 1 month ago

Our connectors are set to use http2, anyone having this issue using quic?

Yes I only use quic and have this issue.

Siwus90 commented 1 month ago

Same here, HA 12.4, Cloudflared version: 5.1.15

jhult commented 1 month ago

Running a NixOS 23.11 VM with version 2024.1.5 directly installed via nix.

We noticed issues as early as July 17 but possibly even a few days earlier.

nperez0111 commented 1 month ago

I've also noticed this issue. It is affecting the stability of my tunnels with no obvious sign of issues. I wish there was a health check so that I could restart my tunnel when I have an issue like this.
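
Since an earlier comment in this thread reports that the metrics /ready endpoint kept indicating a healthy tunnel during the outage, one possible stopgap is a log-based check that flags the tunnel once the error messages quoted in this issue become frequent. The sketch below is only an illustration: the class name `ErrorRateWatchdog`, the threshold, and the window are hypothetical values to tune, and the actual restart is left to whatever supervisor runs the tunnel.

```python
from collections import deque
from datetime import datetime, timedelta

# Substrings taken from the log lines quoted in this thread.
MARKERS = (
    "canceled by remote with error code",
    "timeout: no recent network activity",
)


class ErrorRateWatchdog:
    """Flags a tunnel as unhealthy when too many errors fall in a time window."""

    def __init__(self, max_errors=20, window=timedelta(minutes=5)):
        # max_errors and window are assumed defaults, not recommended values.
        self.max_errors = max_errors
        self.window = window
        self._hits = deque()  # timestamps of recent matching error lines

    def observe(self, line):
        """Feed one log line; return True if a restart looks warranted."""
        ts_text = line.split(" ", 1)[0]
        try:
            ts = datetime.strptime(ts_text, "%Y-%m-%dT%H:%M:%SZ")
        except ValueError:
            return False  # not a timestamped log line
        if any(marker in line for marker in MARKERS):
            self._hits.append(ts)
        # Drop errors that have aged out of the sliding window.
        while self._hits and ts - self._hits[0] > self.window:
            self._hits.popleft()
        return len(self._hits) >= self.max_errors
```

A caller would feed it lines as they arrive (for example from a followed log stream) and trigger its own restart mechanism when `observe` returns True. Using a threshold rather than reacting to a single line avoids restarting on isolated errors.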

jhult commented 1 month ago

@dledfordcf, any news from the inside team(s)?

DevinCarr commented 1 month ago

At this time, the impact should no longer be visible. We had made a change on the edge that caused a small amount of QUIC packets to be routed and dropped for some cloudflared tunnel connections. This is the reason why many of your cloudflared logs mentioned timeouts and remote/local closing the stream connections.

This change has been rolled back and your tunnels should go back to normal without any change on your part.

However, please keep in mind that you may still occasionally see the error message in your cloudflared logs: # stream #### canceled by remote with error code 0. This can happen from a varying set of reasons, such as:

Thank you for your patience as we investigated this.

825i commented 1 month ago

I'm still seeing this problem. I realise that you said we'll still sometimes see it.

Does this really count as a fix then? At least I guess I'll have to wait and see.

alby258 commented 1 month ago

Same for me. The problem still occurs every hour.

morpig commented 1 month ago

However, please keep in mind that you may still occasionally see the error message in your cloudflared logs: # stream #### canceled by remote with error code 0.

per your last comment @DevinCarr, is it possible to make it silent/only shown on deeper log levels?

I don't think this is shown at the general log level on web servers such as nginx (eyeball early disconnects, etc.). Please correct me if I'm wrong.

LewisSpring commented 1 week ago

Hi. Still getting this issue quite badly.

2024-08-27T22:37:48Z ERR Request failed error="stream 529 canceled by remote with error code 0" connIndex=3 dest=https://example.com/1080p.mp4 event=0 ip=198.41.192.107 type=http
2024-08-27T22:37:48Z ERR  error="stream 533 canceled by remote with error code 0" connIndex=3 event=1 ingressRule=1 originService=http://172.20.0.3:9005
2024-08-27T22:37:48Z ERR Request failed error="stream 533 canceled by remote with error code 0" connIndex=3 dest=https://example.com/1080p.mp4 event=0 ip=198.41.192.107 type=http
2024-08-27T22:37:54Z ERR  error="stream 537 canceled by remote with error code 0" connIndex=3 event=1 ingressRule=1 originService=http://172.20.0.3:9005
2024-08-27T22:37:54Z ERR Request failed error="stream 537 canceled by remote with error code 0" connIndex=3 dest=https://example.com/1080p.mp4 event=0 ip=198.41.192.107 type=http
2024-08-27T22:37:55Z ERR  error="stream 541 canceled by remote with error code 0" connIndex=3 event=1 ingressRule=1 originService=http://172.20.0.3:9005
2024-08-27T22:37:55Z ERR Request failed error="stream 541 canceled by remote with error code 0" connIndex=3 dest=https://example.com/1080p.mp4 event=0 ip=198.41.192.107 type=http
2024-08-27T22:38:01Z ERR  error="stream 545 canceled by remote with error code 0" connIndex=3 event=1 ingressRule=1 originService=http://172.20.0.3:9005
2024-08-27T22:38:01Z ERR Request failed error="stream 545 canceled by remote with error code 0" connIndex=3 dest=https://example.com/1080p.mp4 event=0 ip=198.41.192.107 type=http
2024-08-27T22:38:01Z ERR  error="stream 549 canceled by remote with error code 0" connIndex=3 event=1 ingressRule=1 originService=http://172.20.0.3:9005
2024-08-27T22:38:01Z ERR Request failed error="stream 549 canceled by remote with error code 0" connIndex=3 dest=https://example.com/1080p.mp4 event=0 ip=198.41.192.107 type=http

Anything I can provide for further investigation?