argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

3.5 UI infinite SSE creation + timeout on Chrome #12626

Closed gpamit closed 6 months ago

gpamit commented 7 months ago

Pre-requisites

What happened/what did you expect to happen?

The Argo WF GUI is not stable or responsive. Most of the time it is slow to load pages and runs into errors when clicking to show templates or workflows:

404 failed (timeout: https://[Link to ArgoWF]/monaco-editor.51a434cf513f802ff42b.js))
Stack Trace
ChunkLoadError: Loading chunk 404 failed.
....

This issue is frequent in Chrome; Firefox, on the other hand, loads everything fine most of the time.

Expected: Page should load without any error.

Versions:

Chrome: 121.0.6167.139 (Official Build) (x86_64)
Firefox: 122.0 (64-bit)

Version

3.5.4

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

NA

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

NA

Logs from in your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded

NA
agilgur5 commented 7 months ago
404 failed (timeout: https://[Link to ArgoWF]/monaco-editor.51a434cf513f802ff42b.js))
Stack Trace
ChunkLoadError: Loading chunk 404 failed.

So this is specifically after the code-split of Monaco Editor in #12150. But if you're getting a timeout on that, you would've gotten a timeout on every page before that PR.

This issue is frequent in Chrome, on the other hand Firefox is loading everything fine most of the time.

This sounds like it could possibly be an issue with your device, as Firefox is more performant than Chrome.

What are the specs of your device? Is it low-end? If you have NPM installed, you can run npx envinfo --system --browsers to provide that information quickly in an easy-to-read format.

Also, how performant is your network connection -- what is your download speed in Mbps?

I can't reproduce this and the UI has a lot of usage, so there isn't much actionable in the issue as-is. If this is due to low-end specs or a poor network connection, continuing with #12059 will help with that, but unfortunately we can't split Monaco further than it already is. It's huge (it is single-handedly larger than the rest of the UI, including all other deps). The only other thing to do would be to replace it with a smaller and more efficient dep (which I did mention in the issue).

gpamit commented 7 months ago

@agilgur5 Thank you for looking into this. I can confirm this is a problem for all my colleagues.

Here is the system information:

 System:
    OS: macOS 14.3
    CPU: (12) x64 Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
    Memory: 6.18 GB / 32.00 GB
    Shell: 5.9 - /bin/zsh
  Browsers:
    Chrome: 121.0.6167.139
    Safari: 17.3

I get a decent download speed of 150 Mbps. Since it's happening with everyone in my company, it seems like a bug.

agilgur5 commented 7 months ago
    CPU: (12) x64 Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
    Memory: 6.18 GB / 32.00 GB

I get a decent download speed of 150 Mbps.

Huh, you're definitely on a (very) performant machine and performant network connection, more performant than my own in both cases too, where I can't reproduce this.

Since it's happening with everyone in my company, it seems like a bug.

But it's not happening to all users (otherwise it would have been reported much earlier and have many +1s; also other contributors and I would've noticed).

If it's everyone in your company, I wonder if it's a VPN or proxy issue? Or the configuration of your Ingress in front of the Server? (For example, before I set up SSO, I used oauth2-proxy as a workaround, and that was a dramatic performance bottleneck.) Or, for that matter, any other proxies in between the connection from the client to the Server?

bradleyboveinis commented 7 months ago

My colleagues and myself started experiencing this error today.

Win 11 Pro, 13900K, 128 GB RAM, 1 Gbps down connection. Chrome version 121.0.6167.140 (Official Build) (64-bit). The instance is accessed through Teleport VPN.

So far it only occurs for us on one instance; other team instances appear fine.

agilgur5 commented 7 months ago

started experiencing this error today.

The timeout? Did you do an update recently? What version are you on? A lot more information and context would be helpful.

Again there isn't really anything actionable in the issue as-is. The one timeout error reported so far is for a dependency (Monaco) which is already as split as it can be, and so is no longer in Argo's control to shrink.

So far it only occurs for us on one instance; other team instances appear fine.

Yes, that would further suggest that a specific configuration is causing this and not a generic "UI is slow". If that configuration is supposed to be performant but is actually slow, then a reproducible config environment would be the minimum necessary to debug anything or have something actionable. The specific pages and API calls that are slow would additionally be very helpful in such a case.

bradleyboveinis commented 7 months ago

We've been on v3.5.2 for about 2 months. Our first report of the timeout issue came through today. We're still working on figuring out the cause, as it's hard to reproduce. Previously it was consistently breaking in Chrome; however, it now appears to work occasionally.

What I have observed is that it only appears to happen in Chrome for me, but it works fine in Firefox. There appear to be stuck (pending) calls in Chrome

e.g.:

https://<uri>/api/v1/workflow-events/<ns>?listOptions.fieldSelector=metadata.namespace=mercury&listOptions.resourceVersion=1249507011&fields=result.object.metadata.name,result.object.metadata.namespace,result.object.metadata.resourceVersion,result.object.metadata.creationTimestamp,result.object.metadata.uid,result.object.status.finishedAt,result.object.status.phase,result.object.status.message,result.object.status.startedAt,result.object.status.estimatedDuration,result.object.status.progress,result.type,result.object.metadata.labels,result.object.metadata.annotations,result.object.spec.suspend

[screenshots of the Chrome network tab showing the stuck (pending) requests]

Firefox, on the other hand, looks to be using polling, whereas Chrome appears to be trying to maintain a number of persistent connections.

At this point, I suspect it could be Teleport VPN or another security feature preventing too many persistent connections from being opened.


Update:

Once I figured this out, I was able to replicate it on other instances previously thought to be unaffected:

It appears as though there is a hard limit of 6 connections per host in Chrome; the UI hits that limit after about a minute or so of viewing the workflows tab. This then leads to the slowness/timeouts when trying to navigate to other tabs such as the workflow templates tab. I believe this is because subsequent requests are queued and sit in the pending state in perpetuity (as Chrome keeps all 6 connections to the host open).
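To make the mechanism concrete, here is a minimal sketch of the failure mode using plain browser APIs (illustrative only, not the Argo UI's actual SSE code; the endpoint path is adapted from the request quoted above):

```typescript
// Illustrative only: each open EventSource holds one HTTP/1.1 connection to the
// host. If a page opens streams like this and never closes them on navigation,
// six leaked streams are enough to hit Chrome's per-host limit, after which
// every later request to the same host just sits in "pending".
const source = new EventSource(
    '/api/v1/workflow-events/mercury?listOptions.fieldSelector=metadata.namespace=mercury'
);
source.onmessage = event => {
    console.log('workflow event', JSON.parse(event.data));
};

// The cleanup that has to run when the user navigates away:
function stopWatching() {
    source.close();
}
```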

We also tested this after removing Teleport from the equation, to rule out the possibility that it is causing the issue, and were still able to replicate the behavior.

bradleyboveinis commented 7 months ago

To replicate:

  1. In Chrome, open the network tab
  2. Log into Argo and navigate to /workflows
  3. Wait until you have 6 persistent connections open. You should see something like this: [screenshot of the Chrome network tab]
  4. Try to navigate elsewhere (workflow templates)

Outcome: the requests in the network tab enter a pending state but are never processed due to the 6 open connections, eventually timing out. The UI then shows the 404 as per the original post.

stefansedich commented 7 months ago

@agilgur5 I work with @bradleyboveinis and am adding some more info here:

  1. I ruled out our proxy (Teleport) by port-forwarding directly to the argo-server pod and still saw the issue
  2. I downgraded to 3.4.x and could not replicate the issue

One more caveat in our environment is that we run with --secure=false, which results in the server serving HTTP/1; that would subject us to the lower browser SSE connection limits and could potentially explain the issue we are seeing.
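One quick way to confirm which protocol the browser actually negotiated (assuming the UI and API are served from the same origin) is to inspect the resource timing entries; a rough sketch in TypeScript, so drop the type cast if pasting directly into the console:

```typescript
// Browser console: report the negotiated protocol for recent API requests.
// "http/1.1" means the ~6-connections-per-host limit applies to SSE streams;
// "h2" multiplexes them over a single connection.
const entries = performance.getEntriesByType('resource') as PerformanceResourceTiming[];
for (const entry of entries) {
    if (entry.name.includes('/api/v1/')) {
        console.log(entry.nextHopProtocol, entry.name);
    }
}
```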

I know the UI changed between 3.4.x and 3.5.x to move to the unified workflows view; did this change how the SSEs are managed and how many may be used when viewing the workflows list?

agilgur5 commented 6 months ago

Update here: I seem to have found the main root cause in https://github.com/argoproj/argo-workflows/issues/12663#issuecomment-1947801740. That issue seems duplicative, or at least the root cause seems identical, but I'm not sure if the symptoms are exactly the same without more information from the user. I'll have a fix out shortly that should land in the next patch release. EDIT: Fixed by #12672

Note that, as I wrote there, I was only able to partially reproduce this when enabling pagination and moving between pages; I was not able to reproduce it when doing nothing. I also only got two SSEs per page move, which would mostly stack on top of each other, but there were never more than two per page move. The root cause is likely the same, although I'm not sure how this infinite loop occurred, as I couldn't repro it. I'm also not sure if this is the same as OP's issue -- network saturation or browser connection limits could cause any networking to slow down (or grind to a halt), but those wouldn't happen on the Workflow Details page/editing sidebar. But yes, if they were on the Workflows List page first and then moved to those, the SSEs may have persisted (per the comment I linked, I couldn't figure out how they were persisting, as the code specifically stops them and I confirmed it was running without errors).

Big thanks for all the details you both added here @bradleyboveinis and @stefansedich , as well as @alelapi in #12663! Those were all vital to partially reproducing it and figuring out what was going on 🙂

agilgur5 commented 6 months ago

Responding to some questions & comments below:

It appears as though there is a hard limit of 6 connections per host in chrome

Yea, that's a super rare limit I've occasionally run into, usually only when you have an app that works with many tabs. There should only be one or two open connections for the Argo UI, so that was surprising to see. It makes sense based on the rest though -- nice job noticing that!

  1. Log into argo and navigate to /workflows
  2. Wait until you have 6 persistent connections open. You should see something like this:

This I could not fully reproduce, as I was only able to have two connections open at a time, and only when moving between pages. But some old ones would remain after moving pages, causing a very similar effect of eventually hitting the connection limit, etc.

One more caveat in our environment is that we run with --secure=false, which results in the server serving HTTP/1; that would subject us to the lower browser SSE connection limits and could potentially explain the issue we are seeing.

The SSEs not cancelling for some reason seems to be latency sensitive, so your usage might also have made the non-cancellation happen more frequently.

I know the UI changed between 3.4.x and 3.5.x to move to the unified workflows view; did this change how the SSEs are managed and how many may be used when viewing the workflows list?

It did not, but I also landed a large refactor to that page in 3.5.0 in #11891. That PR actually fixed a few subtle bugs and optimized a bunch, including significantly reducing the network activity, as there were many unnecessary / duplicative requests being made before I refactored it (e.g. many list requests even though there's a ListWatch already, new lists and ListWatches when no parts of the request changed, etc). Ironically, those optimizations may have actually made the partly latency-sensitive non-cancellation happen more frequently -- the previous version may have had enough UI latency that it wouldn't hit the non-cancellation issue 🙃
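As a rough sketch of the pattern described above (hypothetical names, not the actual Workflows List code): key the effect on the serialized request so the watch is only torn down and recreated when the request itself changes, not on every render.

```typescript
import {useEffect, useMemo} from 'react';

// Hypothetical hook, not the real Workflows List implementation: open one SSE
// stream per request and only recreate it when the request itself changes.
function useWorkflowListWatch(namespace: string, labelSelector: string) {
    // Stable key so the effect below re-runs only when the request changes.
    const requestKey = useMemo(() => JSON.stringify({namespace, labelSelector}), [namespace, labelSelector]);

    useEffect(() => {
        const source = new EventSource(
            `/api/v1/workflow-events/${namespace}?listOptions.labelSelector=${encodeURIComponent(labelSelector)}`
        );
        source.onmessage = event => console.log('workflow event', JSON.parse(event.data));
        // The cleanup is what should prevent old SSE connections from piling up
        // when filters or pages change.
        return () => source.close();
        // eslint-disable-next-line react-hooks/exhaustive-deps
    }, [requestKey]);
}
```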

Unfortunately, I've also found 3 less-than-one-liner bugs in that refactor, including this one. One was a typo (https://github.com/argoproj/argo-workflows/issues/12663#issuecomment-1942137704), and then this one and the other one (#12562) were super nuanced, both being ref issues (a recursive ref here, a stale ref there). The other one also had historical codebase context/non-React usage that I didn't know about, and this one had the infinite loop that I haven't been able to repro, as well as the SSEs not being cancelled despite the UI cancellation code running. Really disorienting bugs to root cause, especially when the fixes are only about 10 characters long 😅
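For readers unfamiliar with this class of bug, here is a contrived example (not the actual Argo UI code) of a stale-ref issue in a React hook, where the fix really is only a handful of characters:

```typescript
import {useEffect, useRef} from 'react';

// Contrived stale-ref example: a long-lived callback reads state through a ref
// so that it always sees the latest value.
function usePolling(filters: string) {
    const filtersRef = useRef(filters);
    // The tiny fix: keep the ref in sync on every render. Without this line,
    // the interval below forever sees the first render's filters.
    filtersRef.current = filters;

    useEffect(() => {
        const id = setInterval(() => {
            console.log('polling with filters:', filtersRef.current);
        }, 5000);
        return () => clearInterval(id);
    }, []); // intentionally mount-only; the ref carries fresh state into it
}
```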

I also missed them in testing as they don't always pop up, requiring a certain configuration and potentially a race too. We definitely could use a lot more automated UI tests, though typically networking is mocked in those, so even that still might not catch these kinds of issues. It may require E2E UI tests to catch, but we already have quite a lot of E2E Controller, API, and CLI tests that can take some time to run (~10-25 min per test suite), so I'm a bit hesitant to continue adding E2Es specifically 😕

agilgur5 commented 6 months ago

Regarding root causing and reproducing the infinite loop, I'm curious if there's maybe a stale cache or something causing that? That was something I was looking for more info on in https://github.com/argoproj/argo-workflows/issues/12663#issuecomment-1944425914. In local dev, I clear my caches with some frequency, so I wonder if that's why I haven't been able to repro it.

In particular, the localStorage that is used to initialize the list filters is the main suspect, but cookies and other caches could have some effect too. For anyone here, does this still reproduce after clearing all caches?
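For anyone trying the stale-cache theory, a rough way to clear the UI's persisted state without wiping cookies is something like the following, run in the browser console on the Argo UI origin (the key matching is a guess; inspect DevTools > Application > Local Storage to see the real key names first):

```typescript
// Remove persisted Argo UI state from localStorage. The substring filter is a
// guess, not the actual key names used by the UI.
for (const key of Object.keys(localStorage)) {
    if (key.toLowerCase().includes('workflow') || key.toLowerCase().includes('options')) {
        console.log('removing cached UI state:', key, localStorage.getItem(key));
        localStorage.removeItem(key);
    }
}
```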