argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0
17.47k stars · 5.32k forks

ArgoCD very slow on initial load after upgrade to 2.3.3 from 2.2.5 #9333

Open bo0ts opened 2 years ago

bo0ts commented 2 years ago

Describe the bug

After the upgrade from 2.2.5 to 2.3.3, the ArgoCD web interface takes a very long time to load. We have two ArgoCD instances with a small number of apps (~50 per instance) and both exhibit this behavior. When looking at the network traffic in a developer console, the main problem seems to be main.c7ea22e999b3805bc676.js.

I could reproduce the problem on Chrome and Firefox.

None of the pods running ArgoCD show signs of resource starvation.

Screenshots

(screenshot omitted)

Version

We deploy ArgoCD using the ArgoCD Operator on OKD 4.9. The ArgoCD 2.3.3 update was triggered by the operator update to version 0.3.0.

argocd: v2.3.3+07ac038
  BuildDate: 2022-03-30T00:06:18Z
  GitCommit: 07ac038a8f97a93b401e824550f0505400a8c84e
  GitTreeState: clean
  GoVersion: go1.17.6
  Compiler: gc
  Platform: linux/amd64
Maxcoder-net commented 2 years ago

Hi,

When did this slowness start? Is ArgoCD slow every day? Are other users also experiencing the same slowness?

bo0ts commented 2 years ago

@Maxcoder-net Like I said, since the upgrade to 2.3.3 from 2.2.5. The problem appears when doing a full reload, but not while navigating in the app. It happens for every user on Firefox and Chrome (we haven't tested other browsers).

viggin543 commented 2 years ago

Happens to me too! (screenshot omitted)

thnk2wn commented 2 years ago

Maybe it's a different issue but I'm seeing the same behavior with v2.2.2+03b17e0. If I haven't hit the website in a bit it can take up to a minute to load. Once I'm there it's fast doing what I need.

crenshaw-dev commented 2 years ago

The difference in the timings is surprising. @bo0ts sees 9s, and @viggin543 sees 3min.

This is gonna be pretty difficult to diagnose. It involves the API server, potentially an ingress, the network between that and your client, the client itself... A packet trace and/or logs from each of those components would be a start.

thnk2wn commented 2 years ago

Appears to be main.js for me as well; I can try to look into logs...

(screenshot omitted)

crenshaw-dev commented 2 years ago

This is reproducible on every refresh (or maybe hard refresh)? Do you happen to know if the API server is under heavy CPU and/or network load?

thnk2wn commented 2 years ago

This is reproducible on every refresh (or maybe hard refresh)? Do you happen to know if the API server is under heavy CPU and/or network load?

I thought it was only when logged out but it does seem to be the same amount of time on a page refresh (normal or hard).

I didn't set up Argo and am not too familiar with its internals, but if this helps:

(screenshot omitted)

It's also strange, as some members of our team aren't seeing it. They're on the west coast, I'm on the east coast, and Argo is deployed to us-east-1.

thnk2wn commented 2 years ago

A log file if it helps: argocd-server-69fdcc9dc8-cjmkv.log. Not sure what other component logs or steps would help.

crenshaw-dev commented 2 years ago

Low load. Looks like JS requests aren't logged, so that's not really gonna help us tell whether the API server is to blame. The fact that it's different in different locations makes me want to blame the network, but I don't want to jump to conclusions.

I think my next step would be running a packet trace on the client and on the API server. If the clocks are reasonably well synced, you can compare when the data packets arrive on the client vs. when the ACK packets make it back to the server to tell whether the network is to blame.
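As a sketch of the packet-trace comparison suggested above (pod name, hostname, port, and the sample timestamps are all placeholders, not from this thread):

```shell
# Capture on the API server side (assumes node or debug-container access with
# tcpdump installed; pod name and port are placeholders):
#   kubectl -n argocd exec argocd-server-xxxx -- tcpdump -i any -w server.pcap 'port 8080'
# And on the client:
#   tcpdump -i any -w client.pcap 'host argocd.example.com'
# With reasonably synced clocks, compare when a data packet left the server
# vs. when its ACK arrived back. As a toy illustration of the arithmetic,
# subtract two epoch timestamps (sample values) with awk:
server_sent=1655822282.100
client_acked=1655822282.850
awk -v s="$server_sent" -v c="$client_acked" 'BEGIN { printf "delay: %.3fs\n", c - s }'
```

If that delay is large while the server-side capture shows packets leaving promptly, the network (or an ingress in between) is the more likely culprit than the API server itself.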

Thinking back to my web perf days, I guess it's also possible that a lot of packets are being dropped, forcing the TCP congestion window (and therefore effective throughput) to stay super low.

bo0ts commented 2 years ago

@crenshaw-dev I would be happy to help, but this issue randomly disappeared in our instances. API timings across all involved clusters did not change during that time, and networking in all other deployed components remained healthy. Sorry :(

thnk2wn commented 1 year ago

This was mostly fixed for me by turning on gzip compression (not sure why it wasn't on by default): https://github.com/argoproj/argo-cd/discussions/10238#discussioncomment-3942411

It can still take 6-9 seconds or so to initially load the page but much better than a minute.
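For reference, the fix in the linked discussion amounts to passing a gzip flag to the API server. The patch below is one possible way to apply it, assuming the standard `argocd` namespace and deployment name; verify the flag against your version's `argocd-server --help` before using it. The local demo afterwards just illustrates why compression matters for JS bundles:

```shell
# Assumed layout: argocd-server Deployment in the argocd namespace, with the
# flag appended to the container args:
#   kubectl -n argocd patch deployment argocd-server --type json \
#     -p '[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--enable-gzip"}]'
# Why it helps: JS bundles are highly compressible. Local illustration:
cd "$(mktemp -d)"
printf 'var x = 1;\n%.0s' $(seq 1 1000) > bundle.js   # ~11 KB of repetitive JS
gzip -c bundle.js > bundle.js.gz
orig=$(wc -c < bundle.js)
comp=$(wc -c < bundle.js.gz)
echo "original: $orig bytes, gzipped: $comp bytes"
```

Real bundles don't compress quite this dramatically, but minified JS still commonly shrinks several-fold under gzip, which directly cuts transfer time on slow links.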

crenshaw-dev commented 1 year ago

Yeah, that's still absurdly slow. But the fact that gzip helped is interesting.

jlongo-encora commented 1 year ago

Any updates on this one? I'm able to reproduce it on v2.4.14

crenshaw-dev commented 1 year ago

@jlongo-encora I think we need more details about the problem. So far, I think it's unclear whether Argo CD is misbehaving (sending stuff over the network too slowly) or if network conditions are the problem (or some combination of both).

One helpful piece of information would be seeing which assets are taking so long to transfer and at what rate they're being transferred. We could compare transfer rate of, say, the main JS bundle to some other network response.
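One way to gather those per-asset numbers is curl's `-w` write-out format, which separates time-to-first-byte (server-side latency) from total time (transfer rate). `ARGOCD_URL` is a placeholder for your host; the `file://` demo at the end just makes the command self-contained:

```shell
# Against the real server (placeholder URL):
#   curl -o /dev/null -s \
#     -w 'ttfb=%{time_starttransfer}s total=%{time_total}s speed=%{speed_download}B/s\n' \
#     "$ARGOCD_URL/main.js"
# Self-contained demonstration of the -w format against a local file:// URL:
cd "$(mktemp -d)"
echo 'console.log("hello")' > asset.js
curl -s -o /dev/null -w 'size=%{size_download}\n' "file://$PWD/asset.js"
```

Comparing `ttfb` and `speed` for the main JS bundle against a small API response (e.g. userinfo) would show whether the slowness is per-request latency or raw throughput.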

jlongo-encora commented 1 year ago

@crenshaw-dev

(screenshot omitted)

My internet connection is fast when opening other sites.

crenshaw-dev commented 1 year ago

This is wild. The tiny extensions.js, a static asset, takes 9s. Two different userinfo responses have very different response times.

Is this API server pod under a lot of load? Is it possible CPU throttling is monkeying with response times?

jlongo-encora commented 1 year ago

@crenshaw-dev we have ~350 applications. I'm not sure about the CPU thing. I'm using Chrome. Let me try with another browser

crenshaw-dev commented 1 year ago

Actually, extensions.js isn't quite static. So it could be a victim of throttling.

@jlongo-encora throttling won't be a direct function of the number of applications. It happens server-side rather than in your browser. When a Kubernetes Pod uses more CPU than its configured limits, Kubernetes will "pause" CPU activity to avoid letting the Pod take too much time from other Pods on the node.
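The throttling described above is recorded by the kernel in the container's cgroup CPU stats. A hedged way to check it (cgroup v1 path shown; pod reference is a placeholder, and the sample file below is illustrative, not from this thread):

```shell
# On the live pod (under cgroup v2 the file is /sys/fs/cgroup/cpu.stat instead):
#   kubectl -n argocd exec deploy/argocd-server -- cat /sys/fs/cgroup/cpu/cpu.stat
# Sample contents and how to read them: nr_throttled counts CFS periods in
# which the kernel paused the container for exceeding its CPU limit.
cd "$(mktemp -d)"
cat > cpu.stat.sample <<'EOF'
nr_periods 12000
nr_throttled 3400
throttled_time 98000000000
EOF
awk '/^nr_throttled/ { print ($2 > 0 ? "pod was throttled" : "no throttling") }' cpu.stat.sample
```

A large `nr_throttled` relative to `nr_periods` (or a growing `throttled_time`, in nanoseconds) would support the CPU-limit theory; raising or removing the limit would then be the experiment to run.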

jlongo-encora commented 1 year ago

@crenshaw-dev ok thx for the explanation

crenshaw-dev commented 1 year ago

I think kubectl has the ability to show CPU usage. I always use Grafana, which my team set up to monitor our stuff. Unfortunately I don't know the details of that setup.

thilinajayanath commented 2 months ago

I had the same issue where a 600-ish KB main.cxxx....js file was taking over 50 seconds to load; it was fixed after I cleared the browser cache.
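If a stale cached bundle is suspected, one thing worth checking is which cache headers the server sends for the JS asset. The URL and hash below are placeholders, and the sample headers are illustrative only:

```shell
# Against the real server (placeholders):
#   curl -sI "$ARGOCD_URL/main.<hash>.js" | grep -i -E 'cache-control|etag'
# Sample response headers and the same filter, so the grep is demonstrable:
cd "$(mktemp -d)"
cat > headers.sample <<'EOF'
HTTP/2 200
content-type: application/javascript
cache-control: no-cache
etag: "abc123"
EOF
grep -i -E 'cache-control|etag' headers.sample
```

Since the bundle filename embeds a content hash, a correctly cached copy should never be stale; if clearing the cache helps, the headers (or an intermediary cache) are worth a look. A hard refresh (Ctrl+Shift+R) or DevTools "Disable cache" gives a quick test without clearing everything.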