EngineerBetter / control-tower

Deploy and operate Concourse CI in a single command
https://www.engineerbetter.com
Apache License 2.0

Web loading is extremely slow #38

Open acatxnamedvirtue opened 5 years ago

acatxnamedvirtue commented 5 years ago

Hello there!

We use control-tower to deploy our Concourse instance to AWS and we absolutely love it. However, as we add more jobs and more pipelines, we are experiencing super slow page load times.

We're currently deploying with these flags: `--iaas aws --region us-east-1 --workers 4 --worker-type m5 --worker-size 4xlarge --web-size 2xlarge`

And even though the web-size is 2xlarge, it's still very slow (3-6s page load times). From looking in the network tab, this is mostly coming from the "pipelines" and "jobs" calls. We could split pipelines out to separate teams, but since we're a fairly small company (100ish engineers) we appreciate the pipeline visibility, especially during on-call rotations where quickly redeploying a last known version is helpful. We could also start spinning up new control-tower concourse deployments to various sub-domains, but that's a little annoying from a management perspective.
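
To see where the time goes independently of the browser, here is a minimal timing sketch (the external URL and bearer token are placeholders; the two paths are the endpoints that show up as the "pipelines" and "jobs" calls in the network tab):

```python
import time

import requests  # third-party: pip install requests

CONCOURSE_URL = "https://ci.example.com"  # placeholder external URL
TOKEN = "..."                             # bearer token, e.g. copied from ~/.flyrc

# The two dashboard calls that dominate the page load in the network tab.
for path in ("/api/v1/pipelines", "/api/v1/jobs"):
    start = time.monotonic()
    resp = requests.get(
        CONCOURSE_URL + path,
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=30,
    )
    elapsed = time.monotonic() - start
    print(f"{path}: {resp.status_code}, {len(resp.content)} bytes in {elapsed:.2f}s")
```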

We're wondering if you have any insight into this, or if you're planning to add larger options for the maximum web node size (t3s would be particularly nice, but beefier instances would be great too), or maybe it's just time for us to figure out the BOSH deployment on our own :)

Thanks for your help!

Tyler Beebe, Software Engineer, Meetup

DanielJonesEB commented 5 years ago

Hi @acatxnamedvirtue, thanks for reaching out.

Right now we don't have any plans to add new instance types, although it doesn't sound like a massive change (famous last words; it gets tricky when instance types aren't supported in all zones).

The latest release of Control Tower has an improved metrics dashboard courtesy of @gerhard. It doesn't yet show database metrics, but it might help you dig deeper into where the issues are arising.

gerhard commented 5 years ago

@acatxnamedvirtue it's most likely network latency / network throughput / disk IOPS on the db. It may be CPU contention on the web instances, but this is less likely. Without metrics, it's just a guess.

This is a real-world example of the Concourse Dashboard that @DanielJonesEB mentioned. In RabbitMQ's Concourse case, the db is constantly averaging ~115Mbps of outgoing network traffic (bursts of up to 225Mbps). My suspicion is that this is your bottleneck, especially if you are using a managed DB instance.

[screenshot: Concourse dashboard showing the DB's outgoing network traffic]

API Response Duration is also worth looking at:

[screenshot: API Response Duration panels]

crsimmons commented 5 years ago

It's worth noting that our implementation of Gerhard's dashboard doesn't have the DB metrics graphs because we use RDS/CloudSQL rather than a BOSH-deployed VM. You may be able to get the DB network metrics from the IaaS though.
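
For an RDS-backed deployment, a hedged sketch of pulling the equivalent metrics from CloudWatch with boto3 (the DB instance identifier is hypothetical; `NetworkTransmitThroughput`, `ReadIOPS` and `WriteIOPS` are standard `AWS/RDS` metrics):

```python
from datetime import datetime, timedelta, timezone

import boto3  # requires AWS credentials with CloudWatch read access

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
end = datetime.now(timezone.utc)
start = end - timedelta(hours=6)

for metric in ("NetworkTransmitThroughput", "ReadIOPS", "WriteIOPS"):
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName=metric,
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "my-concourse-db"}],
        StartTime=start,
        EndTime=end,
        Period=300,  # 5-minute buckets
        Statistics=["Average", "Maximum"],
    )
    for p in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
        print(f"{metric} {p['Timestamp']:%H:%M} "
              f"avg={p['Average']:.0f} max={p['Maximum']:.0f}")
```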

acatxnamedvirtue commented 5 years ago

Ah, thanks for the pointers! I was able to take this and try a couple of things out. Things I learned:

  1. Throwing bigger machines at the web node did nothing to help, as you said.
  2. I redeployed with the newest control-tower and got access to that sweet new dashboard (thanks @gerhard, this is the dashboard I've been wanting for ages!). API Duration was wild, oscillating between 10ms and 10s for all calls.
  3. I reached out in Concourse's Discord for some help and saw a similar message from someone looking for ideas on how to optimize performance, so I decided to give bumping up the DB machine a shot. I figured I'd do it manually first via the AWS console, where I saw a warning along the lines of "Provisioning less than 100GiB may cause poor IOPS performance". So I did two things: I bumped the DB storage size up to 100GiB and bumped the instance type up to an m5.large (see the sketch after this list).
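
For context on why that 100GiB threshold matters: gp2 storage gets a baseline of 3 IOPS per GiB with a 100-IOPS floor, so a small volume has very little sustained I/O once its burst credits run out. A quick sketch of the arithmetic (the smaller "before" sizes are just illustrative assumptions, not control-tower's actual default):

```python
def gp2_baseline_iops(size_gib: int) -> int:
    """Baseline IOPS for a gp2 volume: 3 IOPS per GiB, with a 100 IOPS floor."""
    return max(100, 3 * size_gib)

# Illustrative sizes only; small volumes can burst higher while credits last.
for size in (10, 20, 100, 334):
    print(f"{size:>4} GiB -> {gp2_baseline_iops(size):>4} baseline IOPS")
```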

Here's the performance increase I saw in the API duration almost immediately afterwards:

[screenshot: API duration graph after the change]

The web UI is SO FAST NOW, which is super exciting!!

I then decided to scale the machine back down to see if it was the machine size or the storage size.

I'm back down to running the default db.t2.medium, but still with 100 GiB storage, and still seeing zippy API durations.
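
For anyone who wants to script the same storage bump rather than clicking through the console, a minimal boto3 sketch (the instance identifier is hypothetical, and as noted below control-tower's Terraform may revert out-of-band changes on the next deploy):

```python
import boto3  # requires AWS credentials with rds:ModifyDBInstance permission

rds = boto3.client("rds", region_name="us-east-1")

# Grow storage to 100GiB; gp2 baseline IOPS scale with allocated storage.
rds.modify_db_instance(
    DBInstanceIdentifier="my-concourse-db",  # hypothetical identifier
    AllocatedStorage=100,
    ApplyImmediately=True,
)

# Storage modifications can take a while; wait until the instance is available.
waiter = rds.get_waiter("db_instance_available")
waiter.wait(DBInstanceIdentifier="my-concourse-db")
```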

My recommendation to y'all might be to make the storage size of the DB a setting you can change via a deploy flag, or to make the default 100GiB. I'm pretty sure the next time I do a control-tower deploy, Terraform will scale it back down (or maybe even delete it? I'm new-ish to Terraform).

Anyway, for now I've solved our main problem, and I'm super excited. Thanks for all of your help! Feel free to close this Issue at your convenience.

acatxnamedvirtue commented 5 years ago

Here's something from the Concourse Discord; wish I had known, haha:

[screenshot from the Concourse Discord]

gerhard commented 5 years ago

@acatxnamedvirtue glad that you managed to pinpoint the root cause of your slow page loads!

I plan on rolling out node-exporter in our infra and replacing this Concourse dashboard with a Prometheus-based one, especially after https://github.com/concourse/concourse/pull/4247#issuecomment-527156575 gets merged. With node-exporter, it would be easy to show host metrics such as disk IOPS alongside Concourse metrics, which would have made the issue you've hit trivial to spot. Eventually, I would really like to see IaaS thresholds (e.g. disk IOPS limits) integrated into this new dashboard, so that we can have something like this (with "Memory available before publishers blocked" replaced by "Disk IOPS available"):

[screenshot: dashboard panel plotting "Memory available before publishers blocked" against its threshold]