acatxnamedvirtue opened this issue 5 years ago
Hi @acatxnamedvirtue, thanks for reaching out.
Right now we don't have any plans to add new instance types, although it doesn't sound like a massive change (famous last words; it gets tricky when instance types aren't supported in all zones).
The latest release of Control Tower has an improved metrics dashboard courtesy of @gerhard. It doesn't yet show database metrics, but maybe this might help dig deeper into where the issues are arising?
@acatxnamedvirtue it's most likely network latency / network throughput / disk IOPS on the db. It may be CPU contention on the web instances, but this is less likely. Without metrics, it's just a guess.
This is a real-world example of the Concourse Dashboard that @DanielJonesEB mentioned. In RabbitMQ's Concourse case, the db is consistently averaging ~115 Mbps of outgoing network traffic (with bursts of up to 225 Mbps). My suspicion is that this is your bottleneck, especially if you are using a managed DB instance.
API Response Duration is also worth looking at:
It's worth noting that our implementation of Gerhard's dashboard doesn't have the DB metrics graphs because we use RDS/CloudSQL rather than a BOSH-deployed VM. You may be able to figure out the DB network metrics via the IaaS, though.
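If it helps, something like this should pull the relevant numbers straight out of CloudWatch for an RDS-backed deployment (the instance identifier is a placeholder; swap the metric name for ReadIOPS / WriteIOPS to look at disk instead of network):

```sh
# Average outgoing network throughput of the Concourse DB over the last hour,
# in 5-minute buckets (GNU date syntax; adjust the timestamps on macOS).
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name NetworkTransmitThroughput \
  --dimensions Name=DBInstanceIdentifier,Value=<your-db-instance-id> \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Average
```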
Ah, thanks for the pointers! I was able to take these and try a couple of things out. Things I learned:
Here's the performance increase I saw in the API duration almost immediately afterwards:
The web UI is SO FAST NOW, which is super exciting!!
I then decided to scale the machine back down to see if it was the machine size or the storage size.
I'm back down to running the default db.t2.medium, but still with 100 GiB storage, and still seeing zippy API durations.
My recommendation to y'all might be to make the DB storage size a setting you can change via a deploy flag, or to make the default 100 GiB. I'm pretty sure the next time I do a control-tower deploy, Terraform will scale it back down (or maybe even delete it? I'm new-ish to Terraform).
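In case it's useful to anyone else, the CLI equivalent of the storage bump is roughly the following (the instance identifier is a placeholder, and as noted above the next control-tower deploy may revert it). My understanding is that on gp2 volumes baseline IOPS scale with the allocated size (~3 IOPS per GiB), which would explain why the bigger disk alone made the difference:

```sh
# Bump the Concourse RDS volume to 100 GiB out-of-band; Terraform may undo this
# on the next control-tower deploy.
aws rds modify-db-instance \
  --db-instance-identifier <your-db-instance-id> \
  --allocated-storage 100 \
  --apply-immediately
```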
Anyway, for now I've solved our main problem, and I'm super excited. Thanks for all of your help! Feel free to close this Issue at your convenience.
Here's something from the Concourse Discord, wish I had known haha:
@acatxnamedvirtue glad that you managed to pinpoint the root cause of your slow page loads!
I plan on rolling out node-exporter in our infra and replacing this Concourse dashboard with a Prometheus-based one, especially after https://github.com/concourse/concourse/pull/4247#issuecomment-527156575 gets merged. With node-exporter, it would be easy to show host metrics such as disk IOPS alongside Concourse metrics, which would have made the issue you hit trivial to spot. Eventually, I would really like to see IaaS thresholds (e.g. disk IOPS limits) integrated into this new dashboard, so that we can have something like this (replace "Memory available before publishers blocked" with "Disk IOPS available"):
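As a minimal sketch of the node-exporter piece (default port; the Prometheus scrape config and dashboard wiring are out of scope here):

```sh
# Run node_exporter on a web/db host and confirm the disk metrics it exposes,
# so disk IOPS can sit alongside the Concourse metrics in the new dashboard.
./node_exporter --web.listen-address=":9100" &
curl -s localhost:9100/metrics | grep '^node_disk_'
```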
Hello there!
We use control-tower to deploy our Concourse instance to AWS and we absolutely love it. However, as we add more jobs and more pipelines, we are experiencing super slow page load times.
We're currently deploying with these flags: `--iaas aws --region us-east-1 --workers 4 --worker-type m5 --worker-size 4xlarge --web-size 2xlarge`
And even though the web-size is 2xlarge, it's still very slow (3-6s page load times). Looking at the network tab, most of that time is coming from the "pipelines" and "jobs" calls. We could split pipelines out into separate teams, but since we're a fairly small company (100ish engineers) we appreciate the pipeline visibility, especially during on-call rotations where quickly redeploying a last known version is helpful. We could also start spinning up new control-tower Concourse deployments on various sub-domains, but that's a little annoying from a management perspective.
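For anyone who wants to reproduce those timings outside the browser, something along these lines works; `$CONCOURSE_URL` and `$CONCOURSE_TOKEN` below are placeholders for the web URL and a valid bearer token (e.g. the one fly stores in ~/.flyrc):

```sh
# Time the two API calls the dashboard spends most of its time in.
for path in /api/v1/pipelines /api/v1/jobs; do
  curl -s -o /dev/null \
    -H "Authorization: Bearer $CONCOURSE_TOKEN" \
    -w "$path -> %{time_total}s\n" \
    "$CONCOURSE_URL$path"
done
```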
We're wondering if you have any insight into this, or if you are planning on bumping up the options for maximum web node size (t3s would be particularly nice, but beefier instances would be great too), or maybe it's just time for us to figure out the BOSH deployment on our own :)
Thanks for your help!
Tyler Beebe, Software Engineer, Meetup