Dashboard - Investigate using smaller instance type

catapult-project / catapult

Deprecated Catapult GitHub. Please instead use the "Speed>Benchmarks" component at http://crbug.com for bugs and https://chromium.googlesource.com/catapult for downloading and editing source code.

BSD 3-Clause "New" or "Revised" License

Dashboard - Investigate using smaller instance type #4456

Status: Closed (simonhatch closed this issue 5 years ago)

simonhatch commented 6 years ago

We currently use the largest possible instance, but that may not be necessary anymore since we removed the monitored property.

I.e., the costly update in /update_test_suites is probably well under the memory limit for a 512 MB or even a 256 MB instance.

I'm not aware of any other places in the code where we do big updates like that, or that may have OOM'd on smaller instances in the past.

@eakuefner @anniesullie

anniesullie commented 6 years ago

@dave-2 worked with me on debugging an outage we had last year. My recollection:

At that time, the perf dashboard did not specify hardware configs.
The default config changed from the beefiest to the leanest.
We pushed a new perf dashboard, and started getting OOMs in /add_point
Reverting to the old version worked, because the machine stats were somehow baked into the old version.
@dave-2 figured out the above, set the machine stats, and everything went back to normal.

The only reference I can find to this is https://codereview.chromium.org/2389613003 -- it doesn't mention a bug.

anniesullie commented 6 years ago

Backing up, you've made lots of changes to /add_point since that happened, so there's no guarantee that changing machine specs would trigger this again. But maybe we should look into ways to monitor memory usage, and experiment from there?
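As a rough illustration of what such monitoring could look like, here is a minimal sketch that logs the peak Python heap usage of a single handler call. It assumes a Python 3 runtime where the standard-library `tracemalloc` module is available; `log_peak_memory` and the stand-in `add_point` function are hypothetical names, not the dashboard's actual code.

```python
import functools
import logging
import tracemalloc


def log_peak_memory(handler):
    """Log the peak traced Python heap size observed while `handler` runs.

    Hypothetical helper: it only sees allocations made through Python's
    allocator, so it is a lower bound on the real process footprint.
    """
    @functools.wraps(handler)
    def wrapper(*args, **kwargs):
        already_tracing = tracemalloc.is_tracing()
        if not already_tracing:
            tracemalloc.start()
        try:
            return handler(*args, **kwargs)
        finally:
            # get_traced_memory() returns (current_bytes, peak_bytes).
            _, peak_bytes = tracemalloc.get_traced_memory()
            if not already_tracing:
                tracemalloc.stop()
            wrapper.last_peak_bytes = peak_bytes
            logging.info("%s peak traced memory: %.1f MiB",
                         handler.__name__, peak_bytes / 2**20)

    wrapper.last_peak_bytes = None
    return wrapper


@log_peak_memory
def add_point(payload):
    # Stand-in for the real handler: builds one row per payload character.
    rows = [{"value": ch} for ch in payload]
    return len(rows)
```

Collecting these numbers for the largest uploads over a few days would give a baseline to compare against the memory limit of a smaller instance class before actually switching.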

simonhatch commented 6 years ago

Following that link, there's a reference to another CR, which specifies a bug: crbug.com/648633

Sounds like bisect specified the lowest possible instance class, which is 128 MB. I could easily believe that would blow out /add_point on larger uploads. The raw chartjson alone could be a not-insignificant percentage of that.
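To make that percentage concrete, here is a small sketch of the arithmetic. The 20 MB upload size is a made-up figure for illustration, not a measurement from the dashboard, and `payload_share_percent` is a hypothetical helper. The instance sizes are the ones mentioned in this thread.

```python
# Instance memory sizes discussed in this thread, in MB.
INSTANCE_SIZES_MB = [128, 256, 512, 1024]


def payload_share_percent(payload_bytes, instance_mb):
    """Return the raw payload size as a percentage of instance memory."""
    return 100.0 * payload_bytes / (instance_mb * 1024 * 1024)


# Hypothetical upload size, chosen only to illustrate the calculation.
example_payload_bytes = 20 * 1024 * 1024

for size_mb in INSTANCE_SIZES_MB:
    share = payload_share_percent(example_payload_bytes, size_mb)
    print(f"{size_mb:>5} MB instance: raw payload is {share:.1f}% of memory")
```

Under that assumption the raw text alone would take 15.6% of a 128 MB instance, and the parsed in-memory structure is typically larger than the raw text, so the real share would be higher.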

Yeah, I agree that having some monitoring of memory usage in place would be a good first step here. I have no immediate plans to pursue this; I just filed it as I thought of it.

anniesullie commented 6 years ago

I definitely think it would be awesome to move forward here! Just wanted to note that we did have some issues before, and I didn't have the full history.

dave-2 commented 6 years ago

Pinpoint also uses 1G instances, which I recall was because generating results2 consumed a lot of memory. Since we use a stream now, can we reduce those?

simonhatch commented 6 years ago

Yeah, you might be OK dropping the instance class down quite a bit on Pinpoint now.

benshayden commented 5 years ago

Moved to https://bugs.chromium.org/p/chromium/issues/detail?id=917914