simonhatch closed 5 years ago
@dave-2 worked with me on debugging an outage we had last year. My recollection:
The only reference I can find to this is https://codereview.chromium.org/2389613003 -- it doesn't reference the bug.
Backing up, you've made lots of changes to /add_point since that happened, so there's no guarantee that changing machine specs would trigger this again. But maybe we should look into ways to monitor memory usage, and experiment from there?
Following that link, there's a reference to another CR, which specifies a bug: crbug.com/648633
Sounds like bisect specified the lowest possible instance class, which is 128 MB. I could easily believe that would blow out /add_point on larger uploads. The raw chartjson alone could be a not-insignificant percentage of that.
Yeah I'd agree having some monitoring of memory usage in place would be a good first step here. I have no immediate plans to pursue this, just filed it as I thought of it.
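For whoever picks this up, here is a minimal sketch of what a first pass at memory monitoring could look like. It uses the standard-library `tracemalloc` module to record peak Python heap usage around a call. `measure_peak_mb`, `fits_instance`, the `headroom` fraction, and `fake_add_point` are all hypothetical names made up for illustration, not anything in the dashboard code. Note that `tracemalloc` only tracks Python-level allocations, so it would undercount the process's real footprint.

```python
import tracemalloc


def measure_peak_mb(func, *args, **kwargs):
    """Run func and return (result, peak traced Python heap in MB)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / (1024 * 1024)


def fits_instance(peak_mb, limit_mb, headroom=0.5):
    """True if peak usage stays within a fraction of the instance limit.

    headroom is an assumed safety margin, not a measured value.
    """
    return peak_mb <= limit_mb * headroom


def fake_add_point(n):
    # Hypothetical stand-in for a handler that builds a large payload.
    return len([float(i) for i in range(n)])


if __name__ == "__main__":
    _, peak_mb = measure_peak_mb(fake_add_point, 100000)
    for limit_mb in (128, 256, 512, 1024):
        print(limit_mb, "MB instance ok:", fits_instance(peak_mb, limit_mb))
```

Logging a number like this per request would give us real data to compare against each instance class before experimenting with smaller ones.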
I definitely think it would be awesome to move forward here! Just wanted to note that we did have some issues before, and I didn't have the full history.
Pinpoint also uses 1G instances, which I recall was because generating results2 consumed a lot of memory. Since we use a stream now, can we reduce those?
Yeah you might be ok dropping the instance class down quite a bit now on Pinpoint.
We currently use the largest possible instance, but that may not be necessary anymore since we removed the monitored property.
I.e., the costly update in /update_test_suites is probably well under the limit for a 512 MB or even a 256 MB instance.
I'm not aware of any other places in the code where we do big updates like that and that may have OOM'd on smaller instances in the past.
@eakuefner @anniesullie