A few changes to help the scheduler scale better:
Flask worker is now async (with gevent). This allows for up to 10x the number of simultaneous queries, and should let "fast" queries complete without waiting behind "slow" ones.
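
For reviewers who haven't used gevent-based workers, here is a minimal sketch of what such a setup can look like. It assumes the Flask app is served by gunicorn, which this PR description does not state; the file name and all values are illustrative, not the ones in this change.

```python
# gunicorn.conf.py (hypothetical; assumes gunicorn serves the Flask app)

# Use gevent's cooperative worker instead of the default sync worker, so a
# request blocked on I/O yields to other requests in the same process.
worker_class = "gevent"

# Illustrative values only, not taken from this PR.
workers = 4                # number of worker processes
worker_connections = 100   # max simultaneous clients per gevent worker
```

With a sync worker each process handles one request at a time; with a gevent worker the per-process limit becomes `worker_connections`, which is where a large jump in simultaneous queries would come from.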
Postgres settings tuned. I've raised the memory settings across the board so the database actually uses the RAM available on the m5.8xlarge; the defaults are REALLY low.
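
As a rough illustration of the sizing involved, the sketch below derives starting points from total RAM (an m5.8xlarge has 128 GiB). The 25% figure for `shared_buffers` is the starting value suggested in the PostgreSQL documentation, and 75% for `effective_cache_size` is a common rule of thumb. The function name and ratios are assumptions for illustration, not the values set in this PR.

```python
def suggest_memory_settings(total_ram_gib: int) -> dict:
    """Return rule-of-thumb starting values for Postgres memory settings.

    Hypothetical helper; the ratios are generic guidance, not this PR's values.
    """
    return {
        # PostgreSQL's default is 128MB; ~25% of RAM is a usual starting point.
        "shared_buffers": f"{total_ram_gib // 4}GB",
        # Planner hint only (allocates nothing); PostgreSQL's default is 4GB.
        "effective_cache_size": f"{total_ram_gib * 3 // 4}GB",
    }


# An m5.8xlarge has 128 GiB of RAM.
settings = suggest_memory_settings(128)
for name, value in settings.items():
    print(f"{name} = {value}")
```

For 128 GiB this yields `shared_buffers = 32GB` and `effective_cache_size = 96GB`, which shows how far the stock defaults are from what the instance can support.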
Extra postgres metrics. Now with table sizes. If things break again we can compare these metrics against the memory settings tuned above. If a table size crosses a memory threshold and things break at the same time, that'll tell us where to look next. There are also some counters for index usage that I hadn't noticed before, which should tell us how to improve our indexes. :)
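
For reference, table sizes and index-usage counters like these can be read straight from Postgres's statistics views. The sketch below is an assumption about how one might pull them by hand, not the collector added in this PR; `fetch_table_metrics` and `tables_over_threshold` are hypothetical helpers, and `cursor` can be any DB-API cursor (for example one from psycopg2).

```python
# Per-table size (including indexes and TOAST) plus scan counters.
TABLE_METRICS_SQL = """
SELECT s.relname,
       pg_total_relation_size(s.relid) AS total_bytes,
       s.seq_scan,
       COALESCE(s.idx_scan, 0) AS idx_scan
FROM pg_stat_user_tables AS s
ORDER BY total_bytes DESC
"""


def fetch_table_metrics(cursor):
    """Run the query and return one dict per user table."""
    cursor.execute(TABLE_METRICS_SQL)
    columns = ("table", "total_bytes", "seq_scan", "idx_scan")
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


def tables_over_threshold(metrics, threshold_bytes):
    """Names of tables whose total size exceeds a memory threshold."""
    return [m["table"] for m in metrics if m["total_bytes"] > threshold_bytes]
```

A table with a high `seq_scan` count and a low `idx_scan` count is a plausible candidate for a new or better index, and `tables_over_threshold` is the kind of check that would flag a table outgrowing one of the memory settings.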
I was having some difficulty with crash looping that seemed to be unique to me. I haven't done anything to directly fix this, but poking at values in the task definition has somewhat mitigated it. I don't think anything there should cause problems, but if it suddenly causes problems in your setup we'll roll it back.