edgi-govdata-archiving / web-monitoring-processing

Tools for access, "diff"-ing, and analyzing archived web pages
https://edgi-govdata-archiving.github.io/web-monitoring-processing
GNU General Public License v3.0

Add resource limits for diffing server #154

Closed · Mr0grog closed this issue 4 years ago

Mr0grog commented 6 years ago

During startup, the diffing server should be able to read an argument or environment variable to set limits on resource usage (our working theory on the lockups we’ve seen is that some diff requests cause the process to eat all of a machine’s available memory). See https://docs.python.org/3/library/resource.html#resource.setrlimit
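A minimal sketch of what that startup hook could look like, assuming a hypothetical `DIFFER_MEMORY_LIMIT` environment variable (in bytes); note that `resource` is Unix-only and `RLIMIT_AS` behavior is platform-dependent:

```python
# Sketch only: DIFFER_MEMORY_LIMIT is a hypothetical variable name, and the
# resource module is Unix-only (exact behavior varies by platform).
import os
import resource

def apply_memory_limit():
    limit = os.environ.get('DIFFER_MEMORY_LIMIT')
    if not limit:
        return
    soft = int(limit)
    # Lower only the soft limit on total address space; past it, large
    # allocations raise MemoryError instead of exhausting the machine.
    _, hard = resource.getrlimit(resource.RLIMIT_AS)
    resource.setrlimit(resource.RLIMIT_AS, (soft, hard))
```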

danielballan commented 6 years ago

Potentially useful for this: https://pypi.org/project/superlance/
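For reference, superlance’s memmon plugin runs as a supervisord event listener and restarts a program when its memory use crosses a threshold. A sketch of the config, where the program name and 500MB threshold are illustrative, not decided here:

```ini
; Hypothetical supervisord config fragment; "diff-server" and the 500MB
; threshold are illustrative.
[eventlistener:memmon]
command=memmon -p diff-server=500MB
events=TICK_60
```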

danielballan commented 6 years ago

I think the right approach here is two layers:

  1. Put a limit on the maximum number of bytes of content we will download. The requests library supports this via the stream=True parameter and the iter_content(chunk_size) method on responses (see the sketch after this list). This limit will ensure that one giant request can't knock over an otherwise happy server.
  2. Use a second process to monitor the memory usage of our server process (a la superlance, above) and kill it if it gets too greedy. This will ensure that a server overwhelmed with many requests can't consume so much system memory that, say, ssh breaks.
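A sketch of layer 1 under those assumptions; the 10 MB cap and the function name are illustrative, not something settled in this thread:

```python
# Sketch of a size-capped download; MAX_BODY_BYTES and download_limited
# are hypothetical names, and the 10 MB cap is illustrative.
import requests

MAX_BODY_BYTES = 10 * 1024 * 1024

def download_limited(url, max_bytes=MAX_BODY_BYTES):
    """Download a response body, aborting once it exceeds max_bytes."""
    with requests.get(url, stream=True, timeout=30) as response:
        response.raise_for_status()
        # Check Content-Length first so obviously oversized responses fail
        # before we read any body bytes.
        declared = response.headers.get('Content-Length')
        if declared and int(declared) > max_bytes:
            raise ValueError(f'Response too big: {declared} bytes')

        chunks = []
        received = 0
        for chunk in response.iter_content(chunk_size=64 * 1024):
            received += len(chunk)
            # Content-Length can lie or be absent, so also enforce the cap
            # on the bytes actually read.
            if received > max_bytes:
                raise ValueError(f'Response exceeded {max_bytes} bytes')
            chunks.append(chunk)
        return b''.join(chunks)
```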

Using resource is apparently not a standard way to do this. Its behavior is platform-dependent, and trusting an overburdened process to police itself is obviously problematic.

jsnshrmn commented 6 years ago
  2. Use a second process to monitor the memory usage of our server process (a la superlance, above) and kill it if it gets too greedy. This will ensure that a server overwhelmed with many requests can't consume so much system memory that, say, ssh breaks.

Alternatively (or additionally), you could just do a simple check of the system state before job execution. I actually have a super simple shell wrapper I wrote to prevent cron jobs from burning up burst credits on AWS t2 instances. Let me see if I can turn that code up as an example.

danielballan commented 6 years ago

On the dev call, we concluded that:

Mr0grog commented 6 years ago

Alternatively (or additionally), you could just do a simple check of the system state before job execution.

FWIW, I think that’s a great approach for a job-based system. But this is a single long-running HTTP server process.

jsnshrmn commented 6 years ago

Ah, right: supervisor fires up the processes and they stay up as long as they can. I haven't poked around the code for the web servers yet, but it sounds like I should.

A little bit of a digression, but with a point: systemd knows how to natively handle Python code as a service. There are some nice benefits to having your code run directly as a systemd service, since there are robust tools for managing long-running processes, including configuring resource limits. This seems like a good fit here, so that we don't add another application with its own idiosyncrasies into the mix.

Could this problem be solved by dropping supervisord altogether and just writing some systemd service config files that include resource limits? Am I missing something crucial that supervisord is doing for us?
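For concreteness, a hypothetical unit file along those lines; every path, name, and limit here is illustrative rather than taken from this repo:

```ini
# Hypothetical /etc/systemd/system/diff-server.service; the ExecStart path,
# module name, and limits are illustrative.
[Unit]
Description=web-monitoring diffing server
After=network.target

[Service]
ExecStart=/opt/web-monitoring-processing/.venv/bin/python -m web_monitoring.diff_server
Restart=on-failure
# Cgroup memory cap: the kernel kills the service past this
# (MemoryLimit= on older systemd versions).
MemoryMax=1G
# Classic rlimit on address space, comparable to resource.setrlimit(RLIMIT_AS, ...).
LimitAS=2G

[Install]
WantedBy=multi-user.target
```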

jsnshrmn commented 6 years ago

@danielballan you can see some example Ansible templates here: https://github.com/OULibraries/ansible-role-celery-worker/tree/master/templates The examples don't set resource limits, but they do fire up Python processes that work within an activated virtual environment.

Mr0grog commented 6 years ago

Note: the diff server has fallen over again, likely due to some combination of the issues here and #169 (that one’s on me).

Mr0grog commented 6 years ago

Related note: we should implement health/status check endpoints for our services. I thought we’d talked about that before, but I can’t find an issue tracking it. It would help other tools notify us about problems or auto-restart unhealthy services (e.g. Kubernetes can recreate a container when it stops responding).
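A minimal sketch of such an endpoint, assuming a Tornado-style server; the /healthcheck path, handler name, and port are illustrative:

```python
# Sketch of a liveness endpoint; the route, handler name, and port 8888
# are hypothetical choices for illustration.
import tornado.ioloop
import tornado.web

class HealthCheckHandler(tornado.web.RequestHandler):
    def get(self):
        # Respond cheaply and without touching external dependencies, so a
        # monitor (or Kubernetes liveness probe) only fails when the
        # process itself is wedged.
        self.write({'status': 'ok'})

if __name__ == '__main__':
    app = tornado.web.Application([(r'/healthcheck', HealthCheckHandler)])
    app.listen(8888)
    tornado.ioloop.IOLoop.current().start()
```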

Mr0grog commented 6 years ago

^ I will make relevant issues if people are down with that.

danielballan commented 5 years ago

Yes, please go for it. I'm not aware of any open issues. I agree we discussed it.

Mr0grog commented 5 years ago

Bumping this up for the simple “max request size” case, since the differ fell over again this morning and, from EC2 monitoring, it looks like that could have been the culprit:

[Screenshot: EC2 monitoring graph, 2018-09-26 8:44 AM]

Mr0grog commented 5 years ago

Had another incident related to this today. @jsnshrmn did most of the leg-work in resolving it, while I tracked down the cause.

jsnshrmn commented 5 years ago

@danielballan @Mr0grog Are resource limits here still relevant in light of our ability to limit things via Kubernetes? We could spend more time improving the scheduler config in Kubernetes, or spend some time implementing better behavior here.

Mr0grog commented 5 years ago

I think @danielballan’s comments earlier are still accurate:

  1. Put a limit on the maximum number of bytes of content we will download…

  2. …trusting an overburdened process to police itself is obviously problematic.

So, I think there’s clearly still work to be done on limiting the number of bytes we load for a resource we are going to diff. (Should we create a separate issue for that, or change this issue’s name?) See also the discussion in #293.

And on the generalized side, Dan’s right that the process shouldn’t police itself, so in our deployment, that should probably be Kubernetes. I don’t think we’ve seen the differs lock up completely since we implemented liveness probes, so I think we might have done a decent enough job there. Is there more you think we could or should do in the Kubernetes config that would be worth the effort?
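For reference, the kind of Kubernetes config this implies; a hypothetical container spec fragment where the image, probe path, and thresholds are all illustrative:

```yaml
# Hypothetical container spec fragment; names, image, and numbers are
# illustrative, not taken from this deployment.
containers:
  - name: diff-server
    image: example/diff-server:latest
    resources:
      requests:
        memory: "512Mi"
      limits:
        memory: "1Gi"        # the kubelet OOM-kills the container past this
    livenessProbe:
      httpGet:
        path: /healthcheck   # hypothetical path; see the health-check discussion above
        port: 8888
      initialDelaySeconds: 10
      periodSeconds: 30
      failureThreshold: 3    # restart after ~90s of failed probes
```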