Closed keymon closed 7 years ago
We have created an issue in Pivotal Tracker to manage this:
https://www.pivotaltracker.com/story/show/135695553
The labels on this github issue will be updated when the story is started.
If you want, we can work on a patch. But let us know the approach you want to follow to fix it.
Is it OOMing in the staging container or the runtime container?
Runtime.
Also, the app appeared as running, and cf events did not report any problem. There were occasional errors from the gorouter (502) due to the gunicorn workers being killed.
The gunicorn logs showed that workers were constantly being started, which suggests that something was going on, but you need context to troubleshoot that and to know that they were being killed by the OOM killer.
We learnt it from the kernel messages.
@keymon Not really something we can address in the buildpack per se, as this code is only present and executed at staging time.
@ematpl who should we refer to in runtime land? Garden team? Diego? Mystery-OOMing has been a thing as long as I can recall.
@keymon Not really something we can address in the buildpack per se, as this code is only present and executed at staging time.
Not really, as we could configure gunicorn in a different way. As mentioned, we could add a hook on worker exit, or configure the gunicorn master to exit if a worker dies.
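The worker-exit hook mentioned above could look roughly like the following gunicorn config file. This is only a sketch: it assumes a gunicorn version that provides the `child_exit(server, worker)` server hook (called in the master after a worker has exited), and the hook is not told why the worker died, so the message can only hint at an OOM kill.

```python
# gunicorn.conf.py: a sketch, not a tested production config.
# Assumption: the installed gunicorn supports the child_exit server hook,
# which runs in the master process just after a worker has exited.


def child_exit(server, worker):
    """Log every worker exit from the master so restarts show up in the app logs."""
    # The hook receives no exit status, so an OOM kill is only one
    # possible cause; the message says "possibly" for that reason.
    server.log.warning(
        "Worker with pid %s exited and will be replaced by the master "
        "(possibly killed by the kernel OOM killer)",
        worker.pid,
    )
```

It would be loaded with something like `gunicorn -c gunicorn.conf.py app:app`, where `app:app` is a placeholder for the real application module.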
As it happens, Python buildpack is the only official buildpack that doesn't provide a launch command automatically. It always has to be user-supplied.
This is due to a decision by the Heroku upstream maintainer -- we don't agree, but as yet haven't decided to change our behaviour to provide default commands.
Somewhere in your deployment logic you must be using either an application manifest.yml, or passing the -c param to the cf CLI. Or, I suppose, doing something in a .profile script. As a start, that's where reconfiguring gunicorn would take place.
We could definitely update the doc examples, though. Can you give an example of what you want for gunicorn? We are regretfully shallow on python expertise at this end.
@keymon any thoughts on @jchesterpivotal's comments?
Sorry, I did not come back to you earlier. Let me discuss it with my team and I will come back to you tomorrow.
We have decided not to pursue this as it turns out CloudFoundry has built-in support for monitoring OOM events in containers, so we are going to focus on exposing this to tenants rather than modifying the buildpack to make it more obvious. Feel free to close this issue!
Thanks for the discussion in Slack about this matter, @Jonty. I've captured the request for better messaging about OOM events in https://www.pivotaltracker.com/story/show/136243763 in the Diego team's tracker.
Best, Eric
Cool @ematpl!
@Jonty also did a possible implementation, https://github.com/cloudfoundry/executor/pull/20. Could you refer to it in your card as a comment?
Thx
Thanks, @keymon, I added a link to @Jonty's PR in the description of that story.
as it turns out CloudFoundry has built-in support for monitoring OOM events in containers
@Jonty could you share what this built-in support is, and how it could be shared? I cannot find it in the docs.
@vanschelven Apologies, it's been 7 years since I last looked at this and I can't remember anything about it!
v245
python_buildpack-cached-v1.5.10.zip
When running an application that uses python gunicorn with very little memory, the gunicorn workers get killed by the kernel OOM. That is fine and expected.
But the gunicorn master is monitoring those workers and restarts them immediately, which is fine.
The problem is that this restart event is never logged properly, or detected by garden+rep, so you cannot easily visualise what is happening.
cf events does not reflect any change in the container, because the gunicorn master never gets restarted, and cf logs only shows the message
ERR [2016-12-07 11:45:54 +0000] [31798] [INFO] Booting worker with pid: 31798
which does not give much information about what is going on.
To get more information about what is actually happening, we could log that the worker has been killed and restarted, if possible explaining that it was due to an Out of Memory kill.
Maybe this can be implemented with hooks/signals http://docs.gunicorn.org/en/stable/signals.html
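One way a supervising process can tell a kill apart from a normal exit is to look at how the child terminated. The sketch below is independent of gunicorn and assumes a POSIX system; `describe_exit` is a hypothetical helper name. Note that SIGKILL only suggests an OOM kill, since anything can send that signal; the kernel messages remain the authoritative source.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a human-readable explanation."""
    if returncode is None:
        return "still running"
    if returncode >= 0:
        return "exited with status %d" % returncode
    # On POSIX, a negative return code -N means "terminated by signal N".
    sig = signal.Signals(-returncode)
    if sig is signal.SIGKILL:
        return "killed by SIGKILL (possibly the kernel OOM killer)"
    return "killed by %s" % sig.name


if __name__ == "__main__":
    # Start a child that just sleeps, then kill it with SIGKILL,
    # which is the signal the OOM killer uses.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    child.kill()
    child.wait()
    print(describe_exit(child.returncode))
```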
I also wonder: