cloudfoundry / python-buildpack

Cloud Foundry buildpack for the Python Language
http://docs.cloudfoundry.org/buildpacks/
Apache License 2.0

Not enough information about workers being killed by kernel OOM #58

Closed keymon closed 7 years ago

keymon commented 7 years ago

What version of Cloud Foundry are you using?

v245

What version of the buildpack you are using?

python_buildpack-cached-v1.5.10.zip

If you were attempting to accomplish a task, what was it you were attempting to do?

When running an application that uses Python gunicorn with very little memory, the gunicorn workers get killed by the kernel OOM killer. That is fine and expected.

But the gunicorn master is monitoring those workers and restarts them immediately, which is fine.

The problem is that this restart event is never logged properly, or detected by garden+rep, so you cannot easily visualise what is happening.

What was the actual behavior?

What did you expect to happen?

To get more information about what is actually happening. We could log that the worker has been killed and restarted, if possible explaining that it was due to an out-of-memory condition.

Maybe this can be implemented with hooks/signals http://docs.gunicorn.org/en/stable/signals.html

I also wonder:

cf-gitbot commented 7 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/135695553

The labels on this github issue will be updated when the story is started.

keymon commented 7 years ago

If you want, we can work on a patch. But let us know the approach you want to follow to fix it.

jchesterpivotal commented 7 years ago

Is it OOMing in the staging container or the runtime container?

keymon commented 7 years ago

Runtime.

Also, the app appeared as running, and cf events did not report any problem. There were occasional errors from the gorouter (502) due to the gunicorn workers being killed.

The gunicorn logs showed that workers were constantly being started, which would suggest that something was going on, but you need context to troubleshoot that and to know that they were being killed by the OOM killer.

We learnt it from the kernel messages.
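For context on why this is hard to see from the application side: the kernel OOM killer terminates a process with SIGKILL, so the killed process cannot log anything itself, and a supervising parent sees only that the child died from signal 9. The following stand-alone sketch shows what a parent can learn from a child's exit status. `describe_exit` is a hypothetical helper and the sleeping child is a stand-in for a worker; none of this is gunicorn or Cloud Foundry code, and it assumes a POSIX system.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Explain a subprocess return code the way a supervisor might log it.

    On POSIX, subprocess reports death-by-signal as a negative return code.
    """
    if returncode < 0:
        sig = signal.Signals(-returncode)
        # SIGKILL is what the OOM killer sends, but other senders use it too,
        # so this can only ever be a hint, not a diagnosis.
        hint = " (possibly the kernel OOM killer)" if sig == signal.SIGKILL else ""
        return f"killed by {sig.name}{hint}"
    return f"exited with status {returncode}"


if __name__ == "__main__":
    # Stand-in for a worker: a child process that would sleep for a minute.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    child.kill()  # sends SIGKILL on POSIX, the same signal the OOM killer uses
    print(describe_exit(child.wait()))
```

The exit status alone cannot distinguish an OOM kill from any other SIGKILL, which is why the kernel messages were still needed to confirm the cause.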

jchesterpivotal commented 7 years ago

@keymon Not really something we can address in the buildpack per se, as this code is only present and executed at staging time.

@ematpl who should we refer to in runtime land? Garden team? Diego? Mystery-OOMing has been a thing as long as I can recall.

keymon commented 7 years ago

> @keymon Not really something we can address in the buildpack per se, as this code is only present and executed at staging time.

Not really, as we could configure gunicorn in a different way. As mentioned, we could add a hook on worker exit, or configure the gunicorn master to exit if a worker dies.
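The worker-exit hook idea could be sketched as a gunicorn configuration file. This is a minimal sketch under assumptions, not anything the buildpack ships: the file name `gunicorn.conf.py` and the log wording are made up for illustration, while `child_exit` is gunicorn's server hook that runs in the master process after a worker has exited. The hook is not passed the worker's exit status, so it can only report that a worker went away; it cannot confirm that the OOM killer was responsible.

```python
# gunicorn.conf.py (assumed file name, passed to gunicorn with -c)


def child_exit(server, worker):
    """gunicorn server hook: runs in the master just after a worker exits.

    A worker killed with SIGKILL cannot run its own hooks, so the master
    is the only place left to record that it disappeared.
    """
    # server.log is the master's logger; worker.pid is the dead worker's PID.
    server.log.warning(
        "worker pid=%s exited and will be replaced; if this keeps happening, "
        "check the kernel messages for out-of-memory kills",
        worker.pid,
    )
```

It would be loaded with something like `gunicorn -c gunicorn.conf.py app:app`, where `app:app` is a placeholder for the application's module and callable. The hook fires on every worker exit, including graceful restarts, so the message is deliberately worded as a hint rather than a diagnosis.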

jchesterpivotal commented 7 years ago

As it happens, Python buildpack is the only official buildpack that doesn't provide a launch command automatically. It always has to be user-supplied.

This is due to a decision by the Heroku upstream maintainer -- we don't agree, but as yet haven't decided to change our behaviour to provide default commands.

Somewhere in your deployment logic you must be using either an application manifest.yml, or passing the -c param to the cf CLI. Or, I suppose, doing something in a .profile script. As a start, that's where reconfiguring gunicorn would take place.

jchesterpivotal commented 7 years ago

We could definitely update the doc examples, though. Can you give an example of what you want for gunicorn? We are regretfully shallow on python expertise at this end.

athornton2012 commented 7 years ago

@keymon any thoughts on @jchesterpivotal's comments?

keymon commented 7 years ago

Sorry I did not get back to you earlier. Let me discuss it with my team and I will get back to you tomorrow.

Jonty commented 7 years ago

We have decided not to pursue this as it turns out CloudFoundry has built-in support for monitoring OOM events in containers, so we are going to focus on exposing this to tenants rather than modifying the buildpack to make it more obvious. Feel free to close this issue!

emalm commented 7 years ago

Thanks for the discussion in Slack about this matter, @Jonty. I've captured the request for better messaging about OOM events in https://www.pivotaltracker.com/story/show/136243763 in the Diego team's tracker.

Best, Eric

keymon commented 7 years ago

Cool @ematpl!

@Jonty also wrote a possible implementation: https://github.com/cloudfoundry/executor/pull/20. Could you refer to it in a comment on your card?

Thx

emalm commented 7 years ago

Thanks, @keymon, I added a link to @Jonty's PR in the description of that story.

vanschelven commented 1 year ago

> as it turns out CloudFoundry has built-in support for monitoring OOM events in containers

@Jonty could you share what this built-in support is, and how it could be shared? I cannot find it in the docs.

Jonty commented 1 year ago

@vanschelven Apologies, it's been 7 years since I last looked at this and I can't remember anything about it!