cloudfoundry / jvmkill

Terminate the JVM when resources are exhausted
Apache License 2.0
30 stars 11 forks source link

Heap histogram and dedicated parameters parsing code #2

Closed rafaelri closed 8 years ago

rafaelri commented 8 years ago

This PR integrates polarbear heap histogram code into jvmkill agent. Apart from this, the code for parsing agent parameters has been moved to a new ParametersParser class that creates AgentParameters structs from agent command line parameters. Also there were changes to the jvmkill.c++ in order to extract the logic that mediated the OOM events between the Heuristics implementation and the registered actions this resulted in the new class AgentController that aims to hold all this responsibility and also to make tests easier, turning jvmkill.c++ in a simple "glue" between the JVM and our agent.

cfdreddbot commented 8 years ago

Hey rafaelri!

Thanks for submitting this pull request! I'm here to inform the recipients of the pull request that you and the commit authors have already signed the CLA.

glyn commented 8 years ago

Thanks @rafaelri. I notice the PolarBear code is not MIT licensed, contrary to what we thought, so I'm checking the acceptability of the license.

rafaelri commented 8 years ago

@glyn ok. I guess we have two possibilities: rewritting without anything from Polarbear code or I guess (if the combination thing is the issue with the Polarbear license and not its distribution) perharps we could make it a separate .so, dlsym it and call it from ours (ugly but still a possibility isn't it?). Although I really feel like rewritting is our best option.

2016-04-22 10:14 GMT-03:00 Glyn Normington notifications@github.com:

Thanks @rafaelri https://github.com/rafaelri. I notice the PolarBear code is not MIT licensed, contrary to what we thought, so I'm checking the acceptability of the license.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/cloudfoundry/jvmkill/pull/2#issuecomment-213424539

glyn commented 8 years ago

@rafaelri I agree rewriting the PolarBear code is the best option and I'll start looking into that today.

rafaelri commented 8 years ago

@glyn nice. Let me help you on this. I hope next week I'll have time to work on it. Em 25/04/2016 04:25, "Glyn Normington" notifications@github.com escreveu:

@rafaelri https://github.com/rafaelri I agree rewriting the PolarBear code is the best option and I'll start looking into that today.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/cloudfoundry/jvmkill/pull/2#issuecomment-214179658

glyn commented 8 years ago

Hi @rafaelri. @cgfrost took your PR, deleted the PolarBear code, and pushed it to a new PolarBear-ReWrite branch. I quickly stubbed out the histogram printing function to get the code to compile. There is a build error on mac os x which we need to look into, but hopefully the branch should build ok on Linux, although I haven't had a chance to try it.

If you want to take a crack at the rewrite, please do it without reference to the PolarBear code to avoid copyright contamination. I suggest using JVMTI FollowReferences to traverse all the live objects in the heap. We can set a tag when we visit an object for the first time and then check that the tag isn't the one we set before we visit an object and set the "visit control flags" to zero if we have already visited the object (and therefore don't want to visit it again) and to JVMTI_VISIT_OBJECTS is we haven't already visited the object. We need to choose a tag suitably so that it is unlikely to have already been used. As we traverse the heap, we need to accumulate the number of objects of each class and the total memory consumed by the objects of each class in order to print out the histogram. I guess we'll need a datastructure such as a hashtable to record the data for each class. We also need to be able to enumerate the classes in the hashtable, so we might need to keep a separate list depending on what kind of hashtable we use (my C++ is decades old and therefore very rusty).

How does that sound? Have you any better ideas?

rafaelri commented 8 years ago

Hi @glyn as I told you this week will be tough for me but I'll have a look at the documentation you sent. Hopefully next week I'll be able to work on this.

2016-04-25 13:09 GMT-03:00 Glyn Normington notifications@github.com:

Hi @rafaelri https://github.com/rafaelri. @cgfrost https://github.com/cgfrost took your PR, deleted the PolarBear code, and pushed it to a new PolarBear-ReWrite branch. I quickly stubbed out the histogram printing function to get the code to compile. There is a build error on mac os x which we need to look into, but hopefully the branch should build ok on Linux, although I haven't had a chance to try it.

If you want to take a crack at the rewrite, please do it without reference to the PolarBear code to avoid copyright contamination. I suggest using JVMTI FollowReferences http://docs.oracle.com/javase/7/docs/platform/jvmti/jvmti.html#FollowReferences to traverse all the live objects in the heap. We can set a tag when we visit an object for the first time and then check that the tag isn't the one we set before we visit an object and set the "visit control flags" to zero if we have already visited the object (and therefore don't want to visit it again) and to JVMTI_VISIT_OBJECTS is we haven't already visited the object. We need to choose a tag suitably so that it is unlikely to have already been used. As we traverse the heap, we need to accumulate the number of objects of each class and the total memory consumed by the objects of each class in order to print out the histogram. I guess we'll need a datastructure such as a hashtable to record the data for each class. We a lso need to be able to enumerate the classes in the hashtable, so we might need to keep a separate list depending on what kind of hashtable we use (my C++ is decades old and therefore very rusty).

How does that sound? Have you any better ideas?

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/cloudfoundry/jvmkill/pull/2#issuecomment-214421207

glyn commented 8 years ago

@rafaelri no problem. We just need to share whatever progress we make so that we don't duplicate effort.

rafaelri commented 8 years ago

@glyn do you have any suggestions for sharing the progress? And also I'll have the monday morning free I guess we could do a huge progress on this next monday. Em 26/04/2016 05:15, "Glyn Normington" notifications@github.com escreveu:

@rafaelri https://github.com/rafaelri no problem. We just need to share whatever progress we make so that we don't duplicate effort.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/cloudfoundry/jvmkill/pull/2#issuecomment-214663864

glyn commented 8 years ago

@rafaelri We should simply push our work to a branch each day we get to work on this and drop a comment in this issue to say how far we've got. Please note next Monday is a public holiday in the UK, so I won't be around.

glyn commented 8 years ago

@rafaelri I'm assuming you haven't made any progress in the last couple of weeks. I'll see what I can do today and push that.

rafaelri commented 8 years ago

Yep. Last minute personal problems to solve but I'll be online soon and although I didn't finish any production code I made some progress. Em 09/05/2016 05:47, "Glyn Normington" notifications@github.com escreveu:

@rafaelri https://github.com/rafaelri I'm assuming you haven't made any progress in the last couple of weeks. I'll see what I can do today and push that.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/cloudfoundry/jvmkill/pull/2#issuecomment-217809149

glyn commented 8 years ago

@rafaelri Ok, hope things work out. Meanwhile, I'm trying to understand why the PolarBear-ReWrite branch which includes your changes, fails to build with:

dlsym(0x7fdf0bc03380, resourceExhausted): symbol not found

in the ctests make target. At master, where this test passes, there is a _resourceExhausted symbol in the resultant libmjvmkill.so, but with the PolarBear-ReWrite branch, the symbol is mangled to __Z17resourceExhaustedP9_jvmtiEnvP7JNIEnv_iPKvPKc.

rafaelri commented 8 years ago

Missing the nomangle cpp directive? But why??? It was already there... anyways I will be online in a few minutes. Traffic jams are the rule in Brazil and when it rains things get really messy... Em 09/05/2016 06:58, "Glyn Normington" notifications@github.com escreveu:

@rafaelri https://github.com/rafaelri Ok, hope things work out. Meanwhile, I'm trying to understand why the PolarBear-ReWrite branch which includes your changes, fails to build with:

dlsym(0x7fdf0bc03380, resourceExhausted): symbol not found

in the ctests make target. At master, where this test passes, there is a _resourceExhausted symbol in the resultant libmjvmkill.so, but with the PolarBear-ReWrite branch, the symbol is mangled to __Z17resourceExhaustedP9_jvmtiEnvP7JNIEnv_iPKvPKc.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/cloudfoundry/jvmkill/pull/2#issuecomment-217823938

glyn commented 8 years ago

@rafaelri ok. I think agentcontroller.c++ might need a bit of rework too, e.g. to enable setSignal to be passed down (but only to the kill action?!).

glyn commented 8 years ago

@rafaelri BTW, I think we can avoid testing parameter parsing in the ctests make target now you have separated out parameter parsing into a separately tested C++ file. This will simplify the interface that agentcontroller.c++ needs to support.

glyn commented 8 years ago

@rafaelri I wonder if we should avoid exposing setSignal from the shared library altogether so that both actions can implement the same interface? But the C tests need to set the signal to something benign, so it might be necessary to add a parameter for configuring the signal that killaction drives. Or something like that. Essentially, it's a pain in the neck to downcast an action to a killaction and this becomes much worse when there are more than one action around.

rafaelri commented 8 years ago

@glyn I just checked out the PolarBear-ReWrite branch. I am also thinking whether we could avoid the tests.c altogether now since the rest is being testes separately. What do you think? We could let this be testes on the integrated tests (the one with the JVM itself). BTW, if you prefer (instead of chatting here) my skype and hangout accounts are the same as my github user, feel free to add.

rafaelri commented 8 years ago

@glyn killactionstests.c++ tests whether we send the signal and allows overriding so test can run fine. So, we have signal sending tested. Next thing I see is that agent controller mocks out killaction.c++ and hence we can assure that if in the final code we have the right KillAction class it will be called. I completely agree with you that we don't need these tests (tests.c) anymore.

rafaelri commented 8 years ago

@glyn to be sincere this was part of the reasoning behind this refactoring (removing this code from the jvmkill.c++)

rafaelri commented 8 years ago

@glyn returning to the histogram. I had the impression we could adopt a really simple approach but I just noticed it reports unreachable instances and doesn't report instance count. I'll post it here but disregard it completely: We could list the loaded classes using GetLoadedClasses and then call IterateThroughHeap for each class and store it in a list that contains a class with the following attributes: (class signature) [http://docs.oracle.com/javase/7/docs/platform/jvmti/jvmti.html#GetClassSignature] heap size returned from this callback. The last step would be to sort this class using a comparator based on the size field.

But as I said... it won't suffice.

glyn commented 8 years ago

@rafaelri I have meetings the rest of the day, so can't hangout or skype today unfortunately. I agree that we could reasonably drop the C tests.

As for the histogram, iterating through the whole heap should sometimes give a good approximation since presumably the garbage collector will have done a full GC before OOM is raised. However, the histogram function will currently also run when the JVM has run out of threads, so there could be lots of dead objects on the heap in that case.

Therefore, we should iterate over the live portion of the heap using a function such as IterateOverReachableObjects and set our own tag to detect when we are revisiting the same object (graph) and avoid counting the same object twice.

Please could you prioritise getting the build clean with the stubbed out histogram code (as at the tip of the PolarBear-ReWrite branch)? Once we have the structure right, we can implement the histogram function fairly easily I think.

On Mon, May 9, 2016 at 12:36 PM, Rafael de Albuquerque Ribeiro < notifications@github.com> wrote:

@glyn https://github.com/glyn returning to the histogram. I had the impression we could adopt a really simple approach but I just noticed it reports unreachable instances and doesn't report instance count. I'll post it here but disregard it completely: We could list the loaded classes using GetLoadedClasses http://docs.oracle.com/javase/7/docs/platform/jvmti/jvmti.html#GetLoadedClasses and then call IterateThroughHeap http://docs.oracle.com/javase/7/docs/platform/jvmti/jvmti.html#IterateThroughHeap for each class and store it in a list that contains a class with the following attributes: (class signature) [ http://docs.oracle.com/javase/7/docs/platform/jvmti/jvmti.html#GetClassSignature] heap size returned from this callback http://docs.oracle.com/javase/7/docs/platform/jvmti/jvmti.html#jvmtiHeapIterationCallback . The last step would be to sort this class using a comparator based on the size field.

But as I said... it won't suffice.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/cloudfoundry/jvmkill/pull/2#issuecomment-217841506

Regards, Glyn

rafaelri commented 8 years ago

@glyn ok! I'll focus on that!

2016-05-09 9:27 GMT-03:00 Glyn Normington notifications@github.com:

@rafaelri I have meetings the rest of the day, so can't hangout or skype today unfortunately. I agree that we could reasonably drop the C tests.

As for the histogram, iterating through the whole heap should sometimes give a good approximation since presumably the garbage collector will have done a full GC before OOM is raised. However, the histogram function will currently also run when the JVM has run out of threads, so there could be lots of dead objects on the heap in that case.

Therefore, we should iterate over the live portion of the heap using a function such as IterateOverReachableObjects and set our own tag to detect when we are revisiting the same object (graph) and avoid counting the same object twice.

Please could you prioritise getting the build clean with the stubbed out histogram code (as at the tip of the PolarBear-ReWrite branch)? Once we have the structure right, we can implement the histogram function fairly easily I think.

On Mon, May 9, 2016 at 12:36 PM, Rafael de Albuquerque Ribeiro < notifications@github.com> wrote:

@glyn https://github.com/glyn returning to the histogram. I had the impression we could adopt a really simple approach but I just noticed it reports unreachable instances and doesn't report instance count. I'll post it here but disregard it completely: We could list the loaded classes using GetLoadedClasses < http://docs.oracle.com/javase/7/docs/platform/jvmti/jvmti.html#GetLoadedClasses

and then call IterateThroughHeap < http://docs.oracle.com/javase/7/docs/platform/jvmti/jvmti.html#IterateThroughHeap

for each class and store it in a list that contains a class with the following attributes: (class signature) [

http://docs.oracle.com/javase/7/docs/platform/jvmti/jvmti.html#GetClassSignature ] heap size returned from this callback < http://docs.oracle.com/javase/7/docs/platform/jvmti/jvmti.html#jvmtiHeapIterationCallback

. The last step would be to sort this class using a comparator based on the size field.

But as I said... it won't suffice.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/cloudfoundry/jvmkill/pull/2#issuecomment-217841506

Regards, Glyn

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/cloudfoundry/jvmkill/pull/2#issuecomment-217850629

rafaelri commented 8 years ago

@glyn commited the code without the tests.c in a branch on my account (PolarBear-ReWrite). I'll attempt to implement the histogram using IterateOverReachableObjects as you suggested.

glyn commented 8 years ago

Thanks @rafaelri. Your branch builds cleanly on Mac. I'll be going back to this on Monday, so please push any histogram code you manage to write before then, if you would like me to take a look or add to it. I'm not quite sure of the semantics of tags, but picking a "random" tag each time you need to iterate over the heap is probably not a bad starting point.

On Thu, May 12, 2016 at 12:15 PM, Rafael de Albuquerque Ribeiro < notifications@github.com> wrote:

@glyn https://github.com/glyn commited the code without the tests.c in a branch on my account (PolarBear-ReWrite). I'll attempt to implement the histogram using IterateOverReachableObjects as you suggested.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/cloudfoundry/jvmkill/pull/2#issuecomment-218728397

Regards, Glyn

rafaelri commented 8 years ago

@glyn me too. I've just spent some time reading the documentation around tags but I am still unsure on how to use them. I'll let you know if I find something relevant over here.

glyn commented 8 years ago

@rafaelri I've sent you a skype contact request so we can chat about tags.

glyn commented 8 years ago

Closing out now the heap histogram function has been integrated into master. Thanks @rafaelri for your work on this!