NetComposer / nklib

NetComposer common library functions
Apache License 2.0

If the nklib_proc registry crashes, it must kill all client processes that have called 'put'. #2

Closed (vjache closed 7 years ago)

vjache commented 9 years ago

Hi,

I have run into an interesting case. First of all, I must say that I cannot reproduce this situation anymore. I put my notebook into hibernation (with nkcluster running) for a long time (~12 hours), and when I turned it on again, I discovered that the 'nklib_proc' process had died and been restarted for some reason. Sadly, I could not figure out why the process crashed; the logs were empty. My conclusion that the process was restarted is based on the fact that its pid had a much bigger number than those of its neighbors under the supervisor. The effect on the system was that some process restarted constantly, because its call to 'nkcluster_agent:get_status' failed when 'nklib_proc:values' returned '[]'.

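So, if we accept that there is some bug in nklib_proc that shows up only under rare circumstances, then the system will enter a corrupted state whenever nklib_proc crashes: the restarted registry is empty, while the clients that registered earlier keep running as if their entries still existed. In such a case, nklib_proc must cause a crash of all registered client processes (those that called 'put'), so that their supervisors restart them and they register again. This may be achieved by using trap_exit=true + link/1 instead of monitor/2.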
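To make the idea concrete, here is a minimal sketch of a registry that links to its clients instead of monitoring them. It is not the nklib_proc code and not the code from the pull request: the module name linked_registry and the function store/2 (standing in for 'put') are hypothetical and exist only for this illustration.

```erlang
%% Hypothetical illustration, NOT the real nklib_proc implementation.
-module(linked_registry).
-behaviour(gen_server).

-export([start/0, store/2, values/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

%% Started without a link to the caller, to keep the example simple.
start() ->
    gen_server:start({local, ?MODULE}, ?MODULE, [], []).

%% store/2 plays the role of 'put': it registers Value under Key
%% for the calling process.
store(Key, Value) ->
    gen_server:call(?MODULE, {store, Key, Value, self()}).

values(Key) ->
    gen_server:call(?MODULE, {values, Key}).

init([]) ->
    %% Trap exits so that a dying client arrives as an 'EXIT' message
    %% instead of taking the registry down with it.
    process_flag(trap_exit, true),
    {ok, #{}}.

handle_call({store, Key, Value, Pid}, _From, State) ->
    %% link/1 instead of monitor/2: if the registry later crashes,
    %% the exit signal propagates to every registered client.
    link(Pid),
    Entries = maps:get(Key, State, []),
    {reply, ok, State#{Key => [{Value, Pid} | Entries]}};
handle_call({values, Key}, _From, State) ->
    {reply, maps:get(Key, State, []), State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% A linked client died: drop all of its entries.
handle_info({'EXIT', Pid, _Reason}, State) ->
    Clean = maps:map(
              fun(_Key, Entries) ->
                      [E || {_Value, P} = E <- Entries, P =/= Pid]
              end, State),
    {noreply, Clean};
handle_info(_Info, State) ->
    {noreply, State}.
```

With this layout, an abnormal exit of the registry also terminates every client that does not trap exits itself, so a supervisor can restart the clients and they register again. Two caveats: a client that sets trap_exit=true only receives an 'EXIT' message and has to handle it explicitly, and an exit of the registry with reason 'normal' does not kill linked clients.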

For details, see the pull request that follows this message.