NodeServer AutoRestart discussion

Einstein42 commented 7 years ago

I've seen a few random crashes in various nodeservers that should be automatically restarted. Now the problem is we don't want to restart it so many times it floods everything on a legitimate error, so we need some sort of mechanism to do so. @mjwesterhof does the Pulse server already do this? I need to get some kind of information from you on how that mechanism works so I can document it.

mjwesterhof commented 7 years ago

The pulse node server does nothing that would help for this. It's an interesting idea, though. Here's some thoughts on this...

I can see one huge argument against this, first off -- that's the "Purist's Argument": this is a feature that would only benefit "buggy code" and the correct solution is for each node server to fix its own code. The counter argument to that is that in the real world, stuff happens. And more importantly, stuff can happen in dependencies that the node server author has no ability to fix or control.

Secondly, one might argue that there's already such a feature: the heartbeat (ping/pong messages). I'd counter that by noting that the heartbeat as implemented is insufficient in many cases.

Assuming that one agrees we should add code to do this, and that the existing heartbeat mechanism is inadequate, the next issue is what we need to do instead of or in addition to the heartbeat? Clearly we need to handle the exiting of the node server (the heartbeat mechanism handles only a hanging node server). That would imply we need to add another mechanism, though, to signal a legitimate exit -- so another message to the node server manager to say "I'm exiting for all the right reasons, please don't restart me", and another that says "I'm exiting to restart myself, please do so but don't log this as a failure". And if we don't get one of the aforementioned messages before the node server process exits, then polyglot would log the message and restart it.

So, now we'd need limits on the restart. It seems reasonable to only restart a certain number of times (call it 3 for argument's sake). However, since a node server is an "always running" server, and the node server manager may run for months (as long as the host OS isn't Windows, of course :-) ), we probably need a time element on that. So perhaps we say 3 times in the past hour.

The above might be sufficient - but we also might find it useful to add logic to handle badly-behaved node servers that are struggling but actually doing something... this is complex. And I'm not sure I support this, because of that. But conceptually, the node server manager would keep track of the inbound and outbound messages to the ISY (which is all it can see), and using some sort of heuristic based on that, it might keep restarting a node server that fails more often than the aforementioned timeout mechanism allows. The problem is selecting that heuristic.

Footnote: I fear this. Since I've started working on and with Polyglot, I've added more and more features and code and special cases. All of it, in isolation, seemed like a good idea at the time. As I reflect, I feel that Polyglot has become too big and too unwieldy, and I fear I've just added bloat to it with some of the code. And an auto-restart seems like a good idea, but again, it just makes it more complicated and bigger... I'm feeling that we're heading in the wrong direction. (This is not to say that I've a better idea - just to share that I'm not sure going deeper down this path is the right way!)

mkohanim commented 7 years ago

Hi Mike,

Thanks so very much for the feedback. I would like to discuss your foot note: if the whole community agrees that we should start fresh and, specifically, add support for Micro Python, then I am all for it. We have big plans for Polyglot and cannot start with buggy and bloated code. The main questions are:

Whether or not everyone agrees?
Are there any other alternatives?
If we start with this approach, can we also address node servers exiting gracefully vs. heartbeat vs. buggy code?

With kind regards,

Michel Kohanim CEO

(p) 818.631.0333 (f) 818.436.0702 http://www.universal-devices.com http://www.universal-devices.com/

From: Mike Westerhof [mailto:notifications@github.com] Sent: Sunday, January 22, 2017 12:44 PM To: UniversalDevicesInc/Polyglot Polyglot@noreply.github.com Cc: Subscribed subscribed@noreply.github.com Subject: Re: [UniversalDevicesInc/Polyglot] NodeServer AutoRestart discussion (#80)

The pulse node server does nothing that would help for this. It's an interesting idea, though. Here's some thoughts on this...

I can see one huge argument against this, first off -- that's the "Purist's Argument": this is a feature that would only benefit "buggy code" and the correct solution is for each node server to fix its own code. The counter argument to that is that in the real world, stuff happens. And more importantly, stuff can happen in dependencies that the node server author has no ability to fix or control.

Secondly, one might argue that there's already such a feature: the heartbeat (ping/pong messages). I'd counter that by noting that the heartbeat as implemented is insufficient in many cases.

Assuming that one agrees we should add code to do this, and that the existing heartbeat mechanism is inadequate, the next issue is what we need to do instead of or in addition to the heartbeat? Clearly we need to handle the exiting of the node server (the heartbeat mechanism handles only a hanging node server). That would imply we need to add another mechanism, though, to signal a legitimate exit -- so another message to the node server manager to say "I'm exiting for all the right reasons, please don't restart me", and another that says "I'm exiting to restart myself, please do so but don't log this as a failure". And if we don't get one of the aforementioned messages before the node server process exits, then polyglot would log the message and restart it.

So, now we'd need limits on the restart. It seems reasonable to only restart a certain number of times (call it 3 for argument's sake). However, since a node server is an "always running" server, and the node server manager may run for months (as long as the host OS isn't Windows, of course :-) ), we probably need a time element on that. So perhaps we say 3 times in the past hour.

The above might be sufficient - but we also might find it useful to add logic to handle badly-behaved node servers that are struggling but actually doing something... this is complex. And I'm not sure I support this, because of that. But conceptually, the node server manager would keep track of the inbound and outbound messages to the ISY (which is all it can see), and using some sort of heuristic based on that, it might keep restarting a node server that fails more often than the aforementioned timeout mechanism allows. The problem is selecting that heuristic.

Footnote: I fear this. Since I've started working on and with Polyglot, I've added more and more features and code and special cases. All of it, in isolation, seemed like a good idea at the time. As I reflect, I feel that Polyglot has become too big and too unwieldy, and I fear I've just added bloat to it with some of the code. And an auto-restart seems like a good idea, but again, it just makes it more complicated and bigger... I'm feeling that we're heading in the wrong direction. (This is not to say that I've a better idea - just to share that I'm not sure going deeper down this path is the right way!)

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://github.com/UniversalDevicesInc/Polyglot/issues/80#issuecomment-274357913, or mute the threadhttps://github.com/notifications/unsubscribe-auth/ANlUvdhV7ggsxEeSBlgABdOuXbsllw2Mks5rU7-VgaJpZM4Lk6Gp.

Einstein42 commented 7 years ago

Well that certainly took a turn I didn't expect. My thoughts on code bloat are that the interface is excessive for what it needs to do, I asked @i814u2 to jump in on that topic and clean/simplify the web interface while leaving the API and what not intact. This will reduce the size overall by about 30-40% depending on what he comes up with.

I think using the existing ping structure to restart the server 3 times in maybe an hour if it hangs is acceptable. Other than that they need to fix their code. If it exits gracefully, then it obviously meant to do that, so leave it alone.

As far as dependency failures, I've contemplated that and would like to do some kind of static package model per nodeserver that includes the python modules locally or require pip requirements.txt == instead of => to avoid those scenarios. That's probably another topic however.

I understand you added a few things for the pulse server, timers, and checks and what not. Did you add anything else you would consider bloat? I don't believe going back to scratch would be necessary, but if you have anything specific I'm open to hear what we should thin out. The MQTT subsystem doesn't run or instantiate unless someone is using it, so I wouldn't consider that bloat either. All ears.

Thanks for your thoughts!

-E

jimboca commented 7 years ago

Personally I don't feel that Polyglot itself is bloated, but I may be missing where that is comment is directed. I do agree that the web interface has a lot of data, but the use of Polymer would make adding new features easier in the future. Personally I like the look and usability of the web interface, but I seem to be in the minority. It would be great if it could be extended to help with building the node servers configurations instead of users having to manually edit the yaml. It seems the Polymer would make this easier, but my web development experience is many years old so I am not up to date on all the latest. I know that @rmkraus has a lot more experience with this and probably had very good reasons to use this library, so would be nice to hear from him before things were changed to something simpler that would make it more work to extend in the future.

I agree with @Einstein42 that when a node server exits cleanly, it can be ignored. But when it aborts do something about it, like restart and send an alarm status back to the ISY, also the same for a hung server. Can this alarm be handled by a standard node that each server is required to create, or possibly be managed by the Pulse server and all users would be requested to install that one?

Jim

rmkraus commented 7 years ago

Just weighing in quickly as this should probably be a different discussion:

for optimizing the front end: https://www.polymer-project.org/1.0/docs/tools/optimize-for-production

That should compile the whole front end down to one rather small file.

If you wanted to rewrite the front end, there are a few libraries that are just as good. Everyone has a preference, mine is Polymer. A decent alternative would be React. But that's mostly preference.

If you just want the front end to load faster and take up less space in the repository, you could split out the front end to its own git repo and only include the vulcanized file in this repo.

Sorry for the aside as I know this has nothing to do with this discussion.

Einstein42 commented 7 years ago

Thanks Ryan. That's good to know. @i814u2 might be able to expound on that a little more.

mkohanim commented 7 years ago

Thanks everyone for the feedback.

With kind regards,

Michel Kohanim CEO

(p) 818.631.0333 (f) 818.436.0702 http://www.universal-devices.com http://www.universal-devices.com/

From: James Milne [mailto:notifications@github.com] Sent: Sunday, January 22, 2017 6:02 PM To: UniversalDevicesInc/Polyglot Polyglot@noreply.github.com Cc: Michel Kohanim michel@universal-devices.com; Comment comment@noreply.github.com Subject: Re: [UniversalDevicesInc/Polyglot] NodeServer AutoRestart discussion (#80)

Thanks Ryan. That's good to know. @i814u2https://github.com/i814u2 might be able to expound on that a little more.

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://github.com/UniversalDevicesInc/Polyglot/issues/80#issuecomment-274380449, or mute the threadhttps://github.com/notifications/unsubscribe-auth/ANlUvdN2LwPqHFqNnEdfj7d-Qn--w7Auks5rVAohgaJpZM4Lk6Gp.

Einstein42 commented 7 years ago

Just a reminder, the slack channels would be great locations for this discussion. If anyone is not on them please send me your email addresses.

mkohanim commented 7 years ago

Hi James, I am not … I don’t even know what it is!

With kind regards,

Michel Kohanim CEO

(p) 818.631.0333 (f) 818.436.0702 http://www.universal-devices.com http://www.universal-devices.com/

From: James Milne [mailto:notifications@github.com] Sent: Sunday, January 22, 2017 6:22 PM To: UniversalDevicesInc/Polyglot Polyglot@noreply.github.com Cc: Michel Kohanim michel@universal-devices.com; Comment comment@noreply.github.com Subject: Re: [UniversalDevicesInc/Polyglot] NodeServer AutoRestart discussion (#80)

Just a reminder, the slack channels would be great locations for this discussion. If anyone is not on them please send me your email addresses.

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://github.com/UniversalDevicesInc/Polyglot/issues/80#issuecomment-274382740, or mute the threadhttps://github.com/notifications/unsubscribe-auth/ANlUvXF0TJSTzmq9Cddt39F920kxpZwhks5rVA7bgaJpZM4Lk6Gp.

Einstein42 commented 7 years ago

Michel, sorry for the delay. We created the slack (team messaging/channels/collaboration) channels a few months ago. I believe you signed up for a free account, as I added you. Try this link I sent you an invite in November to your email address.

Einstein42 commented 7 years ago

So I know this isn't the place for the sluggish UI behavior, but I think I found the culprit. Polyglot was causing the nodeservers to use a boatload of CPU. Pull the latest development build and see if it makes a difference for you please.

jimboca commented 7 years ago

@Einstein42 Yes, it is still sluggish but seems much better than before.

Einstein42 commented 7 years ago

Ok. Worked on this in conjunction with a speed related issue I was having in the LiFX nodeserver. I got the reponse time from ~2000 ms delay in the interface down to roughly ~100. My interface is speedy as all getout now. Let me know how it works for you.

jimboca commented 7 years ago

@Einstein42 Nice. I've updated and restarted, I'll let you know if I have any issues. UI does seem even a little better now.

UniversalDevicesInc / Polyglot

NodeServer AutoRestart discussion #80