fact-project / shifthelper

So we can sleep at night.

Shepherd #231

Closed dneise closed 7 years ago

dneise commented 7 years ago

Do not merge

I should have opened an issue to discuss this first and only then written code ... but I wrote code and opened the PR. Well, I guess we can simply discuss it here...

Adrian wished to have a kind of Shepherd (German: "Hüter") running on a host different from the shifthelper host and also not on La Palma.

The Shepherd should check if the shifthelper is running. To be precise we only want to check if the shifthelper is actually online as in:

We do not want to check if all the threads inside the shifthelper are running and doing their job properly, as this would simply be impossible.

So this "shepherd" simply checks if it can see the shifthelper web interface ... if that is the case, we are happy.
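Just to make the idea concrete, here is a minimal sketch of such a check, assuming the web interface answers plain HTTP GET requests. The function name `is_online`, the 10 second timeout, and the injectable `opener` argument are all made up for illustration and are not taken from the code in this PR.

```python
import urllib.error
import urllib.request


def is_online(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return True if `url` answers with an HTTP 2xx status within `timeout` seconds.

    `opener` is a hypothetical hook so the check can be tried without a network.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Refused connections, DNS failures, timeouts and HTTP error codes
        # all count as "shifthelper is not visible".
        return False
```

The shepherd would call something like `is_online("https://shifthelper.example.org/")` in a loop (placeholder URL) and alert the developer after a few consecutive failures.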

Still missing

In case the shepherd dies ... the shifthelper can go offline without anybody noticing. So the shifthelper should get an additional check to see if the shepherd is still running (I guess here we are also fine as long as we can verify that the shepherd is still online).

This "shepherd" check is still missing.

I would like to be able to check for the existence of the shepherd without using a URL or IP or so. So in the rare case where the shifthelper is all fine, but the shepherd is down and the developer gets called ... I would like to be able to simply spawn a new shepherd somewhere in the world ... not necessarily on the original shepherd's node.

I think this would be possible by implementing a dead-mans-grip in the web interface. All the shepherd has to do is request a certain URL every, idk, 10 minutes (with auth, I guess ... otherwise Google can do that for us :-)

The web interface notes the last time the dead-mans-grip was requested. The "shepherd-check" can now request that time in order to learn if the shepherd is still alive.
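The bookkeeping on the web interface side could look roughly like this. It is only a sketch: the class name `DeadMansGrip`, its methods, and the 20 minute `max_age` (two missed 10-minute presses) are assumptions, and the HTTP route plus the auth around it are left out.

```python
import time


class DeadMansGrip:
    """Remembers when the grip URL was last requested."""

    def __init__(self, clock=time.time):
        self._clock = clock          # injectable so the logic can be tested
        self._last_pressed = None    # nobody has pressed the grip yet

    def press(self):
        """To be called by the web interface when a shepherd requests the grip URL."""
        self._last_pressed = self._clock()

    def last_pressed(self):
        """Timestamp of the last press, or None if there was none."""
        return self._last_pressed

    def is_held(self, max_age=20 * 60):
        """True if the grip was pressed within the last `max_age` seconds."""
        if self._last_pressed is None:
            return False
        return self._clock() - self._last_pressed <= max_age
```

The "shepherd-check" inside the shifthelper would then only need `is_held()` (or the raw timestamp) and never has to know where the shepherd is running.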

Now comes the bonus. We can spawn as many shepherds as we like. As long as at least one shepherd is alive, no developer is getting called just for a dead shepherd. However ... in case the shifthelper dies ... many shepherds might call the developer.

And ... oh, now it gets scary ... what if the developer is not able to reanimate the shifthelper or the shifthelper node itself due to a power cut?

In that case of course the developer wakes up the shifter ... but he also needs to calm down all the blaring shepherds ... remembering where in the world they are running and trying to log in to all those machines.

At this moment, I get the feeling it would really be better not to have a shifthelper and a shepherd, but rather many shifthelpers, each with a dead-mans-grip.

As long as their dead-mans-grips are being pressed (requested) ... they sleep. As soon as the grip is released, they wake up, start checking the telescope, and start to call shifters if needed. They also press the grips of any other shifthelper they know.
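A rough sketch of that sleep/wake logic, with everything in it being hypothetical: the class `StandbyShifthelper`, the `start_active` flag for the one instance that begins on duty, and the 20 minute `grip_timeout`. Pressing a peer's grip is a plain method call here; in reality it would be an authenticated request to the peer's web interface.

```python
import time


class StandbyShifthelper:
    """Sleeps while its grip is being pressed, takes over once it is released."""

    def __init__(self, name, grip_timeout=20 * 60, clock=time.monotonic,
                 start_active=False):
        self.name = name
        self.grip_timeout = grip_timeout
        self._clock = clock
        # None means "grip never pressed", i.e. this instance is on duty.
        self._last_pressed = None if start_active else clock()
        self.peers = []

    def press_grip(self):
        """Called by another instance to keep this one asleep."""
        self._last_pressed = self._clock()

    def is_active(self):
        if self._last_pressed is None:
            return True
        return self._clock() - self._last_pressed > self.grip_timeout

    def step(self, check_telescope):
        """One loop iteration; returns True if this instance did the checks."""
        if not self.is_active():
            return False             # somebody else is on duty, keep sleeping
        check_telescope()            # run the checks, call shifters if needed
        for peer in self.peers:
            peer.press_grip()        # keep all known peers asleep
        return True
```

With two instances that list each other as peers, the second one stays asleep as long as the first keeps calling `step()`, and takes over once `grip_timeout` has passed without a press.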

This actually sounds more reliable than having two different processes.

I have read about a similar approach used for high-availability MongoDB servers. So maybe we don't have to write the code for this ourselves but can just use it.

However, this approach also has a problem. If we have N places in the world where SH instances might be running, each of course with its own web interface, how does the shifter know which SH instance is currently active, i.e. where to go to press the "ACK" button?

We would need some kind of DNS failover here... and I personally don't know anything about this.


So back to the start. One shepherd and one SH instance ... and when SH dies ... shepherd calls me, and I have to wake up the shifter and kill the shepherd in order to have peace again. If shepherd dies ... shifthelper will call me, so I can either spawn a new shepherd somewhere or wake up the shifter and kill SH in order to have peace.

Since all of this happens very very rarely (I hope) it is probably fine.

dneise commented 7 years ago

I would like to check the shifthelper, not the webinterface. But that would require more work I guess.

I wanted to catch power loss and network failure ... therefore checking the web interface seemed fine. And yes ... checking the shifthelper is somewhat more work.

However you are right, checking shifthelper is better and better is always better ;-)

dneise commented 7 years ago

Why not use custos for the Check?

Because I find it hard to use. But it's definitely the thing to do. I somehow didn't see it in the beginning ... now I clearly see that custos should be used here.

maxnoe commented 7 years ago

Because it's hard to use

That's a shame. If you think so, why didn't you open an issue for this core project of the shifthelper? If it is hard to use for such a simple example, we failed.