Open dneise opened 7 years ago
from the logs (removed private information:
2017-10-13 05:50:09,844 - shifthelper.conditions - DEBUG - is_nobody_awake -> True
2017-10-13 05:50:09,907 - shifthelper.checks - DEBUG - Giving message status WARNING
2017-10-13 05:50:09,969 - shifthelper.notifiers - DEBUG - Got a message
2017-10-13 05:50:09,969 - shifthelper.notifiers - DEBUG - Message is over alert level
2017-10-13 05:50:09,969 - shifthelper.notifiers - DEBUG - Getting phone number of primary shifter
2017-10-13 05:50:09,990 - shifthelper.tools.shift - DEBUG - Found shifter <shifter>
2017-10-13 05:50:09,991 - shifthelper.notifiers - DEBUG - Getting phone number of fallback shifter
2017-10-13 05:50:09,991 - shifthelper.notifiers - INFO - Calling +*******83
2017-10-13 05:50:09,991 - custos.notify.twilio - DEBUG - Received message
2017-10-13 05:50:09,992 - custos.notify.twilio - DEBUG - Placing call
2017-10-13 05:50:10,902 - shifthelper.notifiers - INFO - Calling +********34
2017-10-13 05:50:10,903 - custos.notify.twilio - DEBUG - Received message
2017-10-13 05:50:10,903 - custos.notify.twilio - DEBUG - Placing call
2017-10-13 05:50:11,568 - shifthelper.tools.shift - DEBUG - Found shifter <shifter>
2017-10-13 05:50:11,947 - custos.notify.log - WARNING - Check: IsUserAwakeBeforeShutdown - Message: There is a shift at the moment and
20min before shutdown and
Parker not Awake
This is the part of code, responsible for calling the fallback:
if self._get_oldest_call_age() >= self.time_before_fallback:
log.debug('Getting phone number of fallback shifter')
numbers_to_call.append(self.phone_number_of_fallback_shifter())
This self._get_oldest_call_age()
looks at a list of calls and just finds the oldest call. Calls are removed from this list, if they are too old and if they were already acknowledged.
I found a bug in the function removing old and acknowledged calls:
def _remove_acknowledged_and_old_calls(self):
""" from the list of not acknowledged calls
remove all calls, which have been acknowledged on the web page
Also remove calls older than 2 hours, to get out of
a "call the backup shifter" dead lock
"""
alerts = {a['uuid']: a for a in get_alerts()}
if not alerts:
return
for msg in copy(self.not_acknowledged_messages):
age = datetime.datetime.utcnow() - msg.timestamp
if age > (self.max_time_for_fallback + self.time_before_fallback):
self.not_acknowledged_messages.remove(msg)
else:
try:
alert = alerts[str(msg.uuid)]
except KeyError:
continue
if alert['acknowledged'] is True:
self.not_acknowledged_messages.remove(msg)
It says: When you cannot get any alerts from the website ... return. So it does not even try to remove too-old alerts, when it cannot find out what was acknoledged. This function is doing to independent things... and if the one thing is impossible, it does not even try the other idenpendent thing. That is of course a bug.
I think, I might have recently introduced this bug .. let me see.
Ah the reason, why it showed up today was:
Normally there is at least one alert in the list of alerts ... the dummy-alert from the start-up checklist. But today this dummy alert was not in the list. Either the shifter did not try it. Or it did not work. Either way ... the shifter should at least have it reported in the log-book. If not having called one of the SH expert in the evening, because if the dummy alert is not working, this is serious. ...
Well ... this error (not doing the dummy-alert) ... made us realize this bug. ... so it was a good thing in the end.
The fix is easy:
instead of
alerts = {a['uuid']: a for a in get_alerts()}
if not alerts:
return
we just need:
alerts = {a['uuid']: a for a in get_alerts()}
alerts will be an empty dict now and the code below is fine with an empty set of alerts...
We are running the shifthelper_heartbeat branch at the moment and indeed I edited this part ... but the same behaviour (returning, instead of carrying on with an empty set of alerts) was already there before my edit .. so it wasnt me :-D
But I should have seen that (embarrassed)
This morning, SH called 20min before shutdown, because nobody was awake. so far so good. But it called the fallback immediately and not after 15 minutes. I do not know why.