Closed JonasGroeger closed 7 years ago
Hi @JonasGroeger, i have seen before weird behaviours but I was never able to reproduce them systematically and debug it. cousteau is using a 3rd party lib socketIO-client for socketio communication, so I am not sure what could go wrong. If you have though some steps for me to reproduce it I will be happy to debug it for you.
@astrikos I'll try to provide a minimal example:
#!/usr/bin/env python3
import json
from ripe.atlas import cousteau
def print_it(result):
print(json.dumps(result))
def main():
atlas_stream = cousteau.AtlasStream()
atlas_stream.connect()
atlas_stream.bind_channel("atlas_result", print_it)
atlas_stream.start_stream(
stream_type="result",
msm=4412911,
)
try:
atlas_stream.timeout()
except KeyboardInterrupt:
atlas_stream.disconnect()
if __name__ == '__main__':
main()
Unfortunately it can take up to 3 days until the faulty behaviour appears. Meanwhile, if you could think of a workaround for this, I'm happy to help.
@astrikos Is there at least a way how to detect this? Like a callback for all data received where I can see how much data was sent to me in a period of time. If some limit is reached, I'll restart the script myself.
@JonasGroeger I have to recheck lib's code to see if there is a way. In the meantime what you can do is enabling more verbose logging. Try the following in the beginning of the script:
import logging
logging.getLogger('socketIO-client').setLevel(logging.DEBUG)
logging.basicConfig()
Let me know if you see any debug output that would be helpful
I'm currently using another workaround: I kill the script whenever I don't get updates in a certain time period.
class StallCounter(object):
def __init__(self, timeframe, min_count):
"""
Counter that counts if at least min_count increments were received in the last timeframe.
:param timeframe:
:param min_count:
"""
self.counter = 0
self.counter_start = int(time.time())
self.timeframe = timeframe
self.min_count = min_count
def increment(self, count=1):
self.counter += count
def _reset_window(self):
self.counter_start = int(time.time())
def is_stalling(self):
now = int(time.time())
time_eleapsed = now - self.counter_start
if time_eleapsed < self.timeframe:
# Still waiting
return False
# Reset window
self._reset_window()
if self.counter >= self.min_count:
# All OK, we received enough
self.counter = 0
return False
return True
def reset(self):
self.counter = 0
self.counter_start = int(time.time())
stall_counter = StallCounter(timeframe=180, min_count=1)
# Gets called on EVERY message
atlas_stream.bind_channel("message", exit_if_stream_is_stalling)
def exit_if_stream_is_stalling(data):
from socketIO_client import parse_socketIO_packet_data
data = parse_socketIO_packet_data(data)
# Kill ourselves if we are stalled
if stall_counter.is_stalling():
print("Exit due to stalling: {} packages in the last {} seconds.".format(stall_counter.counter,
stall_counter.timeframe))
sys.exit(1)
try:
if data.args[0] in ['atlas_result']:
stall_counter.reset()
else:
stall_counter.increment()
except (KeyError, TypeError):
stall_counter.increment()
Yes, we're dropping the last message. I'm relying on systemd to restart myself.
can you maybe give me your stream parameters so i can try it myself and leave it running with verbose logging? maybe i find something useful
Hi @JonasGroeger, I think we were able to debug it yesterday. It seems that when you disconnect (network hickups, etc) the 3rd party library will try to reconnect but it will not subscribe again to the stream it was subscribed before disconnections. So it's no surprise you don't get anything back. Hopefully from what I see in docs of library there is a way to handle that by doing something like:
class Namespace(BaseNamespace):
def on_connect(self):
subscribe_to_my_atlas_stream()
print('[Connected]')
def on_disconnect(self):
print('[DisConnected]')
def on_reconnect(self):
print('[ReConnected]')
re_subscribe_to_my_atlas_stream()
io = SocketIO(
host="atlas-stream.ripe.net",
port=80,
resource="/stream/socket.io",
transports=["websocket"],
Namespace=Namespace
)
io.wait()
I will try to integrate that logic today at cousteau and then I hope you can give it a try see if it fixes problem for you.
Pull request #44 solves this issue.
Hey!
Under normal operation, cousteau works fine. After some hours of running however, it does not receive any data any more. It only receives what looks similar to heartbeats or the like:
Faulty Operation
I get similar system calls when its under normal operation but with data inbetween:
Normal Operation
Restarting the script resolves the issue immediately.
The code (nothing fancy) goes something like this:
It clearly is not the Atlas' problem, since restarting fixes the issue. If the Atlas were broken, restarting would not help.