CroatianMeteorNetwork / RMS

RPi Meteor Station
https://globalmeteornetwork.org/
GNU General Public License v3.0
181 stars 50 forks source link

Logger - socket exception #428

Open g7gpr opened 1 month ago

g7gpr commented 1 month ago

https://github.com/CroatianMeteorNetwork/RMS/blob/4ab05b037ee5ada27cd7df1797e812b7d47b3a53/RMS/Logger.py#L44

On prerelease Pi4, seeing sporadic messages referring to this line.

2024/09/25 09:52:11-INFO-Logger-line:44 - Socket exception: Connection reset by peer (104)

I don't understand what is happening here, apologies for this not being a very helpful submission.

Cybis320 commented 1 month ago

I think the log is doing its job and capturing some sort of error. Any chance you could share the whole log? Is it associated with a video stream disconnect or something?

g7gpr commented 1 month ago
2024/09/25 09:10:15-INFO-EventMonitor-line:2144 - Next EventMonitor run : 01:40:15 UTC; 30.0 minutes from now
2024/09/25 09:10:15-INFO-EventMonitor-line:2148 - Next Capture start    : 10:38:49 UTC
2024/09/25 09:40:16-INFO-EventMonitor-line:2144 - Next EventMonitor run : 02:10:16 UTC; 30.0 minutes from now
2024/09/25 09:40:16-INFO-EventMonitor-line:2148 - Next Capture start    : 10:38:49 UTC
2024/09/25 09:52:11-INFO-Logger-line:44 - Socket exception: Connection reset by peer (104)
2024/09/25 10:10:18-INFO-EventMonitor-line:2144 - Next EventMonitor run : 02:40:18 UTC; 30.0 minutes from now
2024/09/25 10:10:18-INFO-EventMonitor-line:2148 - Next Capture start    : 10:38:49 UTC
2024/09/25 10:40:20-INFO-EventMonitor-line:2144 - Next EventMonitor run : 03:10:20 UTC; 30.0 minutes from now
2024/09/25 10:40:20-INFO-EventMonitor-line:2148 - Next Capture start    : 10:38:49 UTC

Not capturing at the moment, unfortunately I can't download the whole log at the moment

Cybis320 commented 1 month ago

oh, interesting. It's no even capturing...

Cybis320 commented 1 month ago

It could be this bit: https://github.com/CroatianMeteorNetwork/RMS/blob/4ab05b037ee5ada27cd7df1797e812b7d47b3a53/RMS/Formats/ObservationSummary.py#L194-L217

Maybe, wrap it in a try-except block and add a timeout?

def timestampFromNTP(addr='0.us.pool.ntp.org'):
    REF_TIME_1970 = 2208988800  # Reference time
    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    # Set the socket timeout (e.g., 5 seconds)
    client.settimeout(5)

    try:
        data = b'\x1b' + 47 * b'\0'
        client.sendto(data, (addr, 123))
        data, address = client.recvfrom(1024)
        if data:
            t = struct.unpack('!12I', data)[10]
            t -= REF_TIME_1970
            return t
    except socket.error as e:
        print("Socket error: {}".format(e))
        return None
    finally:
        client.close()
g7gpr commented 1 month ago

Unlikely, that code has no business running at this stage.

Cybis320 commented 1 month ago

Yes, you are right. I see that now. Sorry.

Maybe something around her?: https://github.com/CroatianMeteorNetwork/RMS/blob/4ab05b037ee5ada27cd7df1797e812b7d47b3a53/RMS/EventMonitor.py#L1292

g7gpr commented 1 month ago

That code executes at the start of an EventMonitor run, which is not yet due.

Cybis320 commented 1 month ago

So the error is occurring while the program is in this loop: https://github.com/CroatianMeteorNetwork/RMS/blob/4ab05b037ee5ada27cd7df1797e812b7d47b3a53/RMS/EventMonitor.py#L2127

Which calls this function: https://github.com/CroatianMeteorNetwork/RMS/blob/4ab05b037ee5ada27cd7df1797e812b7d47b3a53/RMS/EventMonitor.py#L2084

Which calls: https://github.com/CroatianMeteorNetwork/RMS/blob/4ab05b037ee5ada27cd7df1797e812b7d47b3a53/RMS/EventMonitor.py#L1292

Which runs: https://github.com/CroatianMeteorNetwork/RMS/blob/4ab05b037ee5ada27cd7df1797e812b7d47b3a53/RMS/EventMonitor.py#L1310

Maybe warp it in a with and catch socket error?:

 try:
    if sys.version_info[0] < 3:
        with urllib2.urlopen(self.syscon.event_monitor_webpage) as response:
            web_page = response.read().splitlines()
    else:
        with urllib.request.urlopen(self.syscon.event_monitor_webpage) as response:
            web_page = response.read().decode("utf-8").splitlines()
except (socket.error, urllib2.URLError, urllib.error.URLError) as e:
    log.error("Socket or URL error in getEventsfromWebPage: {}".format(e))
    return events
except Exception as e:
    log.error("Unexpected error in getEventsfromWebPage: {}".format(e))
    return events
g7gpr commented 1 month ago

Error was raised at 09:52:11, that code ran successfully at 09:40:15 and was not scheduled to run again until 10:10:18. Why do you think it is that code that had the exception?

Cybis320 commented 1 month ago

A likely cause of the error is that a socket is being opened but not properly closed by the client. If the client fails to close the socket after completing its operation, the server may eventually close the connection after a period of inactivity, which is what the “Connection reset by peer” message suggests. This typically happens when the server forcefully terminates an idle or incomplete connection.

It’s also important to note that there can be a delay between when the socket is opened and when the server closes it. The message may not appear immediately after the socket is opened but rather after the server detects inactivity or hits a timeout, which can occur several minutes later.

It'd be worth trying using a with statement when opening the socket ensures that the connection is properly closed on the client side once the operation is complete.

Cybis320 commented 1 month ago

Also, I think the code as it is keeps opening a new socket every 30 minutes without ever closing them.