StephanJoubert / home_assistant_solarman

Home Assistant component for Solarman collectors used with a variety of inverters.
Apache License 2.0

Increasing number of CLOSE_WAIT TCP connections plus increasing number of connection errors in the ha log #523

Open kenchy opened 4 months ago

kenchy commented 4 months ago

Hello,

SOLIS Hybrid 3K inverter. I use solarman to get the values from the inverter, and solismod through MQTT to poke some of the settings for battery charging etc. I'm running HA core 2024.2, upgraded last week.

I've seen more MQTT failures in the past week since the upgrade (a failure being that the setting does not change as expected). In the solismod log I'm seeing connection errors to the inverter (connection refused). I'm seeing similar errors in the solarman HA log:

2024-02-23 03:03:31.324 WARNING (SyncWorker_10) [custom_components.solarman.solarman] Querying [33029 - 33095] failed, [1] retry attempts left
2024-02-23 03:03:31.324 INFO (SyncWorker_10) [custom_components.solarman.solarman] Connecting to solarman data logger 192.168.10.121:8899
2024-02-23 03:03:31.329 WARNING (SyncWorker_10) [custom_components.solarman.solarman] Querying [33029 - 33095] failed with exception [NoSocketAvailableError: Connection closed on read]
2024-02-23 03:03:31.329 INFO (SyncWorker_10) [custom_components.solarman.solarman] Disconnecting from solarman data logger 192.168.10.121:8899
2024-02-23 03:03:31.332 WARNING (SyncWorker_10) [custom_components.solarman.solarman] Querying [33029 - 33095] failed, [0] retry attempts left
2024-02-23 03:03:31.332 WARNING (SyncWorker_10) [custom_components.solarman.solarman] Querying registers [33029 - 33095] failed, aborting.
2024-02-23 03:04:01.128 INFO (SyncWorker_36) [custom_components.solarman.solarman] Connecting to solarman data logger 192.168.10.121:8899
2024-02-23 03:04:01.134 WARNING (SyncWorker_36) [custom_components.solarman.solarman] Querying [33029 - 33095] failed with exception [NoSocketAvailableError: Connection closed on read]
2024-02-23 03:04:01.135 INFO (SyncWorker_36) [custom_components.solarman.solarman] Disconnecting from solarman data logger 192.168.10.121:8899
2024-02-23 03:04:01.138 WARNING (SyncWorker_36) [custom_components.solarman.solarman] Querying [33029 - 33095] failed, [1] retry attempts left
2024-02-23 03:04:01.138 INFO (SyncWorker_36) [custom_components.solarman.solarman] Connecting to solarman data logger 192.168.10.121:8899
2024-02-23 03:04:01.142 WARNING (SyncWorker_36) [custom_components.solarman.solarman] Querying [33029 - 33095] failed with exception [NoSocketAvailableError: Connection closed on read]
2024-02-23 03:04:01.142 INFO (SyncWorker_36) [custom_components.solarman.solarman] Disconnecting from solarman data logger 192.168.10.121:8899
2024-02-23 03:04:01.145 WARNING (SyncWorker_36) [custom_components.solarman.solarman] Querying [33029 - 33095] failed, [0] retry attempts left
2024-02-23 03:04:01.146 WARNING (SyncWorker_36) [custom_components.solarman.solarman] Querying registers [33029 - 33095] failed, aborting.
2024-02-23 03:04:31.130 INFO (SyncWorker_30) [custom_components.solarman.solarman] Connecting to solarman data logger 192.168.10.121:8899
2024-02-23 03:04:31.225 WARNING (SyncWorker_30) [custom_components.solarman.solarman] Querying [33029 - 33095] failed with exception [NoSocketAvailableError: Connection closed on read]
2024-02-23 03:04:31.225 INFO (SyncWorker_30) [custom_components.solarman.solarman] Disconnecting from solarman data logger 192.168.10.121:8899
2024-02-23 03:04:31.227 WARNING (SyncWorker_30) [custom_components.solarman.solarman] Querying [33029 - 33095] failed, [1] retry attempts left
2024-02-23 03:04:31.227 INFO (SyncWorker_30) [custom_components.solarman.solarman] Connecting to solarman data logger 192.168.10.121:8899
2024-02-23 03:04:31.232 WARNING (SyncWorker_30) [custom_components.solarman.solarman] Querying [33029 - 33095] failed with exception [NoSocketAvailableError: Connection closed on read]
2024-02-23 03:04:31.233 INFO (SyncWorker_30) [custom_components.solarman.solarman] Disconnecting from solarman data logger 192.168.10.121:8899
2024-02-23 03:04:31.235 WARNING (SyncWorker_30) [custom_components.solarman.solarman] Querying [33029 - 33095] failed, [0] retry attempts left
2024-02-23 03:04:31.235 WARNING (SyncWorker_30) [custom_components.solarman.solarman] Querying registers [33029 - 33095] failed, aborting.
2024-02-23 03:05:01.131 INFO (SyncWorker_3) [custom_components.solarman.solarman] Connecting to solarman data logger 192.168.10.121:8899
2024-02-23 03:30:01.189 WARNING (SyncWorker_18) [custom_components.solarman.solarman] Querying [33029 - 33095] failed with exception [NoSocketAvailableError: Connection closed on read]
2024-02-23 03:30:01.189 INFO (SyncWorker_18) [custom_components.solarman.solarman] Disconnecting from solarman data logger 192.168.10.121:8899
2024-02-23 03:30:01.195 WARNING (SyncWorker_18) [custom_components.solarman.solarman] Querying [33029 - 33095] failed, [1] retry attempts left
2024-02-23 03:30:01.195 INFO (SyncWorker_18) [custom_components.solarman.solarman] Connecting to solarman data logger 192.168.10.121:8899
2024-02-23 04:00:03.868 WARNING (SyncWorker_29) [custom_components.solarman.solarman] Querying [33029 - 33095] failed with exception [V5FrameError: V5 frame contains invalid sequence number]
2024-02-23 04:00:03.869 INFO (SyncWorker_29) [custom_components.solarman.solarman] Disconnecting from solarman data logger 192.168.10.121:8899
2024-02-23 04:00:04.369 WARNING (SyncWorker_29) [custom_components.solarman.solarman] Querying [33029 - 33095] failed, [1] retry attempts left
2024-02-23 04:00:04.369 INFO (SyncWorker_29) [custom_components.solarman.solarman] Connecting to solarman data logger 192.168.10.121:8899
2024-02-23 06:30:01.996 WARNING (SyncWorker_12) [custom_components.solarman.solarman] Querying [33029 - 33095] failed with exception [NoSocketAvailableError: Connection closed on read]
2024-02-23 06:30:01.996 INFO (SyncWorker_12) [custom_components.solarman.solarman] Disconnecting from solarman data logger 192.168.10.121:8899
2024-02-23 06:30:01.999 WARNING (SyncWorker_12) [custom_components.solarman.solarman] Querying inverter 4075426842 at 192.168.10.121:8899 failed on connection start with exception [KeyError: '41 is not registered']

As well as some other errors.

When I look for CLOSE_WAIT connections I see that most of them are from homeassistant (presumably solarman) to the inverter.

tcp     5589      0 192.168.10.1:45888      192.168.10.121:8899     CLOSE_WAIT  317193/python3.11   
tcp        1      0 192.168.10.1:36970      192.168.10.121:8899     CLOSE_WAIT  317193/python3.11   
tcp       32      0 192.168.0.10:41218      3.123.196.97:443        CLOSE_WAIT  317193/python3.11   
tcp        1      0 192.168.10.1:41340      192.168.10.121:8899     CLOSE_WAIT  317193/python3.11   
tcp        1      0 192.168.10.1:54490      192.168.10.121:8899     CLOSE_WAIT  317193/python3.11   
tcp        1      0 192.168.10.1:43524      192.168.10.121:8899     CLOSE_WAIT  317193/python3.11   
tcp        1      0 192.168.10.1:58760      192.168.10.121:8899     CLOSE_WAIT  317193/python3.11   
tcp        1      0 192.168.10.1:36948      192.168.10.121:8899     CLOSE_WAIT  317193/python3.11   
tcp        1      0 192.168.10.1:47424      192.168.10.121:8899     CLOSE_WAIT  317193/python3.11   
tcp     5589      0 192.168.10.1:42606      192.168.10.121:8899     CLOSE_WAIT  317193/python3.11   
tcp        1      0 192.168.10.1:41324      192.168.10.121:8899     CLOSE_WAIT  317193/python3.11   
tcp       32      0 192.168.0.10:42870      3.123.196.97:443        CLOSE_WAIT  317193/python3.11   
tcp     6145      0 192.168.10.1:42502      192.168.10.121:8899     CLOSE_WAIT  317193/python3.11   
tcp        1      0 192.168.10.1:43500      192.168.10.121:8899     CLOSE_WAIT  317193/python3.11   
tcp        1      0 192.168.10.1:58780      192.168.10.121:8899     CLOSE_WAIT  317193/python3.11   
tcp     5589      0 192.168.10.1:39312      192.168.10.121:8899     CLOSE_WAIT  317193/python3.11   
tcp     5575      0 192.168.10.1:58458      192.168.10.121:8899     CLOSE_WAIT  317193/python3.11   
tcp        1      0 192.168.10.1:33450      192.168.10.121:8899     CLOSE_WAIT  317193/python3.11   
tcp        1      0 192.168.10.1:56920      192.168.10.121:8899     CLOSE_WAIT  317193/python3.11   
tcp        1      0 192.168.10.1:44642      192.168.10.121:8899     CLOSE_WAIT  317193/python3.11   
tcp        1      0 192.168.10.1:56952      192.168.10.121:8899     CLOSE_WAIT  317193/python3.11   
tcp        1      0 192.168.10.1:33470      192.168.10.121:8899     CLOSE_WAIT  317193/python3.11   
tcp        1      0 192.168.10.1:44630      192.168.10.121:8899     CLOSE_WAIT  317193/python3.11   
tcp        1      0 192.168.10.1:47410      192.168.10.121:8899     CLOSE_WAIT  317193/python3.11   
tcp        1      0 192.168.10.1:54510      192.168.10.121:8899     CLOSE_WAIT  317193/python3.11   

These go away when I restart homeassistant, and the connection errors do too. I restarted 24 hours ago, and these are the CLOSE_WAITs accumulated in that time. The network connection seems stable; I've been running a ping every 30 seconds for the past 24 hrs and have seen no drops. My best guess is that the excessive connections are overloading the logger in some way and causing it to become unreliable?

Is anyone else seeing this? I am running custom parameters, which had been working successfully for a few months before the upgrade to the latest version last week.

kenchy commented 4 months ago

So it's maybe connected to a recent update. My count of CLOSE_WAIT connections went up by one between 15:25 and 15:30. In the log for this time period there is only one solarman error:

2024-02-23 15:30:32.847 WARNING (SyncWorker_10) [custom_components.solarman.solarman] Querying [33029 - 33095] failed with exception [NoSocketAvailableError: Connection closed on read]
2024-02-23 15:30:32.848 INFO (SyncWorker_10) [custom_components.solarman.solarman] Disconnecting from solarman data logger 192.168.10.121:8899
2024-02-23 15:30:32.854 WARNING (SyncWorker_10) [custom_components.solarman.solarman] Querying [33029 - 33095] failed, [1] retry attempts left
2024-02-23 15:30:32.854 INFO (SyncWorker_10) [custom_components.solarman.solarman] Connecting to solarman data logger 192.168.10.121:8899

There does seem to be a rough correlation between the 'Disconnecting from solarman data logger' messages and the count of CLOSE_WAITs rising. The change 'Improve reliability by {dis/re}connecting on V5FrameError' was added 6 months ago. I hadn't updated HA/Solarman in a while before, so I almost certainly didn't have that in then.

Could anyone run 'sudo netstat -anp | grep CLOSE_WAIT' on their HA box and see if they're seeing the same thing?
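
For anyone who prefers to check this from Python rather than netstat: CLOSE_WAIT means the logger has closed its end of the connection but the local process has not yet closed its socket, so the descriptors pile up. A minimal sketch of the same check, assuming psutil is available (it may need root to see other processes' sockets), with the logger address from the logs above:

    # Count sockets to the data logger that are stuck in CLOSE_WAIT.
    import psutil

    LOGGER = ("192.168.10.121", 8899)  # data logger IP/port from the logs above

    def count_close_wait(target=LOGGER):
        count = 0
        for conn in psutil.net_connections(kind="tcp"):
            if conn.status == psutil.CONN_CLOSE_WAIT and conn.raddr and tuple(conn.raddr) == target:
                count += 1
        return count

    print(f"CLOSE_WAIT sockets to {LOGGER[0]}:{LOGGER[1]}: {count_close_wait()}")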

kenchy commented 4 months ago

I've commented out the call to modbus.disconnect to see if the problem goes away, gets worse or causes different issues. I'll report back when I know anything.

jmccrohan commented 4 months ago

I've commented out the call to modbus.disconnect to see if the problem goes away, gets worse or causes different issues. I'll report back when I know anything.

There is obviously different behaviour between the different versions of the Solarman data logger. Can you provide more details about the firmware version on your logger?

For reference, attached below is my output running MW_08_512_0501_1.94. This is what I based the commit referenced above on.


~ # netstat -anp | grep 8899      
tcp        0      0 <HA_HOST>:34446      <DATA_LOGGER>:8899     ESTABLISHED - 

kenchy commented 4 months ago

I'm on MW3_15_0501_1.17 and Extended System version V1.1.00.0B. I didn't comment out the right disconnect, so I'm going to let the CLOSE_WAITs build to a point where they seem to cause more issues (I'm still not sure what that level is), and then try again.

I never experienced unreliability with this integration; what were the symptoms that made you put that in? Is there an issue for it?

What version of Python is your system running? Mine's on 3.11.2.

dtuser007 commented 4 months ago

Same here with a Deye 600G, running: Solarman 1.5.1, Core 2024.2.4, Supervisor 2024.02.1, Operating System 12.0, Frontend 20240207.1

Looks very ugly in the log, but data is still acquired from the rig. However, I am not able to figure out the data you talked about above. Sorry.

sofkaski commented 4 months ago

One observation from my system, which runs HAOS on a Raspberry Pi 4 and Solarman 1.5.1 (Core 2024.2.5, Supervisor 2024.02.1, Operating System 12.0, Frontend 20240207.1).

The inverter is a hybrid Deye SUN-8K-SG04P3-EU with a logger with firmware version LSW3_15_FFFF_1.0.9E. I have also connected it to the Deye cloud.

kenchy commented 3 months ago

So I commented out the relevant calls to self.disconnect_from_server(). So far I've seen 3 occasions (in 8 hours) in the logs where it would have been called, and no CLOSE_WAIT. Solismod did have about 10 established connections (from about 2 weeks' uptime), so I restarted that as well and will see if those start to build too.

sofkaski commented 3 months ago

Some more observations on this case.

... from my system, which runs HAOS on a Raspberry Pi 4 and Solarman 1.5.1 (Core 2024.2.5, Supervisor 2024.02.1, Operating System 12.0, Frontend 20240207.1)

The inverter is a hybrid Deye SUN-8K-SG04P3-EU with a logger with firmware version LSW3_15_FFFF_1.0.9E. I have also connected it to the Deye cloud.

It seems that the logger resets the TCP connection every 3 hours:

2024-03-09 22:20:15.283 DEBUG (Thread-2 (_data_receiver)) [custom_components.solarman.solarman] [xxxxxxx] Connection RESET by peer.
2024-03-10 01:20:35.848 DEBUG (Thread-3 (_data_receiver)) [custom_components.solarman.solarman] [xxxxxxxx] Connection RESET by peer.
2024-03-10 04:20:36.321 DEBUG (Thread-4 (_data_receiver)) [custom_components.solarman.solarman] [xxxxxxxx] Connection RESET by peer.

Retries recover these, but then (in this case in the fourth 3-hour period) there is the following:

2024-03-10 07:20:52.445 WARNING (SyncWorker_26) [custom_components.solarman.solarman] Querying [644 - 653] failed with exception [Empty: ]

To me this smells like the logger is leaking some resources inside it and eventually fails.

I now have a version running where the TCP connection is closed after every 30 s polling round. Let's see how it behaves.

sofkaski commented 3 months ago

8 hours without errors ... Let's see how this goes further.

kenchy commented 3 months ago

I've taken a different approach now; just commenting out the disconnect resulted in a failure after about a week. I've increased the retries to 3, added a 10 second delay, and put the self.disconnect back in, and it's behaving differently.

2024-03-12 11:53:52.222 WARNING (SyncWorker_9) [custom_components.solarman.solarman] Querying [33029 - 33095] failed with exception [NoSocketAvailableError: Connection closed on read]
2024-03-12 11:53:52.223 WARNING (SyncWorker_9) [custom_components.solarman.solarman] Querying [33029 - 33095] failed, [2] retry attempts left
2024-03-12 11:54:02.221 WARNING (MainThread) [homeassistant.helpers.entity] Update of sensor.solarman_inverter_status is taking over 10 seconds
2024-03-12 14:53:52.684 WARNING (SyncWorker_18) [custom_components.solarman.solarman] Querying [33029 - 33095] failed with exception [NoSocketAvailableError: Connection closed on read]
2024-03-12 14:53:52.684 WARNING (SyncWorker_18) [custom_components.solarman.solarman] Querying [33029 - 33095] failed, [2] retry attempts left
2024-03-12 14:54:02.674 WARNING (MainThread) [homeassistant.helpers.entity] Update of sensor.solarman_inverter_status is taking over 10 seconds

So it looks like the 10 second wait and then the retry is working. But as the self.disconnect is still there, if it hits an actual problem it will disconnect. That's the theory anyway. It might be risky, as I think there's a 1 minute timeout from homeassistant? I've definitely been hit by that in the past. Currently no CLOSE_WAITs are being generated, so we'll see.
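
A rough, self-contained sketch of the retry-with-delay shape described here, keeping the worst case (attempts Γ— delay) below the Home Assistant update timeout (the numbers are the ones mentioned above; this is not the integration's actual code):

    # Sketch only: retry a query with a fixed delay between attempts.
    import time

    QUERY_RETRY_ATTEMPTS = 3   # increased retry count mentioned above
    RETRY_DELAY_SECONDS = 10   # 3 x 10 s = 30 s worst case before giving up

    def query_with_retries(do_query):
        last_error = None
        for attempt in range(QUERY_RETRY_ATTEMPTS):
            try:
                return do_query()
            except Exception as e:
                last_error = e
                if attempt < QUERY_RETRY_ATTEMPTS - 1:
                    time.sleep(RETRY_DELAY_SECONDS)
        raise last_error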

sofkaski commented 3 months ago

My attempted workaround for the logger problem did not work. The failure comes a bit randomly, between 3 and 8 hours from start. I also tried with a 60 second polling interval. No luck. I also tried with the latest core 2024.3.1 and the result is the same.

I am still of the opinion that in my case the root cause is in the logger stick (or inverter) and not in HA or in the solarman integration. My reasoning is that a browser connection fails at the same time with the same symptom. So the question is what causes the stick to jam, and is there any workaround on the solarman integration side?

kenchy commented 3 months ago

I think that you are right, except that the errors did not occur on the old version; the fault still lies with the logger though. I think this change comes from an update to pysolarman, but I haven't looked at why those changes were put in. My workaround is working so far, but then mine failed on a weekly basis, not every 3-8 hours.

(I put import time at the top of solarman.py, and then a couple of minor mods in this section

           log.warning(f"Querying [{start} - {end}] failed with exception [{type(e).__name__}: {e}]")
           #self.disconnect_from_server()
           if result == 0:
             log.warning(f"Querying [{start} - {end}] failed, [{attempts_left}] retry attempts left")
             time.sleep(10)

)

I think the thing that my logger wants to avoid is disconnects, unless there is an issue. Too many disconnects and then I start seeing issues.

kenchy commented 3 months ago

Well, with the modifications I made it's been running for about a week. Twice in that week it's run through the retry cycle and then disconnected. (Several times a day it fails on the first try and then succeeds on the second.) No CLOSE_WAITs are building, so a success so far. If anyone who is interested is reading this, would you consider a change like this? I'm going to leave it for a few days, and then try again with a 5 second retry delay. I'm seeing references to a 30 second timeout, which it might be hitting (though it gets to the disconnect eventually, in 2-3 mins), so 5 seconds, and maybe reducing it back to 1 retry, might help that.

jmccrohan commented 3 months ago

Well, with the modifications I made it's been running for about a week. Twice in that week it's run through the retry cycle and then disconnected. (Several times a day it fails on the first try and then succeeds on the second.) No CLOSE_WAITs are building, so a success so far. If anyone who is interested is reading this, would you consider a change like this? I'm going to leave it for a few days, and then try again with a 5 second retry delay. I'm seeing references to a 30 second timeout, which it might be hitting (though it gets to the disconnect eventually, in 2-3 mins), so 5 seconds, and maybe reducing it back to 1 retry, might help that.

I'm happy to take a look to ensure that your changes continue to work okay on my logger (which exhibits different behaviour). Have you pushed your changes somewhere for others to review?

kenchy commented 3 months ago

I've just created a repo with the changes, such as they are!

https://github.com/StephanJoubert/home_assistant_solarman/compare/main...kenchy:home_assistant_solarman:main

darkman82 commented 3 months ago

Hi guys, I had the same problem, so I took a look at solarman.py and am currently testing some changes. I'm sorry I read this too late, as we came to the same solution.

The problem is that solarman.py disconnects/reconnects on each read failure, and I get [CRCError: CRC validation failed.] quite often. Plus, the logger itself forces a disconnection every 300 secs (you can set it up to 600 from http://loggerIP/config_hide.html).

I think there is no need to disconnect and reconnect on each failure, especially when the pysolarmanv5 library can take care of the TCP connection itself by using auto_reconnect=True.
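
For reference, a minimal sketch of using the library directly with auto_reconnect (assuming pysolarmanv5 >= 3.0; the IP and serial number below are placeholders taken from the logs earlier in this thread, not anyone's real setup):

    # Sketch, not the integration's code: let pysolarmanv5 manage the TCP
    # connection itself instead of disconnecting/reconnecting on every failure.
    from pysolarmanv5 import PySolarmanV5

    modbus = PySolarmanV5(
        "192.168.10.121",     # data logger IP (placeholder)
        4075426842,           # data logger serial number (placeholder)
        port=8899,
        mb_slave_id=1,
        auto_reconnect=True,  # library re-opens the socket if the logger drops it
    )

    # e.g. the 33029-33095 input-register block queried in the logs above
    values = modbus.read_input_registers(register_addr=33029, quantity=67)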

So basically I re-arranged the code by removing all disconnect() calls (I left one for the main failure) and by placing a tiny sleep between read failures, so that the subsequent reads can be performed in a slightly different condition (a poor man's solution; I don't know what's going on in the logger, and I have already tried two of them with different firmware and starting serial numbers).

In a nutshell:

After a few hours: no more accumulating CLOSE_WAIT TCP connections. Still a CRC error every 1-2 min, but the subsequent read has about a 90% success rate.

I'm going to test it for a week or so, but for now I don't see "unknown values" every 2-3 min in HA, and this is a great achievement!!

If someone else would like to give it a try, I'm attaching the changes here:

remember to import time

    def get_statistics(self):
        result = 1
        params = ParameterParser(self.parameter_definition)
        requests = self.parameter_definition['requests']
        log.debug(f"Starting to query for [{len(requests)}] ranges...")

        try:
            for request in requests:
                start = request['start']
                end = request['end']
                mb_fc = request['mb_functioncode']
                log.debug(f"Querying [{start} - {end}]...")
                attempts_left = QUERY_RETRY_ATTEMPTS
                while attempts_left > 0:
                    attempts_left -= 1
                    try:
                        self.connect_to_server()
                        self.send_request(params, start, end, mb_fc)
                        log.debug(f"Querying [{start} - {end}] succeeded")
                        break
                    except Exception as e:
                        log.warning(f"Querying [{start} - {end}] failed with exception [{type(e).__name__}: {e}] - [{attempts_left}] retry attempts left")
                        if attempts_left == 0:
                            log.warning(f"Querying registers [{start} - {end}] failed, aborting.")
                            self._current_val = {} # Clear cached previous results to not report stale and incorrect data
                            result = 0
                            break
                        else:
                            time.sleep(1) # Give some time to the logger to rest from errors (maybe useless)
                if result == 0: # early exit
                    break
            if result == 1:
                log.debug(f"All queries succeeded, exposing updated values.")
                self.status_lastUpdate = datetime.now().strftime("%m/%d/%Y, %H:%M:%S")
                self.status_connection = "Connected"
                self._current_val = params.get_result()
        except Exception as e:
            log.warning(f"Querying inverter {self._serial} at {self._host}:{self._port} failed on connection start with exception [{type(e).__name__}: {e}]")
            self.status_connection = "Disconnected"
            # Clear cached previous results to not report stale and incorrect data
            self._current_val = {}
            self.disconnect_from_server()

sofkaski commented 3 months ago

@darkman82 I have now your fix running in my environment (Deye SUN-8K-SG04P3-EU with a logger with firmware version LSW3_15_FFFF_1.0.9E.).

I have not seen CRC errors in my environment.

Btw, I did not yet change the TCP timeout.

sofkaski commented 3 months ago

That attempt failed, after running a bit more than an hour, with Querying [3 - 89] failed with exception [Empty: ]. At that point there was an open TCP connection to the logger:

# netstat -anp | grep 8899
tcp        0    648 192.168.68.121:43522    192.168.68.114:8899     ESTABLISHED 3330/python3

The logger did not respond to ping (which it does when it is ok). No CLOSE_WAIT visible.

I think that my case is a different, but likely related, problem. What I have so far understood from my various debugging attempts:

I have tried a couple of recovery actions for the Empty exception in Pysolarman that have not worked:

The next thing I'll try is to reduce the TCP timeout on the logger to be less than the polling interval, and to close and reopen the connection when the error happens. If the hypothesis about recovery time is correct, then the integration should also recover from this error.

darkman82 commented 3 months ago

Just crashed again with [NoSocketAvailableError: Connection closed on read] (the stick may have closed the TCP connection when the timeout expired), THEN [NoSocketAvailableError: Connection already closed.] for any subsequent read attempt.

This means the pysolarmanv5 library doesn't handle the "auto reconnect" functionality well.

I'm going to change the code again to catch this exception and force a reconnection...

darkman82 commented 3 months ago

@sofkaski unfortunately we can't see what's going on in the logger, but I never experienced your issue with mine. It seems that your logger is getting flooded by the integration and then dies for a while, and this is your MAIN issue.

On my logger I have:

I read values every 15 secs (default).

Are you sure your logger resets the TCP connection every 3 hours?? This is strange with 300s as the TCP timeout. I get the Empty exception only when the logger closes the connection (or the connection is lost) while the integration is reading.

Btw, although I tried two different loggers with different FW and different starting serial numbers, I get a lot of CRC errors. So I think it could be a problem with the inverter, BUT I'm still able to get correct reads by retrying the read.

The problem with HA is that the current retry mechanism implied a disconnect/reconnect, and after a while I was getting too many stale TCP sockets (there must be a leak in the pysolarmanv5 v3.0.1 library).

Here is the new code to handle the auto-reconnect failure (still working on it btw):

    def get_statistics(self):
        params = ParameterParser(self.parameter_definition)
        requests = self.parameter_definition['requests']
        log.debug(f"Starting to query for [{len(requests)}] ranges...")
        try:
            for request in requests:
                start = request['start']
                end = request['end']
                mb_fc = request['mb_functioncode']
                log.debug(f"Querying [{start} - {end}]...")
                attempts_left = QUERY_RETRY_ATTEMPTS
                while True: # we check the condition for breaking inside
                    try:
                        self.connect_to_server()
                        self.send_request(params, start, end, mb_fc)
                        log.debug(f"Querying [{start} - {end}] succeeded")
                        break
                    except Exception as e:
                        attempts_left -= 1
                        if attempts_left > 0:
                            log.warning(f"Querying [{start} - {end}] failed with exception [{type(e).__name__}: {e}] - [{attempts_left}] retry attempts left")
                            time.sleep(1) # Give some time to the logger to rest from errors (maybe useless)
                        else:
                            log.warning(f"Querying registers [{start} - {end}] failed with [{type(e).__name__}: {e}] after {QUERY_RETRY_ATTEMPTS} attempts, aborting.")
                            raise # early exit
            log.debug(f"All queries succeeded, exposing updated values.")
            self.status_lastUpdate = datetime.now().strftime("%m/%d/%Y, %H:%M:%S")
            self.status_connection = "Connected"
            self._current_val = params.get_result()
        except Exception as e:
            log.warning(f"Querying inverter {self._serial} at {self._host}:{self._port} failed with exception [{type(e).__name__}: {e}]")
            self.status_connection = "Disconnected"
            self._current_val = {} # Clear cached previous results to not report stale and incorrect data
            if type(e).__name__ == "NoSocketAvailableError" and str(e) == "Connection already closed.":
                self.disconnect_from_server() # pysolarmanv5 was unable to reconnect properly!! auto_reconnect did NOT work as expected

sofkaski commented 3 months ago

Are you sure your logger resets the TCP connection every 3 hours?? This is strange with 300s as the TCP timeout. I get the Empty exception only when the logger closes the connection (or the connection is lost) while the integration is reading.

Yes. It puzzles me where that 3 hours comes from.

Apr 01 01:16:37 homeassistant homeassistant[567]: 2024-04-01 04:16:37.147 DEBUG (Thread-57 (_data_receiver)) [custom_components.solarman.solarman] [2799531105] Connection RESET by peer.
Apr 01 04:16:37 homeassistant homeassistant[567]: 2024-04-01 07:16:37.149 DEBUG (Thread-75 (_data_receiver)) [custom_components.solarman.solarman] [2799531105] Connection RESET by peer.
Apr 01 07:16:37 homeassistant homeassistant[567]: 2024-04-01 10:16:37.186 DEBUG (Thread-93 (_data_receiver)) [custom_components.solarman.solarman] [2799531105] Connection RESET by peer.

I increased the polling period to 10 min, which is way over the suspected 'recovery' time of the logger. Now the integration has been running 17 hours without the Empty exception surfacing. (The peer still resets every 3 hours, but pysolarman recovers.)

kenchy commented 3 months ago

I also see the 3 hour thing in the log. After putting my changes in, as detailed in https://github.com/StephanJoubert/home_assistant_solarman/issues/523#issuecomment-2009979324, I have been running ever since with no issues. Well, I see some connection issues in the logs, but the system always recovers quickly and there are no increasing connections.

I think we are all seeing slightly different issues due to different firmwares, inverters etc. However, am I correct in saying that it all worked fine in the older version? I think the issues started with the change to pysolarman, so should we be looking there, rather than trying to work around issues in solarman?

sofkaski commented 3 months ago

I joined the party late with only the latest versions, so I can't comment on whether the old version would have worked for me, but the symptoms indeed hint that the problem may be about connection management.

Actually, I found one suspicious thing in pysolarmanv5. When closing the connection, an empty string is written to the socket. If the peer has already closed the connection, this throws an exception before socket.close() is called. I have now been successfully running for 6 hours a version where closing is done using socket.shutdown(...) + socket.close(), plus a bit more debug logging. These versions are here: pysolarmanv5 and solarman.
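
A minimal sketch of that kind of close (just the general shape, not the actual pysolarmanv5 patch):

    # Defensive socket close: shut the connection down first, and ignore the
    # error if the peer has already closed/reset it.
    import socket

    def close_quietly(sock: socket.socket) -> None:
        try:
            sock.shutdown(socket.SHUT_RDWR)  # signal we are done in both directions
        except OSError:
            pass  # peer already closed the connection; nothing left to shut down
        finally:
            sock.close()  # always release the file descriptor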

Integration is running with its normal 30 s period.

Let's see how this survives. So far so good.

darkman82 commented 3 months ago

I can confirm that pysolarmanv5 doesn't handle the auto-reconnect properly, as it is unable to revive the TCP connection once it raises [NoSocketAvailableError: Connection already closed.].

As a workaround in solarman I use disconnect_from_server() when I catch it.

sofkaski commented 3 months ago

Let's see how this survives. So far so good.

Not good any more. It failed again, first with the queue Empty exception and then NoSocketAvailable. Need to investigate further.

darkman82 commented 3 months ago

Let's see how this survives. So far so good.

Not good any more. It failed again, first with the queue Empty exception and then NoSocketAvailable. Need to investigate further.

What's the description text of the NoSocketAvailableError?

I'm fine since I put in the latest code above; more than 48h of uptime.

darkman82 commented 2 months ago

Hi guys, just to let you know that with the fix I published above, the system has now been working for 2 weeks.

sofkaski commented 2 months ago

An update from me. I have now been successfully running tweaked versions of solarman and pysolarmanv5 for about a week with a 10 min update interval. A 10 min update interval is ok for my use case, since solar production changes rather slowly according to time of year, time of day and weather.

My thinking from previous attempts was that the logger leaks resources while the TCP connection is open and eventually fails. When I was running the integration with an update interval longer than the logger TCP timeout (300 s default), that error did not surface any more. But there were peer resets, and recovery from them failed after a couple of hours.

Then I returned to my idea of closing the connection at the end of each update round and re-establishing it at the beginning of the next round. The overhead is not that big (100 ms in each update round that takes about 1200 ms). This seems to work best so far. There are still failures about once a day where the socket connection fails, but the next update round succeeds.
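
A rough sketch of that pattern (assuming pysolarmanv5 >= 3.0, which provides a disconnect() method; the function and names here are illustrative, not the integration's actual code):

    # Open a fresh connection for each polling round and always close it,
    # so the logger never holds a long-lived TCP session.
    from pysolarmanv5 import PySolarmanV5

    def poll_once(host, serial, ranges):
        modbus = PySolarmanV5(host, serial, port=8899, mb_slave_id=1)
        try:
            results = {}
            for start, quantity in ranges:
                results[start] = modbus.read_input_registers(register_addr=start, quantity=quantity)
            return results
        finally:
            modbus.disconnect()  # close the TCP connection at the end of the round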

As a reminder, my environment has a Deye SUN-8K-SG04P3-EU inverter and a logger with firmware version LSW3_15_FFFF_1.0.9E. (The logger FW version is the latest, according to the answer from Solarman customer service.)

jmccrohan commented 2 months ago

@sofkaski I'm happy to incorporate your fixes into the next release of pysolarmanv5. Submit a PR and we'll take it from there.

sofkaski commented 2 months ago

@sofkaski I'm happy to incorporate your fixes into the next release of pysolarmanv5. Submit a PR and we'll take it from there.

I'll do that.

theyo-tester commented 2 months ago

I'm having the same problems with HomeAssistant and the data logger. It seems that the changes/fix by @sofkaski got merged into the later versions of pysolarmanv5 πŸ‘.

But how do I update it in HomeAssistant? Do I just update the library in the HA terminal with "pip install pysolarmanv5"?

The library seems to be the latest, but I suspect that the custom component is not using this one?! Because I am still getting "CLOSE_WAIT" connections after some hours...

sofkaski commented 2 months ago

But how do I update it in HomeAssistant? Do I just update the library in the HA terminal with "pip install pysolarmanv5"?

I have done this by rebuilding the HA core. That pulls in the (latest) dependencies. So ha core rebuild on the command line.

You can check what version of pysolarman is in use in the HA instance e.g. this way:

  1. Start the terminal to get to the command line
  2. Start a bash shell in the homeassistant container: docker exec -it homeassistant /bin/bash
  3. List the packages in use and check pysolarman: python3 -m pip list | grep pysolarman
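
An alternative, once inside the container, is to ask Python directly (standard library only):

    # Print the installed pysolarmanv5 version from inside the homeassistant container.
    from importlib.metadata import version
    print(version("pysolarmanv5"))  # e.g. 3.0.2
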
theyo-tester commented 2 months ago

Thanks for the answer. Oh, I thought that I was already inside the correct container, but it was the supervisor, I think. But anyhow, it seems that I am already using the latest version: python3 -m pip list | grep pysolarman shows pysolarmanv5 3.0.2

But maybe something happened meanwhile, after multiple starts, because I don't get any "CLOSE_WAIT" connections anymore πŸ‘πŸ‘

jouster1974 commented 2 months ago

For clarity, this is still an issue on 2024.5.

otyyys commented 2 weeks ago

Some more observations on this case.

... from my system, which runs HAOS on a Raspberry Pi 4 and Solarman 1.5.1 (Core 2024.2.5, Supervisor 2024.02.1, Operating System 12.0, Frontend 20240207.1). The inverter is a hybrid Deye SUN-8K-SG04P3-EU with a logger with firmware version LSW3_15_FFFF_1.0.9E. I have also connected it to the Deye cloud.

It seems that the logger resets the TCP connection every 3 hours:

2024-03-09 22:20:15.283 DEBUG (Thread-2 (_data_receiver)) [custom_components.solarman.solarman] [xxxxxxx] Connection RESET by peer.
2024-03-10 01:20:35.848 DEBUG (Thread-3 (_data_receiver)) [custom_components.solarman.solarman] [xxxxxxxx] Connection RESET by peer.
2024-03-10 04:20:36.321 DEBUG (Thread-4 (_data_receiver)) [custom_components.solarman.solarman] [xxxxxxxx] Connection RESET by peer.

Retries recover these, but then (in this case in the fourth 3-hour period) there is the following:

2024-03-10 07:20:52.445 WARNING (SyncWorker_26) [custom_components.solarman.solarman] Querying [644 - 653] failed with exception [Empty: ]

To me this smells like the logger is leaking some resources inside it and eventually fails.

I now have a version running where the TCP connection is closed after every 30 s polling round. Let's see how it behaves.

Hello,

I realized that my logger in a Sofar inverter disconnects and reconnects to my wifi AP exactly every 3 hours. My speculation is that a full or partial logger restart is the odd way to make things more stable if you are not able to fix the bugs. Most of my read errors correlate with wifi disconnect events.

Not to mention the poor wifi radio quality.