Cannot get file/Cannot get watercare, protocol retry count exceeded

Version of the custom_component

0.1.8

Configuration

GUI configuration

Describe the bug

Errors are being logged, presumably while it tries to download SpaPackStruct.xml. Do I need to download that file manually?

Cannot get file, protocol retry count exceeded
Cannot get watercare, protocol retry count exceeded

An AssertionError is also thrown: assert self.facade is not None

Debug log


2022-05-21 15:52:58 ERROR (MainThread) [geckolib.async_spa] Cannot get watercare, protocol retry count exceeded
2022-05-21 15:54:42 ERROR (MainThread) [geckolib.async_spa] Cannot get file, protocol retry count exceeded
2022-05-21 15:55:47 ERROR (MainThread) [geckolib.async_spa] Cannot get file, protocol retry count exceeded
2022-05-21 15:56:51 ERROR (MainThread) [geckolib.async_spa] Cannot get file, protocol retry count exceeded
2022-05-21 15:57:21 ERROR (MainThread) [homeassistant] Error doing job: Task exception was never retrieved
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/geckolib/driver/udp_protocol_handler.py", line 155, in consume
    await self.async_handled(sender)
  File "/usr/local/lib/python3.9/site-packages/geckolib/driver/udp_protocol_handler.py", line 121, in async_handled
    await self._async_on_handled(self, sender)
  File "/usr/local/lib/python3.9/site-packages/geckolib/async_spa.py", line 580, in _async_on_wcerr
    await self._event_handler(GeckoSpaEvent.RUNNING_SPA_WATER_CARE_ERROR)
  File "/usr/local/lib/python3.9/site-packages/geckolib/async_spa_manager.py", line 467, in _handle_event
    assert self.facade is not None
AssertionError

After a while, I also started getting the error "Cannot get full struct". I don't know whether that indicates a partial download.

2022-05-21 22:12:09 ERROR (MainThread) [geckolib.async_spa] Cannot get full struct, protocol retry count exceeded.

It looks like it is struggling to talk to the in.touch CO module and is sometimes unable to get a complete set of values for the configuration and run state of the spa. There are quite a few reasons why this might happen.

The in.touch CO module is too far from the EN module. Check the RF signal value.
There is RF interference. Try changing the RF channel in the Gecko app.
There are several clients on the LAN (Normally just HA and your Gecko app, but sometimes there are several iOS/Android clients). Try closing all the apps on the mobile devices and disable background refresh.
The EN module gets confused sometimes and sends packets to the wrong clients, even if they are no longer active. Disconnect the EN module from power and LAN for a couple of mins and then reconnect.
Try re-pairing the modules for a full reset of the environment (paper clip in the EN module pinhole during power up)
The code could retry for longer. You'd need to edit the values in the config.py file in the site-packages folder. Currently PROTOCOL_RETRY_COUNT = 10, but you could increase it to see if that affects your experience. Change both places in that file where it's set. You could try 15 or 20, but it will really slow down the response time of the spa, so this would be a last-resort change.
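As a rough illustration of why a higher retry count slows things down, here is a minimal sketch of a retry loop. It is not geckolib's actual implementation: `request_with_retries`, `send_once` and the 2-second per-attempt timeout are assumptions made up for this example. Only the name PROTOCOL_RETRY_COUNT and its value of 10 come from the point above.

```python
import asyncio

PROTOCOL_RETRY_COUNT = 10        # value quoted in the thread
PROTOCOL_TIMEOUT_SECONDS = 2.0   # assumed per-attempt timeout, illustration only


async def request_with_retries(send_once, retries=PROTOCOL_RETRY_COUNT,
                               timeout=PROTOCOL_TIMEOUT_SECONDS):
    """Call send_once() until it answers in time or the retry budget runs out.

    send_once is a hypothetical coroutine function standing in for one
    request/response exchange with the spa.
    """
    for _attempt in range(retries):
        try:
            return await asyncio.wait_for(send_once(), timeout)
        except asyncio.TimeoutError:
            continue  # no reply in time, try again
    return None  # a caller would log "protocol retry count exceeded" here


def worst_case_seconds(retries, timeout=PROTOCOL_TIMEOUT_SECONDS):
    """Longest time one request can block before giving up."""
    return retries * timeout
```

With these assumed numbers, 10 retries can block for up to 20 seconds and 20 retries for up to 40, which is why a larger count makes the spa feel less responsive when the link is poor.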

We moved away from needing to download the SpaPackStruct.xml file a while back. It has been replaced by a set of hard-coded modules for each pack type and version.

Fingers crossed one of these points helps.

It looks like it is struggling to talk to the in.touch CO module and is sometimes unable to get a complete set of values for the configuration and run state of the spa. [...] We moved away from needing to download the SpaPackStruct.xml file a while back. It has been replaced by a set of hard-coded modules for each pack type and version.

Oh ok! I was thrown by "Cannot get file" and assumed it was related to the old SpaPackStruct.xml rituals. I just turned my spa back on after the winter shutdown and got to appreciate all the great work you've done on it. The integration is looking very nice at this point.

Now that I know those are timeouts/failed polls on the RF side, I can interpret them better. It seems to be a transient issue that doesn't occur that often. As an FYI here's the full set of errors logged yesterday:

2022-05-22 00:25:50 ERROR (MainThread) [geckolib.async_spa] Cannot get file, protocol retry count exceeded
2022-05-22 00:26:55 ERROR (MainThread) [geckolib.async_spa] Cannot get file, protocol retry count exceeded
2022-05-22 01:06:58 ERROR (MainThread) [geckolib.async_spa] Cannot get full struct, protocol retry count exceeded
2022-05-22 01:08:44 ERROR (MainThread) [geckolib.async_spa] Cannot get full struct, protocol retry count exceeded
2022-05-22 01:10:03 ERROR (MainThread) [geckolib.async_spa] Cannot get full struct, protocol retry count exceeded
2022-05-22 01:11:15 ERROR (MainThread) [geckolib.async_spa] Cannot get full struct, protocol retry count exceeded
2022-05-22 01:12:31 ERROR (MainThread) [geckolib.async_spa] Cannot get full struct, protocol retry count exceeded
2022-05-22 01:13:52 ERROR (MainThread) [geckolib.async_spa] Cannot get full struct, protocol retry count exceeded
2022-05-22 01:15:21 ERROR (MainThread) [geckolib.async_spa] Cannot get full struct, protocol retry count exceeded
2022-05-22 01:16:51 ERROR (MainThread) [geckolib.async_spa] Cannot get full struct, protocol retry count exceeded
2022-05-22 01:17:53 ERROR (MainThread) [geckolib.async_spa] Cannot get full struct, protocol retry count exceeded
2022-05-22 01:19:42 ERROR (MainThread) [geckolib.async_spa] Cannot get full struct, protocol retry count exceeded
2022-05-22 01:20:58 ERROR (MainThread) [geckolib.async_spa] Cannot get full struct, protocol retry count exceeded
2022-05-22 01:22:29 ERROR (MainThread) [geckolib.async_spa] Cannot get full struct, protocol retry count exceeded
2022-05-22 01:23:32 ERROR (MainThread) [geckolib.async_spa] Cannot get full struct, protocol retry count exceeded
2022-05-22 01:24:49 ERROR (MainThread) [geckolib.async_spa] Cannot get full struct, protocol retry count exceeded
2022-05-22 01:26:17 ERROR (MainThread) [geckolib.async_spa] Cannot get full struct, protocol retry count exceeded
2022-05-22 01:28:39 ERROR (MainThread) [geckolib.async_spa] Cannot get full struct, protocol retry count exceeded
2022-05-22 01:29:57 ERROR (MainThread) [geckolib.async_spa] Cannot get full struct, protocol retry count exceeded
2022-05-22 01:30:50 ERROR (MainThread) [geckolib.async_spa] Cannot get full struct, protocol retry count exceeded
2022-05-22 01:32:19 ERROR (MainThread) [geckolib.async_spa] Cannot get full struct, protocol retry count exceeded
2022-05-22 01:33:22 ERROR (MainThread) [geckolib.async_spa] Cannot get full struct, protocol retry count exceeded
2022-05-22 01:34:40 ERROR (MainThread) [geckolib.async_spa] Cannot get full struct, protocol retry count exceeded
2022-05-22 01:35:43 ERROR (MainThread) [geckolib.async_spa] Cannot get full struct, protocol retry count exceeded
2022-05-22 01:36:42 ERROR (MainThread) [geckolib.async_spa] Cannot get full struct, protocol retry count exceeded
2022-05-22 01:38:13 ERROR (MainThread) [geckolib.async_spa] Cannot get full struct, protocol retry count exceeded
2022-05-22 01:39:19 ERROR (MainThread) [geckolib.async_spa] Cannot get full struct, protocol retry count exceeded
2022-05-22 12:39:24 ERROR (MainThread) [geckolib.async_spa] Cannot get reminders, protocol retry count exceeded
2022-05-22 12:41:52 ERROR (MainThread) [geckolib.async_spa] Cannot get full struct, protocol retry count exceeded
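
For anyone wanting to summarise a run of entries like these, a small sketch follows. The regular expression is written against the line format shown in this thread only, and `tally_retry_errors` is a hypothetical helper name, not part of geckolib or Home Assistant.

```python
import re
from collections import Counter

# Matches lines such as:
# 2022-05-22 01:06:58 ERROR (MainThread) [geckolib.async_spa] Cannot get full struct, protocol retry count exceeded
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) ERROR \(MainThread\) "
    r"\[geckolib\.async_spa\] Cannot get (?P<what>.+?), protocol retry count exceeded"
)


def tally_retry_errors(lines):
    """Count retry-exceeded errors by the thing that could not be fetched."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            counts[match.group("what")] += 1
    return counts
```

Applied to the 29 entries listed above, this gives 2 for "file", 26 for "full struct" and 1 for "reminders".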

sensor.my_spa_radio shows that normally the signal level is around 28%, but it dropped to 1% between 12:27 and 01:39

Thanks for that sensor. I can keep an eye on the RF environment.

A small enhancement request: it would be easier to graph and automate if the signal level were a plain number and the channel number were in its own attribute or a separate sensor.
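
Until then, the number can be pulled out of the state text. This is only a sketch: the exact state format of the radio sensor is not shown in this thread, so the code assumes nothing more than that the state contains a percentage such as "28%". `signal_percent` is a hypothetical helper name.

```python
import re


def signal_percent(state):
    """Return the first percentage found in a sensor state string, or None."""
    match = re.search(r"(\d+)\s*%", state)
    return int(match.group(1)) if match else None
```

The same expression could be adapted inside a template sensor, so that the graphed value is numeric regardless of what else the state string contains.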

Thanks for all your hard work.

UPDATE - I remembered that you also gave us a very helpful last ping sensor. So I took a look at that and saw that pings failed between 01:05 and 01:39 so connectivity was effectively lost -- though I don't know why.

Note:

The last ping sensor is useful, though I'm wondering if it could be improved to be more readable and generate fewer rows in the recorder db.

If I understand things correctly, with a once-a-minute ping and everything working, a new state row with the datetime value of that sensor will get written to the database every minute. (One could add config to the recorder to ignore that sensor via YAML.)

Probably one of the more efficient representations would be to change that sensor into a "not responding to pings" binary sensor. The state/event tracking in the platform will then track when it started and when it stopped responding. It would also then be easy to set an automation to alert when the state is on for more than X minutes or something like that. (I'm just thinking off the top of my head)
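
A minimal sketch of that idea, assuming the integration already knows the time of each successful ping. The function names and the 2-minute threshold are made up for illustration and are not taken from the integration.

```python
from datetime import datetime, timedelta

MAX_SILENCE = timedelta(minutes=2)  # assumed threshold, illustration only


def not_responding(last_ping, now, max_silence=MAX_SILENCE):
    """True when the last successful ping is older than max_silence."""
    return now - last_ping > max_silence


def outage_windows(ping_times, max_silence=MAX_SILENCE):
    """Return (start, end) pairs for every gap between pings longer than max_silence."""
    windows = []
    for prev, cur in zip(ping_times, ping_times[1:]):
        if cur - prev > max_silence:
            windows.append((prev, cur))
    return windows
```

With pings at 01:04, 01:05, 01:39 and 01:40, `outage_windows` reports a single outage from 01:05 to 01:39, so the recorder would only need two state changes for the whole episode instead of one row per ping.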

but it dropped to 1% between 12:27 and 01:39

That'll do it! If the RF signal is low then the behavior is very erratic and I'd expect exactly the kinds of logs that you're seeing. Did you try changing channels, or is there any way you can get your EN module closer or better aligned with the CO one?

If there is some episodic interference that is trashing 30% of your signal then if you can get the signal default up to 50% it might not be so catastrophic.

I've found that moving the CO module around under the spa skirt can dramatically alter the signal strength. Also the orientation of the devices seems to have an effect too.

A small enhancement request: it would be easier to graph and automate if the signal level were a plain number and the channel number were in its own attribute or a separate sensor.

Nice idea, consider it added to the backlog

Probably one of the more efficient representations would be to change that sensor into a "not responding to pings" binary sensor.

I think this is a cool idea, thank you. Again, consider it on the backlog.

If I understand things correctly, if there is a once a minute ping

It's every two seconds when the spa is "active" otherwise it is one minute. I'd forgotten that all sensor data is written to the recorder, so that might be the source of my remaining high CPU issue. I might consider deprecating that sensor then if I can't find a way to exclude it from the recorder as default (not requiring any YAML editing).
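
To put rough numbers on that, here is a small worked calculation. It assumes every ping changes the sensor state and that every state change is written as one recorder row. The function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def rows_per_day(interval_seconds):
    """Rows written per day if one row is stored per ping at a fixed interval."""
    return SECONDS_PER_DAY // interval_seconds


def mixed_rows_per_day(active_seconds, active_interval=2, idle_interval=60):
    """Rows per day when the spa is 'active' for active_seconds and idle otherwise."""
    idle_seconds = SECONDS_PER_DAY - active_seconds
    return active_seconds // active_interval + idle_seconds // idle_interval
```

Under those assumptions, one ping per minute is 1,440 rows a day, one ping every two seconds would be 43,200, and a day with a single active hour comes to 3,180.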

but it dropped to 1% between 12:27 and 01:39

That'll do it! If the RF signal is low then the behavior is very erratic and I'd expect exactly the kinds of logs that you're seeing. Did you try changing channels, or is there any way you can get your EN module closer or better aligned with the CO one?

For the past year or two I wasn't aware of any specific connectivity issues, other than below, but now that I have the tools I can work to improve the signal. I also have to see if perhaps the Gecko Steamlinx could be a source of interference. They are both probably too close to other network equipment that might be generating RFI. I've extended my RTL-SDRs and Zwave sticks to get them away from things like the RPIs.

I have not tried changing channels yet. I didn't see that capability when I set it up initially. It would be interesting to see if changing the channel actually changes the frequency. I've seen a number of RF devices (weather, temperature/humidity) where the term "channel" means the ID (or part of the ID) rather than a frequency change.

If there is some episodic interference that is trashing 30% of your signal then if you can get the signal default up to 50% it might not be so catastrophic.

Understood. I've seen with my 433 MHz devices and rtl_433 that I do have occasional issues, often in the dead of night, where the noise floor suddenly goes up for a while and blocks reception from marginal devices. I haven't been able to detect any patterns that would lead to the culprit. Far fewer devices are active during those times: laptops, monitors, TVs, solar inverters, lighting, etc. are all off or on standby.

I've found that moving the CO module around under the spa skirt can dramatically alter the signal strength. Also the orientation of the devices seems to have an effect too.

Depending on the type of antenna within the units, orientation could matter quite a bit.

A small enhancement request: it would be easier to graph and automate if the signal level were a plain number and the channel number were in its own attribute or a separate sensor.

Nice idea, consider it added to the backlog

Probably one of the more efficient representations would be to change that sensor into a "not responding to pings" binary sensor.

I think this is a cool idea, thank you. Again, consider it on the backlog.

If I understand things correctly, if there is a once a minute ping

It's every two seconds when the spa is "active" otherwise it is one minute.

What are the criteria for active? One or more pumps running?

I'd forgotten that all sensor data is written to the recorder, so that might be the source of my remaining high CPU issue. I might consider deprecating that sensor then if I can't find a way to exclude it from the recorder as default (not requiring any YAML editing).

Some of the changes they've made in recent releases to reduce DB size and improve performance of things like the front end have got me rethinking some of my sensors. Sensors that continually change state but don't offer enough information value seem like good candidates for rethinking. I've also reduced some unnecessary precision from some sensors to reduce the number of state changes. I had some devices that would report things like volts and watts with 2-3 decimal places. Small changes in those values generated constant state updates, but were fairly meaningless for devices at mains voltage.
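
The effect of dropping precision can be sketched as follows. `changed_states` is a hypothetical helper, and the voltage readings are invented sample values, not measurements from my setup.

```python
def changed_states(readings, ndigits=0):
    """Yield a rounded reading only when it differs from the last one emitted."""
    last = object()  # sentinel that never equals a number
    for value in readings:
        rounded = round(value, ndigits)
        if rounded != last:
            last = rounded
            yield rounded


# Invented sample values for a device at mains voltage.
volts = [230.12, 230.31, 230.44, 231.02, 230.97]
```

At two decimal places all five sample readings count as state changes. Rounded to whole volts, only two do (230.0 and 231.0).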

I think the question is what kind of information do you want to get from the ping sensor and how often? A binary sensor captures the information about whether there is currently connectivity or not. The built in state change tracking for the binary sensor should provide a good history of whether there are periodic / intermittent problems. If things are working, the state isn't changing. If connectivity is lost the state changes once.

Another idea: a sensor that accumulates a count of missed pings could be useful. It would need to reset periodically. I don't know if that really provides more information or value than the binary sensor.
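
A minimal sketch of such a counter, assuming something calls it once per ping attempt with the outcome. The class name and the 24-hour reset window are assumptions for illustration only.

```python
from datetime import datetime, timedelta


class MissedPingCounter:
    """Count missed pings, starting again from zero after each reset window."""

    def __init__(self, reset_every=timedelta(hours=24)):
        self.reset_every = reset_every
        self.window_start = None
        self.missed = 0

    def record(self, when, responded):
        """Record one ping attempt and return the current missed count."""
        if self.window_start is None or when - self.window_start >= self.reset_every:
            self.window_start = when  # start a new counting window
            self.missed = 0
        if not responded:
            self.missed += 1
        return self.missed
```

Unlike the binary sensor, this changes state on every missed ping, so it would write more rows during an outage in exchange for showing how bad the outage was.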

NOTE: While my issue is resolved, I'm reluctant to close this because you've got some good info in here on interpreting those errors that could be helpful to others.

EDIT: Maybe it would be worth enabling discussions on this repo to keep issues cleaner - https://docs.github.com/en/discussions/quickstart

Another suggestion to avoid others opening similar issues - Reword error messages like Cannot get file to make the cause of the problem a little clearer.

I will open an issue for the radio and ping sensors as you suggested.

At least according to the iOS app, I've resolved my signal issues. The iOS app shows in.touch network signal strength between 84% and 94%, and the values do change as I move the EN module around. However, I can't tell that from Home Assistant, because I think I'm seeing a problem similar to #56.

No protocol errors have been logged for almost 4 days, yet my radio sensor has been stuck at 1% since almost a minute after the last protocol error.

2022-05-28 22:28:29 ERROR (MainThread) [geckolib.async_spa] Cannot get file, protocol retry count exceeded

sensor.my_spa_radio changed to 1% (down from 30%) at 2022-05-28 22:29:25

However the last_ping sensor is still updating every minute and is current. Other entities are current/updating, just not the radio sensor.

The ping sensor did show a 13.5 minute gap where connectivity was broken starting around 22:15, but after 22:29:25 pings were succeeding.

This has gone stale, closing this.

gazoodle / gecko-home-assistant