eifinger / hass-foldingathomecontrol

Homeassistant integration for FoldingAtHomeControl
MIT License
19 stars 4 forks source link

Connection after restart #2

Closed TargetCrafter closed 4 years ago

TargetCrafter commented 4 years ago

Version of the custom_component

v0.0.1

I had the integration working yesterday, but i shutdown the pc overnight, and now it's showing a status from 15 hours ago. Even after restarting HA, still no result.

eifinger commented 4 years ago

Thank you for opening an issue. I will look into it.

TargetCrafter commented 4 years ago

I just changed the configuration to only use 8 cores on my cpu, and the sensor in HA changed accordingly. So it seems like there is still a connection between the PI and my pc, it's just not updating the status.

After some more digging, it looks like the addon is adding a new sensor every time you change a setting like amount of cores used. I first change the setting to 4 cores, after wich i changed it to 8 cores, and in HA i have 3 CPU sensors now: cpu:15, cpu:4 and cpu:8.

Not that i'm changing the setting every day, but i'd be nice if this would be cleaned up to prevent having loads of useless sensors.

eifinger commented 4 years ago

Thanks for the feedback. I created a new Issue to solve that. Cant promise when its done.

As for your initial problem. Are you sure that the PC was working on a workunit and thus the state should have changed?

TargetCrafter commented 4 years ago

Yes, the pc was running a calculation at that time, so it should've changed from paused to running.

It's also just showing a status from 14 hours ago, eventough i just ran a task this morning: image

eifinger commented 4 years ago

I could reproduce your error and have a fix. I still struggle to produce a unit test^^

JanosAudron commented 4 years ago

Almost same here. After PC reboot (or just the F@H server), HA does not reconnect and shows old values. But in my case, rebooting HA fix the problem.

TargetCrafter commented 4 years ago

Almost same here. After PC reboot (or just the F@H server), HA does not reconnect and shows old values. But in my case, rebooting HA fix the problem.

Yes, rebooting HA fixes the problem, but i don't want to restart HA every time i restart my computer..

I could reproduce your error and have a fix. I still struggle to produce a unit test^^

If you need help testing, i'm willing to help.

eifinger commented 4 years ago

I Just published v0.1.0. This should fix the reconnection issue.

TargetCrafter commented 4 years ago

Seems to be working now, will monitor over the next couple of days.

Thanks @eifinger

JanosAudron commented 4 years ago

Even here seems working well, thank you. I would suggest, in case of lost connection, to reset counters (ETA, next attempt, etc.) and change current state to something else, meybe OFFLINE, instead staying stock to last reading (usually STOPPING)

JanosAudron commented 4 years ago

After a while, connection is stuck but I don't know why. I tried to reboot the server and seemed working well, but now is stuck...

JanosAudron commented 4 years ago

Again, rebooted the PC the connection didn't be enstablish again

eifinger commented 4 years ago

Do you see anything in the logs?

JanosAudron commented 4 years ago

No, nothing in the log. You can reproduce the issue detaching the cable from the PC.

TargetCrafter commented 4 years ago

Just had the same thing, after a pc restart both sensors are stuck in the PAUSED state. Just like @JanosAudron nothing in the logs.

eifinger commented 4 years ago

Guess I'll be pulling some Ethernet cables tonight^^

Thank you for providing feedback!

eifinger commented 4 years ago

Found the bug. Fix is on its way.

eifinger commented 4 years ago

v0.1.1 is out and should finally fix this!

TargetCrafter commented 4 years ago

Hate to be that guy, but... Just restarted my pc, and it is stuck on paused again.. After restarting HA it picks up the correct status again.

eifinger commented 4 years ago

@TargetCrafter do you use a password for the FoH Client on the PC that you restarted?

TargetCrafter commented 4 years ago

I did, but now that you talked about it, i tried to see if the issue still persists without a password, which is the case unfortunately..

eifinger commented 4 years ago

It seems there is a race condition somewhere. I will keep investigating!

eifinger commented 4 years ago

While I am trying to debug this, could you please include the following log configurations:

Either as data to the service call logger.set_level:

custom_components.foldingathomecontrol: debug
FoldingAtHomeControl: debug

Or in configuration.yaml:

logger:
  default: warning
  logs:
    custom_components.foldingathomecontrol: debug
    FoldingAtHomeControl: debug
eifinger commented 4 years ago

Just released v0.1.2. I tested it with two of my pc, pulling cables, turning off WIFI and quitting the FoH client.

I so hope this works for everyone now!

TargetCrafter commented 4 years ago

I installed the update as soon as possible, but just now received a task. Unfortunately this update didn't show up until a restart of HA.

I added the provided settings to my configuration.yaml file, where should i see the logs for this? i checked the logbook and dev tools logs, but saw nothing.

I'm starting to think i'm doing something wrong here..

eifinger commented 4 years ago

You should see them in the dev logs. IF there are errors.

For me it's working without errors with the latest version.

JanosAudron commented 4 years ago

One of my PC has been connected with a different IP (changed from wired to wireless connection) for 5-6 hours. Once reconnected with old IP, Home Assistant should recconnect itself but didn't. From log, I have this:

2020-04-12 18:33:35 ERROR (MainThread) [FoldingAtHomeControl.foldingathomecontrol] Could not connect to 192.168.0.128:36330 2020-04-12 18:33:48 ERROR (MainThread) [FoldingAtHomeControl.foldingathomecontrol] Could not connect to 192.168.0.128:36330 2020-04-12 18:34:01 ERROR (MainThread) [FoldingAtHomeControl.foldingathomecontrol] Could not connect to 192.168.0.128:36330 2020-04-12 18:34:14 ERROR (MainThread) [FoldingAtHomeControl.foldingathomecontrol] Could not connect to 192.168.0.128:36330 2020-04-12 18:34:29 ERROR (MainThread) [FoldingAtHomeControl.foldingathomecontrol] Timeout while trying to connect to 192.168.0.128:36330

JanosAudron commented 4 years ago

Attached at the present, foll log. More or less I reconnected the PC at "2020-04-12 18:34:29", since then no more data in the log, now is 20:07 FAH Log.txt

TargetCrafter commented 4 years ago

@JanosAudron Looks more like a problem on your PC than a problem with this addon. Maybe you didn't add the firewall rule to the profile you're using for you wifi connection?

JanosAudron commented 4 years ago

I have 4 sensors, 1 CPU and 1 GPU on two PCs. I don't think is firewall fault, for two reason. First is, after HA reboot sensors started to work again (see the pictures attached). Second is because, after PC reconnection, we do not see any more logs into the file, like the addon got stock for any reason.

home-assistant.log After reboot CPU After reboot GPU Before reboot CPU Before reboot GPU

eifinger commented 4 years ago

No more logs means it connected successfully again. What version of the integration are you on?

JanosAudron commented 4 years ago

No more logs means it connected successfully again.

I do not agree because at every reconnection you can find a welcome message in the log, but was not the case. DEBUG (MainThread) [FoldingAtHomeControl.serialconnection] Received welcome message: Welcome to the Folding@home Client command server.

I'm using version v.0.1.4

eifinger commented 4 years ago

@JanosAudron I suspect a race condition is the root cause of your error.

I set the is_connected flag of the serialconnection after the socket was opened but before the welcome message was received.

If the timeout for the whole connection is reached just before the welcome message was fully received, the operation is cancelled but the flag is_connected is still true leading to a wrong state.

I fixed this behavior. It will be included in the next release.

However I cannot be sure that this is really what happened here.

baronKanon commented 4 years ago

If I reboot HASS, I get a disconnect, and need to reinitiate the custom component still :(

eifinger commented 4 years ago

@baronKanon could explain a little bit further?

When you restart Homeassistant, the sensor shows unavailable ? What do your logs tell you when being set as explained here?

baronKanon commented 4 years ago

@eifinger The sensor is unavailable after Home assistant restart, and the debug shows timeout error for connection to the client (on my PC). I usually have some timeouts for connections to external services during startup of HASS, as my RPi 3b+ has some stuff to set up. 20 minutes after initialization of the component, still no updating from the computer available.

I would like to add that I rename the entity ID to something easy to read, but no new entity is created. I guess this initial setup of the socket is also the culprit for not updating after my computer has been turned off for the night.

Are there any heartbeat to the client setup (if no updates for x amount of time - retry socket or something)? How many retries and with what frequency are they run during startup (so the system has the possibility to get up and running before the connection can be established)?

eifinger commented 4 years ago

@baronKanon that is expected behavior. If during startup the connection attempt times out it is assumed that the endpoint is not reachable and is disabled.

However I do see your point in changing this.

Could you please open another issue for that?

JanosAudron commented 4 years ago

I paused client more or less at 10.11am. After 1 hour and half, at 11.46 (read on FoldingAtHome log) I unpaused it, but at that time addon reported a connection timeout. Actually, for 10-20 seconds even FAH client lost the connection as was reporting "updating" Now, calcutation is running but addon is reporting unavailable.

Log from addon:

2020-04-18 17:06:05 WARNING (MainThread) [homeassistant.loader] You are using a custom integration for foldingathomecontrol which has not been tested by Home Assistant. This component might cause stability problems, be sure to disable it if you experience issues with Home Assistant. 2020-04-18 17:06:16 DEBUG (MainThread) [FoldingAtHomeControl.foldingathomecontrol] Registered callback 2020-04-18 17:06:16 DEBUG (MainThread) [FoldingAtHomeControl.foldingathomecontrol] Registered callback 2020-04-18 17:06:18 DEBUG (MainThread) [FoldingAtHomeControl.serialconnection] Received welcome message: Welcome to the FAHClient command server. 2020-04-18 17:06:18 DEBUG (MainThread) [FoldingAtHomeControl.foldingathomecontrol] Registered callback 2020-04-18 17:06:18 DEBUG (MainThread) [FoldingAtHomeControl.foldingathomecontrol] Registered callback 2020-04-18 17:06:18 DEBUG (MainThread) [FoldingAtHomeControl.serialconnection] Received welcome message: Welcome to the Folding@home Client command server.

2020-04-19 11:46:15 ERROR (MainThread) [FoldingAtHomeControl.serialconnection] Timeout while trying to read from 192.168.0.1:36330

2020-04-19 11:46:17 ERROR (MainThread) [custom_components.foldingathomecontrol] Disconnected from 192.168.0.1:36330. Trying to reconnect. 2020-04-19 11:46:24 ERROR (MainThread) [FoldingAtHomeControl.foldingathomecontrol] Timeout while trying to connect to 192.168.0.1:36330 2020-04-19 11:46:24 ERROR (MainThread) [FoldingAtHomeControl.foldingathomecontrol] Timeout while trying to connect to 192.168.0.1:36330

Log from FoldingAtHome (please consider 2h time shift)

07:29:21:WU00:FS01:0x22:Completed 1760000 out of 2000000 steps (88%) 07:33:39:WU00:FS01:0x22:Completed 1780000 out of 2000000 steps (89%) 07:37:56:WU00:FS01:0x22:Completed 1800000 out of 2000000 steps (90%) 07:37:58:WU02:FS01:Connecting to 65.254.110.245:80 07:37:59:WARNING:WU02:FS01:Failed to get assignment from '65.254.110.245:80': No WUs available for this configuration 07:37:59:WU02:FS01:Connecting to 18.218.241.186:80 07:37:59:WARNING:WU02:FS01:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration 07:37:59:WU02:FS01:Connecting to 65.254.110.245:80 07:38:00:WARNING:WU02:FS01:Failed to get assignment from '65.254.110.245:80': No WUs available for this configuration 07:38:00:WU02:FS01:Connecting to 18.218.241.186:80 07:38:00:WARNING:WU02:FS01:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration 07:38:00:ERROR:WU02:FS01:Exception: Could not get an assignment 07:38:01:WU02:FS01:Connecting to 65.254.110.245:80 07:38:01:WARNING:WU02:FS01:Failed to get assignment from '65.254.110.245:80': No WUs available for this configuration 07:38:01:WU02:FS01:Connecting to 18.218.241.186:80 07:38:02:WARNING:WU02:FS01:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration 07:38:02:WU02:FS01:Connecting to 65.254.110.245:80 07:38:02:WARNING:WU02:FS01:Failed to get assignment from '65.254.110.245:80': No WUs available for this configuration 07:38:02:WU02:FS01:Connecting to 18.218.241.186:80 07:38:03:WARNING:WU02:FS01:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration 07:38:03:ERROR:WU02:FS01:Exception: Could not get an assignment 07:39:01:WU02:FS01:Connecting to 65.254.110.245:80 07:39:01:WARNING:WU02:FS01:Failed to get assignment from '65.254.110.245:80': No WUs available for this configuration 07:39:02:WU02:FS01:Connecting to 18.218.241.186:80 07:39:02:WARNING:WU02:FS01:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration 07:39:02:WU02:FS01:Connecting to 65.254.110.245:80 07:39:03:WARNING:WU02:FS01:Failed to get assignment from '65.254.110.245:80': No WUs available for this configuration 07:39:03:WU02:FS01:Connecting to 18.218.241.186:80 07:39:03:WARNING:WU02:FS01:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration 07:39:03:ERROR:WU02:FS01:Exception: Could not get an assignment 07:40:38:WU02:FS01:Connecting to 65.254.110.245:80 07:40:39:WU02:FS01:Assigned to work server 13.82.98.119 07:40:39:WU02:FS01:Requesting new work unit for slot 01: RUNNING gpu:0:GP106 [GeForce GTX 1060 3GB] 3935 from 13.82.98.119 07:40:39:WU02:FS01:Connecting to 13.82.98.119:8080 Date: 2020-04-19 07:41:25:WU02:FS01:Downloading 22.03MiB 07:41:31:WU02:FS01:Download 10.50% 07:41:37:WU02:FS01:Download 22.98% 07:41:43:WU02:FS01:Download 35.75% 07:41:49:WU02:FS01:Download 48.24% 07:41:55:WU02:FS01:Download 60.72% 07:42:01:WU02:FS01:Download 73.49% 07:42:07:WU02:FS01:Download 86.26% 07:42:13:WU02:FS01:Download 98.75% 07:42:13:WU02:FS01:Download complete 07:42:13:WU02:FS01:Received Unit: id:02 state:DOWNLOAD error:NO_ERROR project:14545 run:0 clone:653 gen:12 core:0x22 unit:0x000000170d5262775e80e224425749f3 07:42:19:WU00:FS01:0x22:Completed 1820000 out of 2000000 steps (91%) 07:46:36:WU00:FS01:0x22:Completed 1840000 out of 2000000 steps (92%) 07:50:53:WU00:FS01:0x22:Completed 1860000 out of 2000000 steps (93%) 07:55:13:WU00:FS01:0x22:Completed 1880000 out of 2000000 steps (94%) 07:59:31:WU00:FS01:0x22:Completed 1900000 out of 2000000 steps (95%) 08:03:50:WU00:FS01:0x22:Completed 1920000 out of 2000000 steps (96%) 08:08:18:WU00:FS01:0x22:Completed 1940000 out of 2000000 steps (97%)

08:11:27:FS01:Paused

08:11:27:FS01:Shutting core down 08:11:28:WU00:FS01:0x22:WARNING:Console control signal 1 on PID 18536 08:11:28:WU00:FS01:0x22:Lost lifeline PID 18800, exiting 08:11:28:WU00:FS01:0x22:Exiting, please wait. . . 08:11:28:WU00:FS01:0x22:Folding@home Core Shutdown: INTERRUPTED 08:11:28:WU00:FS01:FahCore returned: INTERRUPTED (102 = 0x66)

09:46:04:FS01:Unpaused

09:46:04:WU00:FS01:Starting 09:46:04:WU00:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" E:\FoldingHome\cores/cores.foldingathome.org/v7/win/64bit/Core_22.fah/FahCore_22.exe -dir 00 -suffix 01 -version 706 -lifeline 6468 -checkpoint 15 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu 0 09:46:04:WU00:FS01:Started FahCore on PID 18152 09:46:08:WU00:FS01:Core PID:21408 09:46:08:WU00:FS01:FahCore 0x22 started 09:46:10:WU00:FS01:0x22: Log Started 2020-04-19T09:46:10Z 09:46:10:WU00:FS01:0x22: Core22 Folding@home Core 09:46:10:WU00:FS01:0x22: Type: 0x22 09:46:10:WU00:FS01:0x22: Core: Core22 09:46:10:WU00:FS01:0x22: Website: https://foldingathome.org/ 09:46:10:WU00:FS01:0x22: Copyright: (c) 2009-2018 foldingathome.org 09:46:10:WU00:FS01:0x22: Author: John Chodera john.chodera@choderalab.org and Rafal Wiewiora 09:46:10:WU00:FS01:0x22: rafal.wiewiora@choderalab.org 09:46:10:WU00:FS01:0x22: Args: -dir 00 -suffix 01 -version 706 -lifeline 18152 -checkpoint 15 09:46:10:WU00:FS01:0x22: -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device 09:46:10:WU00:FS01:0x22: 0 -gpu 0 09:46:10:WU00:FS01:0x22: Config: 09:46:10:WU00:FS01:0x22:**** Build ***** 09:46:10:WU00:FS01:0x22: Version: 0.0.2 09:46:10:WU00:FS01:0x22: Date: Dec 6 2019 09:46:10:WU00:FS01:0x22: Time: 21:30:31 09:46:10:WU00:FS01:0x22: Repository: Git 09:46:10:WU00:FS01:0x22: Revision: abeb39247cc72df5af0f63723edafadb23d5dfbe 09:46:10:WU00:FS01:0x22: Branch: HEAD 09:46:10:WU00:FS01:0x22: Compiler: Visual C++ 2008 09:46:10:WU00:FS01:0x22: Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT 09:46:10:WU00:FS01:0x22: Platform: win32 10 09:46:10:WU00:FS01:0x22: Bits: 64 09:46:10:WU00:FS01:0x22: Mode: Release 09:46:10:WU00:FS01:0x22:**** System **** 09:46:10:WU00:FS01:0x22: CPU: Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz 09:46:10:WU00:FS01:0x22: CPU ID: GenuineIntel Family 6 Model 60 Stepping 3 09:46:10:WU00:FS01:0x22: CPUs: 4 09:46:10:WU00:FS01:0x22: Memory: 7.94GiB 09:46:10:WU00:FS01:0x22:Free Memory: 3.08GiB 09:46:10:WU00:FS01:0x22: Threads: WINDOWS_THREADS 09:46:10:WU00:FS01:0x22: OS Version: 6.2 09:46:10:WU00:FS01:0x22:Has Battery: true 09:46:10:WU00:FS01:0x22: On Battery: false 09:46:10:WU00:FS01:0x22: UTC Offset: 2 09:46:10:WU00:FS01:0x22: PID: 21408 09:46:10:WU00:FS01:0x22: CWD: E:\FoldingHome\work 09:46:10:WU00:FS01:0x22: OS: Windows 10 Pro 09:46:10:WU00:FS01:0x22: OS Arch: AMD64 09:46:10:WU00:FS01:0x22:**** 09:46:10:WU00:FS01:0x22:Project: 14432 (Run 0, Clone 1394, Gen 11) 09:46:10:WU00:FS01:0x22:Unit: 0x0000000e0d5262775e8b4d5706ef330d 09:46:10:WU00:FS01:0x22:Digital signatures verified 09:46:10:WU00:FS01:0x22:Folding@home GPU Core22 Folding@home Core 09:46:10:WU00:FS01:0x22:Version 0.0.2 09:46:10:WU00:FS01:0x22: Found a checkpoint file 09:47:23:WARNING:WU00:FS01:Detected clock skew (1 mins 13 secs), I/O delay, laptop hibernation or other slowdown noted, adjusting time estimates 09:47:31:WU00:FS01:0x22:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900 09:51:56:WU00:FS01:0x22:Completed 1820000 out of 2000000 steps (91%)

eifinger commented 4 years ago

Thank you for further logs. I am actually making progress with the next release but stumbled into some async issues.

I currently have a lot to do at work and need to give my mind something else to do than programming.

I plan to continue work on this next weekend.

TargetCrafter commented 4 years ago

After the latest update, the issue still exists..

eifinger commented 4 years ago

I just released v1.0.0.

Feel free to report back if this fixes this issue finally =)

eifinger commented 4 years ago

BREAKING CHANGE Please remove all entries for this integration, update and then readd them. BREAKING CHANGE

TargetCrafter commented 4 years ago

It seems like the problem has been solved. After restarting my computer, the status changed automatically. I'll try again in a couple of days.

Good job!

eifinger commented 4 years ago

@TargetCrafter thank you for testing. I will close the issue. If the problem appear again feel free to reopen it.

TargetCrafter commented 4 years ago

Sensors are stuck again, is there a time limit for it to reconnect? Right now i've not been folding for 1 hours, but when i tried it yesterday by quickly restarting my pc it did work.

eifinger commented 4 years ago
  1. On which value are your sensors stuck? The latest or unknown
  2. Which pc did you restart? The folding pc or homeassistant pc?`
  3. You have been folding for 1 hour now, but the sensors do not show current data?
  4. Do you have any log messages?
  5. If you changed the options, what is your update rate and read timeout?
TargetCrafter commented 4 years ago
  1. The sensors are stuck on pause, this was their last known state.
  2. i restarted the FaH pc.
  3. been folding for 3 hours now, still stuck at their last known state from 16 hours ago.
  4. don't know where.
  5. didn't change them, 5 and 15 respectively.
eifinger commented 4 years ago

You can go directly to the url <your-ha-url>/developer-tools/logs or Go via the UI on the bottom left the Hammer Icon "Developer Tools" and then the tab "Logs"

TargetCrafter commented 4 years ago

There are no errors there regarding this addon.

eifinger commented 4 years ago

Unless you turned off all error logging there has to be an error from when you turned off your FaH Pc that the connection got lost.

I could implement a "reset_connection" service however.