Closed TargetCrafter closed 4 years ago
Thank you for opening an issue. I will look into it.
I just changed the configuration to only use 8 cores on my cpu, and the sensor in HA changed accordingly. So it seems like there is still a connection between the PI and my pc, it's just not updating the status.
After some more digging, it looks like the addon is adding a new sensor every time you change a setting like amount of cores used. I first change the setting to 4 cores, after wich i changed it to 8 cores, and in HA i have 3 CPU sensors now: cpu:15, cpu:4 and cpu:8.
Not that i'm changing the setting every day, but i'd be nice if this would be cleaned up to prevent having loads of useless sensors.
Thanks for the feedback. I created a new Issue to solve that. Cant promise when its done.
As for your initial problem. Are you sure that the PC was working on a workunit and thus the state should have changed?
Yes, the pc was running a calculation at that time, so it should've changed from paused to running.
It's also just showing a status from 14 hours ago, eventough i just ran a task this morning:
I could reproduce your error and have a fix. I still struggle to produce a unit test^^
Almost same here. After PC reboot (or just the F@H server), HA does not reconnect and shows old values. But in my case, rebooting HA fix the problem.
Almost same here. After PC reboot (or just the F@H server), HA does not reconnect and shows old values. But in my case, rebooting HA fix the problem.
Yes, rebooting HA fixes the problem, but i don't want to restart HA every time i restart my computer..
I could reproduce your error and have a fix. I still struggle to produce a unit test^^
If you need help testing, i'm willing to help.
I Just published v0.1.0. This should fix the reconnection issue.
Seems to be working now, will monitor over the next couple of days.
Thanks @eifinger
Even here seems working well, thank you. I would suggest, in case of lost connection, to reset counters (ETA, next attempt, etc.) and change current state to something else, meybe OFFLINE, instead staying stock to last reading (usually STOPPING)
After a while, connection is stuck but I don't know why. I tried to reboot the server and seemed working well, but now is stuck...
Again, rebooted the PC the connection didn't be enstablish again
Do you see anything in the logs?
No, nothing in the log. You can reproduce the issue detaching the cable from the PC.
Just had the same thing, after a pc restart both sensors are stuck in the PAUSED state. Just like @JanosAudron nothing in the logs.
Guess I'll be pulling some Ethernet cables tonight^^
Thank you for providing feedback!
Found the bug. Fix is on its way.
v0.1.1 is out and should finally fix this!
Hate to be that guy, but... Just restarted my pc, and it is stuck on paused again.. After restarting HA it picks up the correct status again.
@TargetCrafter do you use a password for the FoH Client on the PC that you restarted?
I did, but now that you talked about it, i tried to see if the issue still persists without a password, which is the case unfortunately..
It seems there is a race condition somewhere. I will keep investigating!
While I am trying to debug this, could you please include the following log configurations:
Either as data to the service call logger.set_level
:
custom_components.foldingathomecontrol: debug
FoldingAtHomeControl: debug
Or in configuration.yaml:
logger:
default: warning
logs:
custom_components.foldingathomecontrol: debug
FoldingAtHomeControl: debug
Just released v0.1.2. I tested it with two of my pc, pulling cables, turning off WIFI and quitting the FoH client.
I so hope this works for everyone now!
I installed the update as soon as possible, but just now received a task. Unfortunately this update didn't show up until a restart of HA.
I added the provided settings to my configuration.yaml file, where should i see the logs for this? i checked the logbook and dev tools logs, but saw nothing.
I'm starting to think i'm doing something wrong here..
You should see them in the dev logs. IF there are errors.
For me it's working without errors with the latest version.
One of my PC has been connected with a different IP (changed from wired to wireless connection) for 5-6 hours. Once reconnected with old IP, Home Assistant should recconnect itself but didn't. From log, I have this:
2020-04-12 18:33:35 ERROR (MainThread) [FoldingAtHomeControl.foldingathomecontrol] Could not connect to 192.168.0.128:36330 2020-04-12 18:33:48 ERROR (MainThread) [FoldingAtHomeControl.foldingathomecontrol] Could not connect to 192.168.0.128:36330 2020-04-12 18:34:01 ERROR (MainThread) [FoldingAtHomeControl.foldingathomecontrol] Could not connect to 192.168.0.128:36330 2020-04-12 18:34:14 ERROR (MainThread) [FoldingAtHomeControl.foldingathomecontrol] Could not connect to 192.168.0.128:36330 2020-04-12 18:34:29 ERROR (MainThread) [FoldingAtHomeControl.foldingathomecontrol] Timeout while trying to connect to 192.168.0.128:36330
Attached at the present, foll log. More or less I reconnected the PC at "2020-04-12 18:34:29", since then no more data in the log, now is 20:07 FAH Log.txt
@JanosAudron Looks more like a problem on your PC than a problem with this addon. Maybe you didn't add the firewall rule to the profile you're using for you wifi connection?
I have 4 sensors, 1 CPU and 1 GPU on two PCs. I don't think is firewall fault, for two reason. First is, after HA reboot sensors started to work again (see the pictures attached). Second is because, after PC reconnection, we do not see any more logs into the file, like the addon got stock for any reason.
No more logs means it connected successfully again. What version of the integration are you on?
No more logs means it connected successfully again.
I do not agree because at every reconnection you can find a welcome message in the log, but was not the case.
DEBUG (MainThread) [FoldingAtHomeControl.serialconnection] Received welcome message: Welcome to the Folding@home Client command server.
I'm using version v.0.1.4
@JanosAudron I suspect a race condition is the root cause of your error.
I set the is_connected
flag of the serialconnection after the socket was opened but before the welcome message was received.
If the timeout for the whole connection is reached just before the welcome message was fully received, the operation is cancelled but the flag is_connected
is still true leading to a wrong state.
I fixed this behavior. It will be included in the next release.
However I cannot be sure that this is really what happened here.
If I reboot HASS, I get a disconnect, and need to reinitiate the custom component still :(
@baronKanon could explain a little bit further?
When you restart Homeassistant, the sensor shows unavailable
?
What do your logs tell you when being set as explained here?
@eifinger The sensor is unavailable after Home assistant restart, and the debug shows timeout error for connection to the client (on my PC). I usually have some timeouts for connections to external services during startup of HASS, as my RPi 3b+ has some stuff to set up. 20 minutes after initialization of the component, still no updating from the computer available.
I would like to add that I rename the entity ID to something easy to read, but no new entity is created. I guess this initial setup of the socket is also the culprit for not updating after my computer has been turned off for the night.
Are there any heartbeat to the client setup (if no updates for x amount of time - retry socket or something)? How many retries and with what frequency are they run during startup (so the system has the possibility to get up and running before the connection can be established)?
@baronKanon that is expected behavior. If during startup the connection attempt times out it is assumed that the endpoint is not reachable and is disabled.
However I do see your point in changing this.
Could you please open another issue for that?
I paused client more or less at 10.11am. After 1 hour and half, at 11.46 (read on FoldingAtHome log) I unpaused it, but at that time addon reported a connection timeout. Actually, for 10-20 seconds even FAH client lost the connection as was reporting "updating" Now, calcutation is running but addon is reporting unavailable.
Log from addon:
2020-04-18 17:06:05 WARNING (MainThread) [homeassistant.loader] You are using a custom integration for foldingathomecontrol which has not been tested by Home Assistant. This component might cause stability problems, be sure to disable it if you experience issues with Home Assistant. 2020-04-18 17:06:16 DEBUG (MainThread) [FoldingAtHomeControl.foldingathomecontrol] Registered callback 2020-04-18 17:06:16 DEBUG (MainThread) [FoldingAtHomeControl.foldingathomecontrol] Registered callback 2020-04-18 17:06:18 DEBUG (MainThread) [FoldingAtHomeControl.serialconnection] Received welcome message: Welcome to the FAHClient command server. 2020-04-18 17:06:18 DEBUG (MainThread) [FoldingAtHomeControl.foldingathomecontrol] Registered callback 2020-04-18 17:06:18 DEBUG (MainThread) [FoldingAtHomeControl.foldingathomecontrol] Registered callback 2020-04-18 17:06:18 DEBUG (MainThread) [FoldingAtHomeControl.serialconnection] Received welcome message: Welcome to the Folding@home Client command server.
2020-04-19 11:46:15 ERROR (MainThread) [FoldingAtHomeControl.serialconnection] Timeout while trying to read from 192.168.0.1:36330
2020-04-19 11:46:17 ERROR (MainThread) [custom_components.foldingathomecontrol] Disconnected from 192.168.0.1:36330. Trying to reconnect. 2020-04-19 11:46:24 ERROR (MainThread) [FoldingAtHomeControl.foldingathomecontrol] Timeout while trying to connect to 192.168.0.1:36330 2020-04-19 11:46:24 ERROR (MainThread) [FoldingAtHomeControl.foldingathomecontrol] Timeout while trying to connect to 192.168.0.1:36330
Log from FoldingAtHome (please consider 2h time shift)
07:29:21:WU00:FS01:0x22:Completed 1760000 out of 2000000 steps (88%) 07:33:39:WU00:FS01:0x22:Completed 1780000 out of 2000000 steps (89%) 07:37:56:WU00:FS01:0x22:Completed 1800000 out of 2000000 steps (90%) 07:37:58:WU02:FS01:Connecting to 65.254.110.245:80 07:37:59:WARNING:WU02:FS01:Failed to get assignment from '65.254.110.245:80': No WUs available for this configuration 07:37:59:WU02:FS01:Connecting to 18.218.241.186:80 07:37:59:WARNING:WU02:FS01:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration 07:37:59:WU02:FS01:Connecting to 65.254.110.245:80 07:38:00:WARNING:WU02:FS01:Failed to get assignment from '65.254.110.245:80': No WUs available for this configuration 07:38:00:WU02:FS01:Connecting to 18.218.241.186:80 07:38:00:WARNING:WU02:FS01:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration 07:38:00:ERROR:WU02:FS01:Exception: Could not get an assignment 07:38:01:WU02:FS01:Connecting to 65.254.110.245:80 07:38:01:WARNING:WU02:FS01:Failed to get assignment from '65.254.110.245:80': No WUs available for this configuration 07:38:01:WU02:FS01:Connecting to 18.218.241.186:80 07:38:02:WARNING:WU02:FS01:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration 07:38:02:WU02:FS01:Connecting to 65.254.110.245:80 07:38:02:WARNING:WU02:FS01:Failed to get assignment from '65.254.110.245:80': No WUs available for this configuration 07:38:02:WU02:FS01:Connecting to 18.218.241.186:80 07:38:03:WARNING:WU02:FS01:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration 07:38:03:ERROR:WU02:FS01:Exception: Could not get an assignment 07:39:01:WU02:FS01:Connecting to 65.254.110.245:80 07:39:01:WARNING:WU02:FS01:Failed to get assignment from '65.254.110.245:80': No WUs available for this configuration 07:39:02:WU02:FS01:Connecting to 18.218.241.186:80 07:39:02:WARNING:WU02:FS01:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration 07:39:02:WU02:FS01:Connecting to 65.254.110.245:80 07:39:03:WARNING:WU02:FS01:Failed to get assignment from '65.254.110.245:80': No WUs available for this configuration 07:39:03:WU02:FS01:Connecting to 18.218.241.186:80 07:39:03:WARNING:WU02:FS01:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration 07:39:03:ERROR:WU02:FS01:Exception: Could not get an assignment 07:40:38:WU02:FS01:Connecting to 65.254.110.245:80 07:40:39:WU02:FS01:Assigned to work server 13.82.98.119 07:40:39:WU02:FS01:Requesting new work unit for slot 01: RUNNING gpu:0:GP106 [GeForce GTX 1060 3GB] 3935 from 13.82.98.119 07:40:39:WU02:FS01:Connecting to 13.82.98.119:8080 Date: 2020-04-19 07:41:25:WU02:FS01:Downloading 22.03MiB 07:41:31:WU02:FS01:Download 10.50% 07:41:37:WU02:FS01:Download 22.98% 07:41:43:WU02:FS01:Download 35.75% 07:41:49:WU02:FS01:Download 48.24% 07:41:55:WU02:FS01:Download 60.72% 07:42:01:WU02:FS01:Download 73.49% 07:42:07:WU02:FS01:Download 86.26% 07:42:13:WU02:FS01:Download 98.75% 07:42:13:WU02:FS01:Download complete 07:42:13:WU02:FS01:Received Unit: id:02 state:DOWNLOAD error:NO_ERROR project:14545 run:0 clone:653 gen:12 core:0x22 unit:0x000000170d5262775e80e224425749f3 07:42:19:WU00:FS01:0x22:Completed 1820000 out of 2000000 steps (91%) 07:46:36:WU00:FS01:0x22:Completed 1840000 out of 2000000 steps (92%) 07:50:53:WU00:FS01:0x22:Completed 1860000 out of 2000000 steps (93%) 07:55:13:WU00:FS01:0x22:Completed 1880000 out of 2000000 steps (94%) 07:59:31:WU00:FS01:0x22:Completed 1900000 out of 2000000 steps (95%) 08:03:50:WU00:FS01:0x22:Completed 1920000 out of 2000000 steps (96%) 08:08:18:WU00:FS01:0x22:Completed 1940000 out of 2000000 steps (97%)
08:11:27:FS01:Paused
08:11:27:FS01:Shutting core down 08:11:28:WU00:FS01:0x22:WARNING:Console control signal 1 on PID 18536 08:11:28:WU00:FS01:0x22:Lost lifeline PID 18800, exiting 08:11:28:WU00:FS01:0x22:Exiting, please wait. . . 08:11:28:WU00:FS01:0x22:Folding@home Core Shutdown: INTERRUPTED 08:11:28:WU00:FS01:FahCore returned: INTERRUPTED (102 = 0x66)
09:46:04:FS01:Unpaused
09:46:04:WU00:FS01:Starting 09:46:04:WU00:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" E:\FoldingHome\cores/cores.foldingathome.org/v7/win/64bit/Core_22.fah/FahCore_22.exe -dir 00 -suffix 01 -version 706 -lifeline 6468 -checkpoint 15 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu 0 09:46:04:WU00:FS01:Started FahCore on PID 18152 09:46:08:WU00:FS01:Core PID:21408 09:46:08:WU00:FS01:FahCore 0x22 started 09:46:10:WU00:FS01:0x22: Log Started 2020-04-19T09:46:10Z 09:46:10:WU00:FS01:0x22: Core22 Folding@home Core 09:46:10:WU00:FS01:0x22: Type: 0x22 09:46:10:WU00:FS01:0x22: Core: Core22 09:46:10:WU00:FS01:0x22: Website: https://foldingathome.org/ 09:46:10:WU00:FS01:0x22: Copyright: (c) 2009-2018 foldingathome.org 09:46:10:WU00:FS01:0x22: Author: John Chodera john.chodera@choderalab.org and Rafal Wiewiora 09:46:10:WU00:FS01:0x22: rafal.wiewiora@choderalab.org 09:46:10:WU00:FS01:0x22: Args: -dir 00 -suffix 01 -version 706 -lifeline 18152 -checkpoint 15 09:46:10:WU00:FS01:0x22: -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device 09:46:10:WU00:FS01:0x22: 0 -gpu 0 09:46:10:WU00:FS01:0x22: Config: 09:46:10:WU00:FS01:0x22:**** Build ***** 09:46:10:WU00:FS01:0x22: Version: 0.0.2 09:46:10:WU00:FS01:0x22: Date: Dec 6 2019 09:46:10:WU00:FS01:0x22: Time: 21:30:31 09:46:10:WU00:FS01:0x22: Repository: Git 09:46:10:WU00:FS01:0x22: Revision: abeb39247cc72df5af0f63723edafadb23d5dfbe 09:46:10:WU00:FS01:0x22: Branch: HEAD 09:46:10:WU00:FS01:0x22: Compiler: Visual C++ 2008 09:46:10:WU00:FS01:0x22: Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT 09:46:10:WU00:FS01:0x22: Platform: win32 10 09:46:10:WU00:FS01:0x22: Bits: 64 09:46:10:WU00:FS01:0x22: Mode: Release 09:46:10:WU00:FS01:0x22:**** System **** 09:46:10:WU00:FS01:0x22: CPU: Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz 09:46:10:WU00:FS01:0x22: CPU ID: GenuineIntel Family 6 Model 60 Stepping 3 09:46:10:WU00:FS01:0x22: CPUs: 4 09:46:10:WU00:FS01:0x22: Memory: 7.94GiB 09:46:10:WU00:FS01:0x22:Free Memory: 3.08GiB 09:46:10:WU00:FS01:0x22: Threads: WINDOWS_THREADS 09:46:10:WU00:FS01:0x22: OS Version: 6.2 09:46:10:WU00:FS01:0x22:Has Battery: true 09:46:10:WU00:FS01:0x22: On Battery: false 09:46:10:WU00:FS01:0x22: UTC Offset: 2 09:46:10:WU00:FS01:0x22: PID: 21408 09:46:10:WU00:FS01:0x22: CWD: E:\FoldingHome\work 09:46:10:WU00:FS01:0x22: OS: Windows 10 Pro 09:46:10:WU00:FS01:0x22: OS Arch: AMD64 09:46:10:WU00:FS01:0x22:**** 09:46:10:WU00:FS01:0x22:Project: 14432 (Run 0, Clone 1394, Gen 11) 09:46:10:WU00:FS01:0x22:Unit: 0x0000000e0d5262775e8b4d5706ef330d 09:46:10:WU00:FS01:0x22:Digital signatures verified 09:46:10:WU00:FS01:0x22:Folding@home GPU Core22 Folding@home Core 09:46:10:WU00:FS01:0x22:Version 0.0.2 09:46:10:WU00:FS01:0x22: Found a checkpoint file 09:47:23:WARNING:WU00:FS01:Detected clock skew (1 mins 13 secs), I/O delay, laptop hibernation or other slowdown noted, adjusting time estimates 09:47:31:WU00:FS01:0x22:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900 09:51:56:WU00:FS01:0x22:Completed 1820000 out of 2000000 steps (91%)
Thank you for further logs. I am actually making progress with the next release but stumbled into some async issues.
I currently have a lot to do at work and need to give my mind something else to do than programming.
I plan to continue work on this next weekend.
After the latest update, the issue still exists..
I just released v1.0.0.
Feel free to report back if this fixes this issue finally =)
BREAKING CHANGE Please remove all entries for this integration, update and then readd them. BREAKING CHANGE
It seems like the problem has been solved. After restarting my computer, the status changed automatically. I'll try again in a couple of days.
Good job!
@TargetCrafter thank you for testing. I will close the issue. If the problem appear again feel free to reopen it.
Sensors are stuck again, is there a time limit for it to reconnect? Right now i've not been folding for 1 hours, but when i tried it yesterday by quickly restarting my pc it did work.
unknown
You can go directly to the url <your-ha-url>/developer-tools/logs
or Go via the UI on the bottom left the Hammer Icon "Developer Tools" and then the tab "Logs"
There are no errors there regarding this addon.
Unless you turned off all error logging there has to be an error from when you turned off your FaH Pc that the connection got lost.
I could implement a "reset_connection" service however.
Version of the custom_component
v0.0.1
I had the integration working yesterday, but i shutdown the pc overnight, and now it's showing a status from 15 hours ago. Even after restarting HA, still no result.