FoldingAtHome / fah-web-client-bastet

Folding@home client frontend
GNU General Public License v3.0

Status icon is misleading and ETA is zero when unit is Running #176

Open Hou5e opened 1 month ago

Hou5e commented 1 month ago

With FAH v8.4.3, when viewing Remotes (on Windows), the resource groups sometimes lose contact and miss running-data updates, and are then shown like this: [image]

I think I've seen it on Linux as well. I've seen this about 2-3 times in the last day. It mostly happened while FAH was installing the update from v8.4.2 to v8.4.3, or when a PC came online and started folding again. Refreshing the browser fixes the issue.

It seems to happen when the FAH Web Control is opened again, either on that PC or on a separate PC: the 2nd instance causes the first instance to have the issue. Local network slow-downs might be causing it.

jcoffland commented 1 month ago

Those show the clock icon so they are waiting to retry the run.

The status text is confusing. I intentionally set up the status text so that, when waiting, it shows the status it's waiting on rather than just Waiting. It might be nice if it said something like Waiting to run, but then the text gets too long. Alternatively, it could just say Waiting, but then you don't know what it's waiting on. It could be waiting to rerun the core, waiting to download the core or WU, or waiting to retry uploading the results.

It is supposed to show wait_progress which I would expect to be non-zero. Also, ETA should probably be non-zero.

kbernhagen commented 4 weeks ago

If you drop all the gerunds and verb phrases, it could be very short, like Wait: Run and Wait: Assign.

Maybe label State instead of Status Text, which seems ugly to me.
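
A rough sketch of that short format in TypeScript, for illustration only. The state names and the `waiting` flag are hypothetical placeholders, not the client's actual fields; only the `Wait: Run` and `Wait: Assign` strings come from the suggestion above.

```typescript
// Hypothetical unit states; the real client may use different names.
type UnitState = "ASSIGN" | "DOWNLOAD" | "RUN" | "UPLOAD";

// Short label for each state, with gerunds and verb phrases dropped.
const SHORT_LABEL: Record<UnitState, string> = {
  ASSIGN: "Assign",
  DOWNLOAD: "Download",
  RUN: "Run",
  UPLOAD: "Upload",
};

// Prefix the label with "Wait: " when the unit is waiting to retry that state.
function stateLabel(state: UnitState, waiting: boolean): string {
  const label = SHORT_LABEL[state];
  return waiting ? `Wait: ${label}` : label;
}
```

With this, `stateLabel("RUN", true)` gives `Wait: Run`, and the same state without the wait flag gives just `Run`.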

Hou5e commented 2 weeks ago

> Those show the clock icon so they are waiting to retry the run.

Nope. I think this is where FAH literally missed the initial status message for the resource group, and doesn't know what the state is until the WU finishes and starts a new WU. Or, if you refresh the page, it will load all the data and fix itself, to show the correct 'Running' status icon and ETA times. Those resource groups with the waiting icon and 0 ETA really are 'Running', but the remote viewing is missing the information to display it correctly.

The title of this issue should be changed back to the original one...

jcoffland commented 1 week ago

Why do you think those WUs are running and not waiting? The status text says Running but that is because that is the state it is waiting to retry. The clock icon tells us that the WU is waiting.

Unless I'm still missing something, I think you're misinterpreting the bug in this case.

Hou5e commented 1 week ago

Yes, you are still mistaking the state: they are definitely not waiting. You can see the true state from other web browser pages viewing remotes (from the same PC or other PCs). If you refresh a web page showing the false ETA and icon states, it will fix itself and display the correct running icon and actual ETA time. You can also wait for the WU to complete (since it is not waiting), and the displayed information will correct itself when that WU uploads and the next WU starts.

Basically, this issue needs better error handling for the case where a state packet is missed or corrupted and the incremental information doesn't fix the state until the state changes. One option is to compare the "Running" text against the Running icon. If those 2 items are not in agreement, then ask for a full state update packet (instead of a partial update packet), or refresh the web page to force that to happen.
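
A minimal sketch of that check in TypeScript, under stated assumptions: the `GroupView` shape, the icon names, and the `requestFullState` callback are all hypothetical, not the Web Control's actual data model or API.

```typescript
// Hypothetical view of one resource group as the front end currently sees it.
interface GroupView {
  statusText: string; // e.g. "Running"
  icon: "running" | "waiting" | "paused";
}

// True when the text says Running but the icon says something else.
function looksInconsistent(view: GroupView): boolean {
  return view.statusText === "Running" && view.icon !== "running";
}

// Ask for a full state update when the view looks inconsistent.
// `requestFullState` is a caller-supplied callback, not a real client API.
function checkAndResync(view: GroupView, requestFullState: () => void): boolean {
  if (!looksInconsistent(view)) return false;
  requestFullState();
  return true;
}
```

If the unit really is waiting to retry, the extra full update should just redraw the same state, so the check would cost one request rather than change what is shown.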

jcoffland commented 1 week ago

Ok, if two different instances of Web Control are disagreeing then that's a problem. How often does this occur?

It's very unlikely that there are missing or corrupted updates. The protocol prevents this. It is possible that something is causing an exception to be thrown which can cause an update to be discarded. If this is the case then the thrown exception would show up in the browser's developer console. The developer console must be open when it happens though.
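
To make such an exception visible without watching the console at the right moment, something like the following could wrap the update handler. This is only a sketch: the `Update` shape, `applyUpdateSafely`, and the `updateErrors` list are hypothetical names, not existing client code.

```typescript
// Hypothetical shape of one incremental update message.
type Update = { path: string[]; value: unknown };

// Errors captured while applying updates, kept in memory for later inspection.
const updateErrors: { update: Update; error: unknown }[] = [];

// Run the supplied handler on one update. If it throws, record and log the
// exception instead of letting the update disappear silently.
function applyUpdateSafely(update: Update, apply: (u: Update) => void): boolean {
  try {
    apply(update);
    return true;
  } catch (error) {
    updateErrors.push({ update, error });
    console.error("Update discarded:", update, error);
    return false;
  }
}
```

A non-empty `updateErrors` list after the bad icon appears would point at a thrown exception rather than a missing message.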

Hou5e commented 1 week ago

Over the past 2 months, I have seen it fewer than 6 times. I've only seen it happen mid-day on weekdays (when almost all FAH clients are running, it is the hotter part of the day, there is more internet traffic, and the internet or router capacity is more likely to be exceeded and stop working for 1-5 minutes at unpredictable times). I've seen it most often when pausing FAH, updating to the latest FAH, and then resuming (as if the status gets lost to one FAH instance and not another somewhere in the shutdown / restart / run process). I have also seen it happen when a PC starts up for the day (I'm typically not watching the PCs then, so I would miss it most of the time). The affected resource groups seem arbitrary: in the original issue image, the 1-2 resource groups that missed an update message are on separate PCs. I'll try to leave a browser debug console open for this.

jcoffland commented 22 hours ago

I need a way to reproduce this.