Limit how many bad workunits are allowed in a row

hucker75 commented 1 year ago

I just had (and often have) an old machine screw up somehow so that the GPU stops behaving. It's fine for another few weeks after rebooting. Trouble is, until I see it misbehaving, it's downloading several workunits an hour (or presumably a lot more if we didn't have the current server overload problem). It just says BAD_WORKUNIT shortly after trying to create an OpenCL context or something (I forget the precise wording), then immediately gets another, and another, ruining my bonus points and, more importantly, wasting the server's time. Could it stop and somehow warn the user that something is wrong after a few bad ones? "Multiple failures" appearing in the web control where it usually says "running" would be good.
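
A minimal sketch of the behaviour requested above, not the client's actual code. The class name `SlotFailureGuard`, the threshold of 3, and the status strings are all assumptions made for illustration; the real client is free to count and report failures differently.

```python
from dataclasses import dataclass


@dataclass
class SlotFailureGuard:
    """Pause a slot after too many failed work units in a row (hypothetical)."""

    max_consecutive_failures: int = 3  # assumed threshold, not a client default
    consecutive_failures: int = 0
    status: str = "running"

    def record_result(self, succeeded: bool) -> str:
        """Update the counter with one work unit result and return the status."""
        if self.status == "multiple failures":
            # Stay paused until a person resets the slot.
            return self.status
        if succeeded:
            # Any good work unit clears the streak.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.max_consecutive_failures:
                self.status = "multiple failures"
        return self.status

    def reset(self) -> None:
        """Clear the counter, e.g. after the user reboots and resumes."""
        self.consecutive_failures = 0
        self.status = "running"
```

The guard only counts failures that arrive back to back, so an occasional bad work unit between good ones never pauses the slot.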

hucker75 commented 1 year ago

The message I'm responding to in here has disappeared, but I'll answer it anyway. I counted about 25 consecutive failures, only ten minutes apart, the time taken to get another workunit since I'm suffering from the EOF problem. I get this:

11:16:29:I1::WU129:There are 3 platforms available.
11:16:29:I1::WU129:Platform 0: Reference
11:16:29:I1::WU129:Platform 1: CPU
11:16:29:I1::WU129:Platform 2: OpenCL
11:16:29:I1::WU129:  opencl-device 1 specified
11:18:20:I1::WU129:Attempting to create OpenCL context:
11:18:20:I1::WU129:  Configuring platform OpenCL
11:18:20:I1::WU129:Failed to create OpenCL context:
11:18:20:I1::WU129:Illegal value for DeviceIndex: 1
11:18:20:I1::WU129:ERROR:125: Failed to create a GPU-enabled OpenMM Context.
11:18:20:I1::WU129:Saving result file ..\logfile_01.txt
11:18:20:I1::WU129:Saving result file science.log
11:18:20:I1::WU129:Folding@home Core Shutdown: BAD_WORK_UNIT
11:18:21:W ::WU129:Core returned BAD_WORK_UNIT (114)

Whereas the person I'm replying to said they got this:

 13:18:47:WU01:FS00:0x23:There are 3 platforms available.
 13:18:47:WU01:FS00:0x23:Platform 0: Reference
 13:18:47:WU01:FS00:0x23:Platform 1: CPU
 13:18:47:WU01:FS00:0x23:Platform 2: OpenCL
 13:18:47:WU01:FS00:0x23: opencl-device 0 specified
 13:19:26:WU01:FS00:0x23:ERROR:exception: 
 13:19:26:WU01:FS00:0x23:Saving result file ..\logfile_01.txt
 13:19:26:WU01:FS00:0x23:Saving result file science.log
 13:19:26:WU01:FS00:0x23:Folding@home Core Shutdown: BAD_WORK_UNIT
 13:19:26:WARNING:WU01:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
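
For anyone wanting to count such failures from a saved log, here is a rough sketch. It assumes the only lines that matter are the "Core returned" / "FahCore returned:" lines in the two formats quoted above; the function name `longest_bad_run` is made up for this example, and other result names are assumed to follow the same pattern.

```python
import re

# Matches both result lines quoted in this thread:
#   "Core returned BAD_WORK_UNIT (114)"
#   "FahCore returned: BAD_WORK_UNIT (114 = 0x72)"
# Other result names are assumed to use the same layout.
CORE_RESULT = re.compile(r"Core returned:? (\w+) \((\d+)")


def longest_bad_run(log_lines):
    """Return the longest run of consecutive BAD_WORK_UNIT core results."""
    longest = current = 0
    for line in log_lines:
        match = CORE_RESULT.search(line)
        if match is None:
            continue  # not a core result line
        if match.group(1) == "BAD_WORK_UNIT":
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # any other result breaks the streak
    return longest
```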

I've not adjusted max-slot-errors, although I don't know where to look to check. It's not mentioned in C:\ProgramData\FAHClient\config.xml

PS I've tried single and double and triple ticks and I can't get the code thing in here to behave!

jcoffland commented 1 year ago

The later log is from a v7 client.

11:16:29:I1::WU129:  opencl-device 1 specified
11:18:20:I1::WU129:Attempting to create OpenCL context:
11:18:20:I1::WU129:  Configuring platform OpenCL
11:18:20:I1::WU129:Failed to create OpenCL context:

The above errors mean that your GPU's OpenCL drivers are missing or not installed correctly, or that your PATH environment variable is set in a way that interferes with the core's ability to find the libs.

> PS I've tried single and double and triple ticks and I can't get the code thing in here to behave!

You hadn't closed the triple ticks. I fixed it. The block should look like this:

```
content. . .
```

hucker75 commented 1 year ago

The drivers are fine: once it gets a workunit, it runs it OK. BOINC also has no problem. This only started recently on all my machines, and nothing's changed except maybe some Windows updates.

My path is:

PATH=C:\Program Files (x86)\Common Files\Oracle\Java\javapath;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;C:\WINDOWS\System32\OpenSSH\;C:\Program Files (x86)\EaseUS\Todo Backup\bin;C:\Program Files\EmEditor;C:\Program Files\dotnet\;C:\Program Files (x86)\Microsoft SQL Server\150\Tools\Binn\;C:\Program Files\Microsoft SQL Server\150\Tools\Binn\;C:\Program Files (x86)\Microsoft SQL Server\150\DTS\Binn\;C:\Program Files\Microsoft SQL Server\150\DTS\Binn\;C:\Program Files\Microsoft SQL Server\Client SDK\ODBC\170\Tools\Binn\;C:\Program Files (x86)\Google\Cloud SDK\google-cloud-sdk\bin;C:\Program Files\PowerShell\7\;C:\Users\peter\AppData\Local\Microsoft\WindowsApps;C:\Program Files\FAHClient

Is anything wrong there?
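
One way to check a PATH like this by script rather than by eye is sketched below. It assumes the library in question is `OpenCL.dll` (the usual name of the OpenCL loader on Windows, not confirmed anywhere in this thread), and the helper name `find_library_on_path` is invented for the example. It only looks at PATH entries; Windows also searches other locations, such as the application directory, which this does not cover.

```python
import os


def find_library_on_path(path_value, library_name, separator=";"):
    """List the PATH directories that contain library_name, in search order."""
    hits = []
    seen = set()
    for entry in path_value.split(separator):
        directory = entry.strip().strip('"')
        if not directory:
            continue  # ignore empty entries such as a trailing separator
        key = os.path.normcase(os.path.normpath(directory))
        if key in seen:
            continue  # skip duplicate PATH entries
        seen.add(key)
        if os.path.isfile(os.path.join(directory, library_name)):
            hits.append(directory)
    return hits
```

Calling `find_library_on_path(os.environ["PATH"], "OpenCL.dll")` on the affected machine would show every PATH directory holding a copy. More than one hit, or a hit outside the system directory, would be worth a closer look.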

As for the code insert, I think the only thing I did wrong was not putting the closing ticks on a new line?

So why isn't the code insert button on GitHub working? Is that a GitHub fault or something specific to the Folding pages on GitHub? I just clicked the button and pasted the code in between the single markers I got.

jcoffland commented 1 year ago

The problem may be with the new core 0x23. It may be requiring something from OpenCL that your driver or GPU do not support.

I don't know why GitHub works the way it does.

hucker75 commented 1 year ago

I would suggest it's going to happen with a lot of GPUs. I have one RX560, one R9 Nano, and eleven R9 280X. They're old, but not that old, I reckon a lot of people will have them or similar. The R9 Nano does OpenCL 2.0 properly, and it is also experiencing the problem (the 280X are OpenCL 1.0 and the RX 560 implements 2.0 badly). So is it failing repeatedly until it gives up and happens across a work unit it can handle?

If you're correct, and also the new core is not going to be made compatible, is there a way I can force the old core? Will the old core still work?

jcoffland commented 1 year ago

I've sent an email to the group working on core 0x23 asking for help with this.

anand-bhat commented 1 year ago

My apologies -- I authored the now deleted message but realised too late that my tests used the v7 client AND a project running core 0x23 while Peter's report was for the v8 client AND a project running on core 0x22. Sorry for the confusion.

@jcoffland - with the v7 client and 0x23, the slot paused after 10 consecutive failures as expected.

hucker75 commented 1 year ago

@jcoffland Sorry, I've got mixed up between two problems here. The EOF error, which means it takes about 10 attempts to get work, happens on all my machines and started in the last couple of weeks. The error described above, with multiple bad work units, was on only one dodgy machine where the driver had perhaps crashed, and it's fine after rebooting. The EOF error needs looking into, but the other one is rare.

jchodera commented 1 year ago

Thanks so much for the report! Do you have any (PROJ, RUN, CLONE, GEN) info for the WUs that failed this way?

hucker75 commented 1 year ago

If you're referring to the ones which caused repeated

 13:19:26:WU01:FS00:0x23:Folding@home Core Shutdown: BAD_WORK_UNIT
 13:19:26:WARNING:WU01:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)

I believe those were caused by a dodgy GPU and are nothing for you to fix, apart from the fact that my machine didn't stop after 10 consecutive failures.

If you mean the EOF problem I'm getting on every machine almost every time, it's not specific work units, it's just every one. My apologies for thinking jcoffland was referring to my other problem earlier, I was half asleep.

hucker75 commented 1 year ago

Still getting EOF, also HTTP_SERVICE_UNAVAILABLE: {"error":{"message":"Please wait","code":503}} This cycle usually repeats for about 10-20 minutes until a task is finally received.

16:53:11:I1::Added new work unit: cpus:0 gpus:gpu:39:00:00
16:53:11:I1::WU415:Requesting WU assignment for user PeterHucker_GRC_53ed9d9b7d568cb7eb1ccc25a7dc4492 team 224497
16:53:11:I1:OUT108:> POST https://assign2.foldingathome.org/api/assign HTTP/1.1
16:53:11:I3:Connecting to assign2.foldingathome.org:443
16:53:11:I1:OUT108:< assign2.foldingathome.org:443 HTTP/1.1 200 HTTP_OK
16:53:11:I1::WU415:Received WU assignment 3dSPWWRJ6GfuX9T7ak1W3G2sudK5rLyUacCd18W2MFo
16:53:11:I1::WU415:Downloading WU
16:53:12:I1:OUT109:> POST https://vav17.fah.temple.edu/api/assign HTTP/1.1
16:53:12:I3:Connecting to vav17.fah.temple.edu:443
16:53:12:I1:OUT109:< vav17.fah.temple.edu:443 HTTP/1.1 503 HTTP_SERVICE_UNAVAILABLE
16:53:12:E ::WU415:HTTP_SERVICE_UNAVAILABLE: {"error":{"message":"Please wait","code":503}}
16:53:12:I1::WU415:Retry #1 in 2 secs
16:53:14:I1::WU415:Requesting WU assignment for user PeterHucker_GRC_53ed9d9b7d568cb7eb1ccc25a7dc4492 team 224497
16:53:14:I1:OUT110:> POST https://assign3.foldingathome.org/api/assign HTTP/1.1
16:53:14:I3:Connecting to assign3.foldingathome.org:443
16:53:15:I1:OUT110:< assign3.foldingathome.org:443 HTTP/1.1 200 HTTP_OK
16:53:15:I1::WU415:Received WU assignment _QrtWzwt7-5YI30QoDGXYjeXHmQzHCCRvQavCVw8plg
16:53:15:I1::WU415:Downloading WU
16:53:15:I1:OUT111:> POST https://fah01.physik.fu-berlin.de/api/assign HTTP/1.1
16:53:15:I3:Connecting to fah01.physik.fu-berlin.de:443
16:53:15:E ::WU415:Failed response: EOF
16:53:15:I1::WU415:Retry #2 in 4 secs
16:53:19:I1::WU415:Downloading WU
16:53:19:I1:OUT112:> POST https://fah01.physik.fu-berlin.de/api/assign HTTP/1.1
16:53:19:I3:Connecting to fah01.physik.fu-berlin.de:443
16:53:19:E ::WU415:Failed response: EOF
16:53:19:I1::WU415:Retry #3 in 8 secs
16:53:27:I1::WU415:Downloading WU
16:53:27:I1:OUT113:> POST https://fah01.physik.fu-berlin.de/api/assign HTTP/1.1
16:53:27:I3:Connecting to fah01.physik.fu-berlin.de:443
16:53:27:E ::WU415:Failed response: EOF
16:53:27:I1::WU415:Retry #4 in 16 secs
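
The retry delays in this log (2, 4, 8, 16 secs) double each time. A minimal sketch of that pattern follows; the function name `retry_delay` and the optional `max_seconds` cap are assumptions for illustration, since the log shows no cap and says nothing about how the client computes the delay.

```python
def retry_delay(attempt, base_seconds=2, max_seconds=None):
    """Delay before retry number `attempt` (1-based), doubling each time."""
    if attempt < 1:
        raise ValueError("attempt must be 1 or greater")
    delay = base_seconds * 2 ** (attempt - 1)
    if max_seconds is not None:
        # Hypothetical cap; the log above does not show one.
        delay = min(delay, max_seconds)
    return delay
```

With the defaults, `[retry_delay(n) for n in range(1, 5)]` gives `[2, 4, 8, 16]`, matching retries #1 to #4 above.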

jcoffland commented 1 year ago

Is the EOF always from the same server(s)?

hucker75 commented 1 year ago

Yes, it's always fah01.physik.fu-berlin.de

jcoffland commented 1 year ago

There's a problem with fah01.physik.fu-berlin.de's SSL certificate. Because of this, it works with v7 clients, which use http, but not with v8 clients, which use https. I'll email the server's admins.
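
A rough way to probe a server's certificate from any machine with Python is sketched below. It uses only the standard `ssl` and `socket` modules; the function names are invented for this example, and this is not how the client itself connects.

```python
import socket
import ssl


def make_verifying_context():
    """Return a TLS context that checks the certificate chain and host name."""
    return ssl.create_default_context()


def probe_certificate(host, port=443, timeout=10.0):
    """Try a verified TLS handshake and return (ok, detail)."""
    context = make_verifying_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                return True, tls.version()
    except ssl.SSLCertVerificationError as err:
        # Expired, self-signed, or wrong-host certificates end up here.
        return False, f"certificate rejected: {err.verify_message}"
    except (ssl.SSLError, OSError) as err:
        # Refused connections, timeouts, and other TLS failures.
        return False, f"handshake failed: {err}"
```

Running `probe_certificate("fah01.physik.fu-berlin.de")` would report whether a verified handshake succeeds from that machine. A failure here would not by itself prove the EOF in the log has the same cause.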

hucker75 commented 1 year ago

Thanks, SSL has a lot to answer for. It's also messing up most of the BOINC projects, as the certificates expire and people don't notice until everything grinds to a halt. It all worked so well in the past...

FoldingAtHome / fah-client-bastet

Limit how many bad workunits are allowed in a row #141