Closed hucker75 closed 9 months ago
The message I'm responding to in here has disappeared, but I'll answer it anyway. I counted about 25 consecutive failures, only ten minutes apart, the time taken to get another workunit since I'm suffering from the EOF problem. I get this:
11:16:29:I1::WU129:There are 3 platforms available.
11:16:29:I1::WU129:Platform 0: Reference
11:16:29:I1::WU129:Platform 1: CPU
11:16:29:I1::WU129:Platform 2: OpenCL
11:16:29:I1::WU129: opencl-device 1 specified
11:18:20:I1::WU129:Attempting to create OpenCL context:
11:18:20:I1::WU129: Configuring platform OpenCL
11:18:20:I1::WU129:Failed to create OpenCL context:
11:18:20:I1::WU129:Illegal value for DeviceIndex: 1
11:18:20:I1::WU129:ERROR:125: Failed to create a GPU-enabled OpenMM Context.
11:18:20:I1::WU129:Saving result file ..\logfile_01.txt
11:18:20:I1::WU129:Saving result file science.log
11:18:20:I1::WU129:Folding@home Core Shutdown: BAD_WORK_UNIT
11:18:21:W ::WU129:Core returned BAD_WORK_UNIT (114)
Whereas the person I'm replying to said they got this:
13:18:47:WU01:FS00:0x23:There are 3 platforms available.
13:18:47:WU01:FS00:0x23:Platform 0: Reference
13:18:47:WU01:FS00:0x23:Platform 1: CPU
13:18:47:WU01:FS00:0x23:Platform 2: OpenCL
13:18:47:WU01:FS00:0x23: opencl-device 0 specified
13:19:26:WU01:FS00:0x23:ERROR:exception:
13:19:26:WU01:FS00:0x23:Saving result file ..\logfile_01.txt
13:19:26:WU01:FS00:0x23:Saving result file science.log
13:19:26:WU01:FS00:0x23:Folding@home Core Shutdown: BAD_WORK_UNIT
13:19:26:WARNING:WU01:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
I've not adjusted max-slot-errors, although I don't know where to look to check. It's not mentioned in C:\ProgramData\FAHClient\config.xml
PS I've tried single and double and triple ticks and I can't get the code thing in here to behave!
The later log is from a v7 client.
11:16:29:I1::WU129: opencl-device 1 specified
11:18:20:I1::WU129:Attempting to create OpenCL context:
11:18:20:I1::WU129: Configuring platform OpenCL
11:18:20:I1::WU129:Failed to create OpenCL context:
The above errors mean that your GPU's OpenCL drivers are missing or not installed correctly, or that your PATH environment variable is set in a way that interferes with the core's ability to find the libs.
PS I've tried single and double and triple ticks and I can't get the code thing in here to behave!
You hadn't closed the triple ticks. I fixed it. The block should look like this:
``` content. . . ```
The drivers are fine, once it gets a workunit it runs it ok. Boinc also has no problem. This only started recently on all my machines and nothing's changed except maybe some windows updates.
My path is:
PATH=C:\Program Files (x86)\Common Files\Oracle\Java\javapath;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;C:\WINDOWS\System32\OpenSSH\;C:\Program Files (x86)\EaseUS\Todo Backup\bin;C:\Program Files\EmEditor;C:\Program Files\dotnet\;C:\Program Files (x86)\Microsoft SQL Server\150\Tools\Binn\;C:\Program Files\Microsoft SQL Server\150\Tools\Binn\;C:\Program Files (x86)\Microsoft SQL Server\150\DTS\Binn\;C:\Program Files\Microsoft SQL Server\150\DTS\Binn\;C:\Program Files\Microsoft SQL Server\Client SDK\ODBC\170\Tools\Binn\;C:\Program Files (x86)\Google\Cloud SDK\google-cloud-sdk\bin;C:\Program Files\PowerShell\7\;C:\Users\peter\AppData\Local\Microsoft\WindowsApps;C:\Program Files\FAHClient
Is anything wrong there?
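One rough way to check whether PATH could be affecting which OpenCL loader is found is to list every PATH directory that holds a copy of `OpenCL.dll`. The following is a minimal sketch, not something the client does itself. The helper name `dirs_containing` is made up for illustration, and it assumes the loader has the usual Windows name `OpenCL.dll`.

```python
import ntpath
import os

def dirs_containing(path_value, filename, sep=";", exists=os.path.isfile):
    """Return the PATH directories, in search order, that contain filename.

    `exists` is injectable so the function can be tried out without
    touching the real filesystem.
    """
    hits = []
    for entry in path_value.split(sep):
        entry = entry.strip().strip('"')
        if not entry:
            continue  # skip empty entries such as a trailing ';'
        # ntpath.join applies Windows path rules on any platform.
        if exists(ntpath.join(entry, filename)):
            hits.append(entry)
    return hits
```

On the affected machine, `dirs_containing(os.environ["PATH"], "OpenCL.dll")` would list the candidates. More than one hit might point at a conflicting copy, but PATH is only one part of how Windows locates DLLs, so treat the result as a hint.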
As for the code insert, I think the only thing I did wrong was not putting the closing ticks on a new line?
So why isn't the code inserting button on Github working? Is that a Github fault or something specific to the Folding pages on Github? I just clicked this button and pasted the code in between the single markers I got:
The problem may be with the new core 0x23. It may be requiring something from OpenCL that your driver or GPU do not support.
I don't know why Github works the way it does.
I would suggest it's going to happen with a lot of GPUs. I have one RX560, one R9 Nano, and eleven R9 280X. They're old, but not that old, I reckon a lot of people will have them or similar. The R9 Nano does OpenCL 2.0 properly, and it is also experiencing the problem (the 280X are OpenCL 1.0 and the RX 560 implements 2.0 badly). So is it failing repeatedly until it gives up and happens across a work unit it can handle?
If you're correct, and also the new core is not going to be made compatible, is there a way I can force the old core? Will the old core still work?
I've sent an email to the group working on core 0x23 asking for help with this.
My apologies -- I authored the now deleted message but realised too late that my tests used the v7 client AND a project running core 0x23 while Peter's report was for the v8 client AND a project running on core 0x22. Sorry for the confusion.
@jcoffland - with the v7 client and 0x23, the slot paused after 10 consecutive failures as expected.
@jcoffland Sorry, I've got mixed up between two problems here. The EOF error which means it takes 10 attempts to not get work happens on all my machines and started in the last couple of weeks. The error described above with multiple bad work units was only one dodgy machine where the driver had perhaps crashed, and it's fine after rebooting. The EOF error needs looking into, but the other one is rare.
Thanks so much for the report! Do you have any (PROJ, RUN, CLONE, GEN) info for the WUs that failed this way?
If you're referring to the ones which caused repeated
13:19:26:WU01:FS00:0x23:Folding@home Core Shutdown: BAD_WORK_UNIT
13:19:26:WARNING:WU01:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
I believe those were a dodgy GPU and nothing for you to fix, apart from my machine didn't stop after 10 consecutive failures.
If you mean the EOF problem I'm getting on every machine almost every time, it's not specific work units, it's just every one. My apologies for thinking jcoffland was referring to my other problem earlier, I was half asleep.
Still getting EOF, also HTTP_SERVICE_UNAVAILABLE: {"error":{"message":"Please wait","code":503}} This cycle usually repeats for about 10-20 minutes until a task is finally received.
16:53:11:I1::Added new work unit: cpus:0 gpus:gpu:39:00:00
16:53:11:I1::WU415:Requesting WU assignment for user PeterHucker_GRC_53ed9d9b7d568cb7eb1ccc25a7dc4492 team 224497
16:53:11:I1:OUT108:> POST https://assign2.foldingathome.org/api/assign HTTP/1.1
16:53:11:I3:Connecting to assign2.foldingathome.org:443
16:53:11:I1:OUT108:< assign2.foldingathome.org:443 HTTP/1.1 200 HTTP_OK
16:53:11:I1::WU415:Received WU assignment 3dSPWWRJ6GfuX9T7ak1W3G2sudK5rLyUacCd18W2MFo
16:53:11:I1::WU415:Downloading WU
16:53:12:I1:OUT109:> POST https://vav17.fah.temple.edu/api/assign HTTP/1.1
16:53:12:I3:Connecting to vav17.fah.temple.edu:443
16:53:12:I1:OUT109:< vav17.fah.temple.edu:443 HTTP/1.1 503 HTTP_SERVICE_UNAVAILABLE
16:53:12:E ::WU415:HTTP_SERVICE_UNAVAILABLE: {"error":{"message":"Please wait","code":503}}
16:53:12:I1::WU415:Retry #1 in 2 secs
16:53:14:I1::WU415:Requesting WU assignment for user PeterHucker_GRC_53ed9d9b7d568cb7eb1ccc25a7dc4492 team 224497
16:53:14:I1:OUT110:> POST https://assign3.foldingathome.org/api/assign HTTP/1.1
16:53:14:I3:Connecting to assign3.foldingathome.org:443
16:53:15:I1:OUT110:< assign3.foldingathome.org:443 HTTP/1.1 200 HTTP_OK
16:53:15:I1::WU415:Received WU assignment _QrtWzwt7-5YI30QoDGXYjeXHmQzHCCRvQavCVw8plg
16:53:15:I1::WU415:Downloading WU
16:53:15:I1:OUT111:> POST https://fah01.physik.fu-berlin.de/api/assign HTTP/1.1
16:53:15:I3:Connecting to fah01.physik.fu-berlin.de:443
16:53:15:E ::WU415:Failed response: EOF
16:53:15:I1::WU415:Retry #2 in 4 secs
16:53:19:I1::WU415:Downloading WU
16:53:19:I1:OUT112:> POST https://fah01.physik.fu-berlin.de/api/assign HTTP/1.1
16:53:19:I3:Connecting to fah01.physik.fu-berlin.de:443
16:53:19:E ::WU415:Failed response: EOF
16:53:19:I1::WU415:Retry #3 in 8 secs
16:53:27:I1::WU415:Downloading WU
16:53:27:I1:OUT113:> POST https://fah01.physik.fu-berlin.de/api/assign HTTP/1.1
16:53:27:I3:Connecting to fah01.physik.fu-berlin.de:443
16:53:27:E ::WU415:Failed response: EOF
16:53:27:I1::WU415:Retry #4 in 16 secs
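The retry delays in this log (2, 4, 8, 16 seconds) double each time, which is consistent with exponential backoff. The sketch below only reproduces that pattern. The client's actual retry policy is not shown in the thread, and the `base` and `cap` parameters are assumptions added for illustration.

```python
def retry_delays(attempts, base=2.0, cap=None):
    """Return the delay in seconds before retries 1..attempts, doubling each time.

    `cap` (hypothetical) limits the maximum delay; None means no limit.
    """
    delays = []
    for n in range(1, attempts + 1):
        delay = base * 2 ** (n - 1)
        if cap is not None:
            delay = min(delay, cap)
        delays.append(delay)
    return delays
```

With the defaults, `retry_delays(4)` gives the four delays seen above.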
Is the EOF always from the same server(s)?
Yes, it's always fah01.physik.fu-berlin.de
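To check this from a longer log rather than by eye, one could tally error lines by the host most recently connected to. This is a rough sketch that assumes the v8 log layout shown above (an `E` level marker after the timestamp, and `Connecting to host:port` lines). The function name `errors_by_host` is made up for illustration.

```python
import re
from collections import Counter

# Matches "Connecting to fah01.physik.fu-berlin.de:443" and captures the host.
CONNECT_RE = re.compile(r"Connecting to ([^\s:]+):\d+")
# Matches error-level lines such as "16:53:15:E ::WU415:Failed response: EOF".
ERROR_RE = re.compile(r"^\d\d:\d\d:\d\d:E\s")

def errors_by_host(lines):
    """Count error-level lines, attributing each to the last host connected to."""
    counts = Counter()
    host = None
    for line in lines:
        m = CONNECT_RE.search(line)
        if m:
            host = m.group(1)
        elif ERROR_RE.match(line) and host is not None:
            counts[host] += 1
    return counts
```

Run over the excerpt above, this attributes one error to vav17.fah.temple.edu (the 503) and three to fah01.physik.fu-berlin.de (the EOFs).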
There's a problem with fah01.physik.fu-berlin.de's SSL certificate. Because of this it works with v7 clients which use http but not with v8 clients which use https. I'll email the server's admins.
Thanks, SSL has a lot to answer for. It's also messing up most of the Boinc projects as the certificates expire and people don't notice until everything grinds to a halt. It all worked so well in the past....
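The thread does not say what exactly was wrong with the certificate. Expiry is one possibility, and it is easy to check from any machine with Python. The sketch below uses only the standard `ssl` and `socket` modules. The function names are made up for illustration, and a verification failure from `fetch_not_after` is itself informative.

```python
import socket
import ssl
import time

def days_until_expiry(not_after, now=None):
    """Days until a certificate's notAfter time (negative if already expired).

    `not_after` uses the format returned by getpeercert(),
    for example "Jan  5 09:34:43 2018 GMT".
    """
    if now is None:
        now = time.time()
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400.0

def fetch_not_after(host, port=443, timeout=10.0):
    """Fetch the notAfter field of a server's certificate.

    Raises ssl.SSLCertVerificationError if the certificate does not verify,
    which is what an expired or untrusted certificate would produce.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

For example, `days_until_expiry(fetch_not_after("fah01.physik.fu-berlin.de"))` would report the remaining validity, or raise if the certificate cannot be verified.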
I have an old machine that periodically gets into a bad state where the GPU stops behaving (it just happened again). It's fine for another few weeks after rebooting. The trouble is that, until I notice it misbehaving, it downloads several work units an hour (or presumably a lot more if we didn't have the current server overload problem). Each one fails with BAD_WORK_UNIT shortly after trying to create an OpenCL context or something (I forget the precise wording), then the client immediately gets another, and another, ruining my bonus points and, more importantly, wasting the server's time. Could it stop and somehow warn the user that something is wrong after a few bad ones? "Multiple failures" appearing in the web control where it usually says "running" would be good.
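The requested behaviour amounts to counting consecutive BAD_WORK_UNIT results and pausing once a threshold is reached. Below is a minimal sketch of that idea, working from log lines in either of the two formats quoted in this thread. It is not how the client implements it. The name `should_pause` is hypothetical, and the default of 10 is taken from the figure mentioned in the discussion.

```python
import re

# Matches both "Core returned BAD_WORK_UNIT (114)" and
# "FahCore returned: BAD_WORK_UNIT (114 = 0x72)".
RESULT_RE = re.compile(r"(?:Fah)?Core returned:? (\w+) \((\d+)")

def should_pause(lines, max_failures=10):
    """Return True once max_failures core results in a row are BAD_WORK_UNIT."""
    streak = 0
    for line in lines:
        m = RESULT_RE.search(line)
        if not m:
            continue  # not a core result line
        if m.group(1) == "BAD_WORK_UNIT":
            streak += 1
            if streak >= max_failures:
                return True
        else:
            streak = 0  # any other result breaks the run of failures
    return False
```

A caller could feed this the client log and, when it returns True, pause the slot and show a "Multiple failures" status instead of fetching yet another work unit.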