muziqaz closed this issue 5 months ago.
Please post the log. The v8 client does check for OpenCL support before enabling the GPU. This should be apparent in the log file.
Attachments: clinfo.txt, fahcore22 log.txt, fahcore24 log.txt
I have attached the clinfo output. The fahcore22 log contains the entries logged after I started the slot to get a fahcore22 project WU; the fahcore24 log contains the entries logged after I started the slot to get a fahcore24 project WU.
I don't think we can definitely conclude that this is a v8 core 0x22 problem with detecting OpenCL. It could be a bad WU from a misconfigured project. Further investigation is needed.
Also, I see you're running v8.3.5. Please test with the latest alpha v8.3.16.
Joe, trust me when I say that this is exclusive to v8 on Linux with AMD, and definitely not misconfigured projects: I tried many different projects, and all core22 ones fail in the same fashion. Was the latest alpha announced internally? I don't see it anywhere, and the channel title still says v8.3.5. Also, it would be nice if you popped in there from time to time to read through the discussions that come up every time you release and announce a new version. There are lots of interesting discussions and observations there that never make it to GitHub ;) For example, the Mesa issue that was asked about here a couple of days ago was discussed over there months ago. Likewise, this fahcore22 issue has been mentioned since at least 8.2.3, which I believe was released last year.
If you want issues to be considered they must be filed on Github. Discussing them on Slack is not enough.
Discussions over there are beneficial, because more eyes are on the issues, and sometimes other testers have similar issues that help paint a bigger picture of the problem and what is causing it.
The discussions are beneficial but they should lead to a Github issue.
I found one of your core logs on the server:
Project: 17647 (Run 0, Clone 1, Gen 7)
Reading tar file core.xml
Reading tar file integrator.xml.bz2
Reading tar file state.xml.bz2
Reading tar file system.xml.bz2
Digital signatures verified
Folding@home GPU Core22 Folding@home Core
Version 0.0.20
Checkpoint write interval: 625000 steps (5%) [20 total]
JSON viewer frame write interval: 125000 steps (1%) [100 total]
XTC frame write interval: 250000 steps (2%) [50 total]
Global context and integrator variables write interval: disabled
There are 2 platforms available.
Platform 0: Reference
Platform 1: CPU
opencl-device was set but OpenCL platform could not be found.
ERROR:126: Neither CUDA nor OpenCL is available.
Saving result file ../logfile_01.txt
You are right that 0x22 is not detecting OpenCL correctly. This may have been fixed in the alpha.
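When comparing logs like the one above with a working one, the quickest tell is whether the core enumerates an OpenCL platform at all. Here is a minimal sketch that checks a core log for that; it assumes the "Platform N: NAME" lines shown in the logs in this thread, and the function name is hypothetical.

```shell
#!/bin/sh
# Minimal sketch: does a Core22 log list an OpenCL platform?
# Assumes the "Platform N: NAME" lines seen in the logs in this thread.
core_has_opencl() {
  log="$1"
  # Unanchored match, so it also works if each log line carries a prefix.
  if grep -Eq 'Platform [0-9]+: OpenCL' "$log"; then
    echo "opencl-found"
    return 0
  fi
  # Otherwise report which platforms the core did enumerate.
  platforms=$(grep -E 'Platform [0-9]+: ' "$log" \
    | sed 's/.*Platform [0-9]*: //' | paste -sd ',' -)
  echo "opencl-missing (platforms: ${platforms:-none})"
  return 1
}
```

For the failing log above this would report only Reference and CPU, while the exit status makes it usable in other scripts.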
Yes, that's what I am seeing in one of my attached logs, over here https://github.com/FoldingAtHome/fah-client-bastet/issues/245#issuecomment-2131246258
P.S. The issue title is incorrect; the system is all AMD.
FYI, we are now conducting alpha testing here on Github. I mentioned this discussion channel on our Slack on Feb. 5th. https://github.com/FoldingAtHome/fah-client-bastet/discussions/179
That is not ideal. The latest alpha has the same issue.
Does it work if you delete the supplied libOpenCL.so.1?
sudo rm /var/lib/fah-client/cores/openmm-core-22/fahcore-22-linux-64bit-release-0.0.20/libOpenCL.so.1
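If you'd rather not delete the bundled library outright, here is a minimal sketch that moves it aside so it can be restored later. It assumes, as the suggestion above implies, that without the bundled copy the dynamic linker falls back to a system-installed libOpenCL.so.1 (so one must be installed). The function names are hypothetical.

```shell
#!/bin/sh
# Minimal sketch: move the core's bundled libOpenCL.so.1 aside
# (reversibly) instead of deleting it. The argument is the directory
# holding FahCore_22, e.g. the path in the rm command above.
disable_bundled_opencl() {
  core_dir="$1"
  lib="$core_dir/libOpenCL.so.1"
  if [ -e "$lib" ]; then
    mv "$lib" "$lib.bak" && echo "moved $lib -> $lib.bak"
  else
    echo "no bundled libOpenCL.so.1 in $core_dir"
  fi
}

# Undo the above by putting the backup back in place.
restore_bundled_opencl() {
  core_dir="$1"
  lib="$core_dir/libOpenCL.so.1"
  if [ -e "$lib.bak" ]; then
    mv "$lib.bak" "$lib" && echo "restored $lib"
  fi
}
```

For the path above this needs root, just like the rm command.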
This core and project are running fine on my system but with CUDA. Interestingly, the core still manages to find the OpenCL platform.
Oooh, that's playing with fire, I like it. Trying it right now; I'm a bit rusty at navigating Linux. One sec.
Yes, it seems to work fine on NVIDIA with CUDA, and most of the time on NVIDIA with OpenCL. I can already imagine the nightmare once the AMD HIP (ROCm) fahcore hits the streets later this year.
Removing libOpenCL.so.1 from the fahcore directory did not help.
Here is the ldd output of fahcore22 on that system. Everything seems fine.
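With the bundled loader out of the way, a further check is which OpenCL drivers the system loader can see at all. A minimal sketch, assuming a standard Linux ICD loader that reads *.icd files from /etc/OpenCL/vendors by default, each naming one vendor driver library (some setups override this location). The function name is hypothetical.

```shell
#!/bin/sh
# Minimal sketch: list OpenCL ICD registrations the loader would read.
# Assumes the default /etc/OpenCL/vendors directory unless one is given.
list_icds() {
  vendors_dir="${1:-/etc/OpenCL/vendors}"
  found=0
  for icd in "$vendors_dir"/*.icd; do
    [ -e "$icd" ] || continue   # the glob matched nothing
    found=1
    # Each .icd file holds the name or path of a driver library.
    printf '%s -> %s\n' "$(basename "$icd")" "$(head -n 1 "$icd")"
  done
  [ "$found" -eq 1 ] || echo "no .icd files in $vendors_dir"
}
```

Running it on both the failing and a working system would show whether the registered drivers differ.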
I'm also able to force my NVidia GPU to use OpenCL instead of CUDA by running the core manually on the p17647 WU.
Here is the ldd output for fahcore22, 23 and 24 for comparison. Interestingly, core23 and 24 run fine, even though ldd says they cannot see certain libraries that fahcore23/24 supposedly require and that are provided in the fahcore23/24 directory. In the past (with v7), if ldd fahcore22 told me that the core could not find libOpenMM.so.7.7, it would not run, stating that the OpenMM library could not be found. I'm not sure what kind of magic Hugo (or OpenMM) used, but fahcore23/24 run fine even though they claim (through ldd) that they don't see libOpenMM.so.8.1 (among other things).
fahcore22 was running fine on AMD with v7.6.21 on that system. This is exclusive to v8.
At least for project 17647, I cannot find any instances of Linux with AMD and v7 in the WS logs. There are AMD GPUs running v7 on Windows but only v8 on Linux.
In fact, that WS has no records of any instance of v7 on Linux with an AMD GPU.
The client sets LD_LIBRARY_PATH to the directory where the core exists when running the core. You can get the same effect like this:
LD_LIBRARY_PATH=/var/lib/fah-client/cores/openmm-core-22/fahcore-22-linux-64bit-release-0.0.20 ldd /var/lib/fah-client/cores/openmm-core-22/fahcore-22-linux-64bit-release-0.0.20/FahCore_22
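To compare the two cases in one go, here is a minimal sketch that lists only the unresolved libraries, once without and once with the core directory on LD_LIBRARY_PATH. It assumes ldd's usual "name => not found" output; the helper names are hypothetical.

```shell
#!/bin/sh
# Minimal sketch: print the libraries ldd reports as "not found".
# Reads ldd output on stdin.
missing_libs() {
  # Unresolved entries look like "<name> => not found".
  awk '$2 == "=>" && $3 == "not" && $4 == "found" { print $1 }'
}

# Resolve the core's dependencies without and with the core directory
# on LD_LIBRARY_PATH, mirroring what the client does when it runs a core.
check_core_libs() {
  core_dir="$1"
  core_bin="${2:-FahCore_22}"
  echo "without LD_LIBRARY_PATH:"
  ldd "$core_dir/$core_bin" | missing_libs
  echo "with LD_LIBRARY_PATH=$core_dir:"
  LD_LIBRARY_PATH="$core_dir" ldd "$core_dir/$core_bin" | missing_libs
}
```

An empty list under the second heading means everything resolves the way the client would run the core, which would explain why core23/24 run despite the plain ldd output.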
When I said that core22 works with v7, I meant historically. That Linux system runs only v8, for testing purposes. Internally, I am the only one who runs AMD. And since v8 dumps fahcore22 WUs, I tend not to run anything on it unless we are testing something with core23/24.
Here is an example of core 0x22 working on Linux with an AMD GPU:
*********************** Log Started 2024-05-23T02:48:53Z ***********************
*************************** Core22 Folding@home Core ***************************
Core: Core22
Type: 0x22
Version: 0.0.20
Author: Joseph Coffland <joseph@cauldrondevelopment.com>
Copyright: 2020 foldingathome.org
Homepage: https://foldingathome.org/
Date: Jan 20 2022
Time: 00:57:52
Revision: 3f211b8a4346514edbff34e3cb1c0e0ec951373c
Branch: HEAD
Compiler: GNU 9.4.0
Options: -faligned-new -std=c++11 -fsigned-char -ffunction-sections
-fdata-sections -O3 -funroll-loops -fno-pie
-DOPENMM_VERSION="\"7.7.0\""
Platform: linux 5.11.0-1025-azure
Bits: 64
Mode: Release
Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
<peastman@stanford.edu>
Args: -dir ANx4N0CA0yLb8z1hRv2yWDbf765ibcsZ68O8khAAbRw -suffix 01
-version 8.3.5 -lifeline 4147 -gpu-vendor amd -opencl-platform 0
-opencl-device 0 -gpu 0
************************************ libFAH ************************************
Date: Jan 20 2022
Time: 00:57:22
Revision: 9f4ad694e75c2350d4bb6b8b5b769ba27e483a2f
Branch: HEAD
Compiler: GNU 9.4.0
Options: -faligned-new -std=c++11 -fsigned-char -ffunction-sections
-fdata-sections -O3 -funroll-loops -fno-pie
Platform: linux 5.11.0-1025-azure
Bits: 64
Mode: Release
************************************ CBang *************************************
Date: Jan 20 2022
Time: 00:57:00
Revision: ab023d155b446906d55b0f6c9a1eedeea04f7a1a
Branch: HEAD
Compiler: GNU 9.4.0
Options: -faligned-new -std=c++11 -fsigned-char -ffunction-sections
-fdata-sections -O3 -funroll-loops -fno-pie -fPIC
Platform: linux 5.11.0-1025-azure
Bits: 64
Mode: Release
************************************ System ************************************
CPU: AMD Ryzen 5 1600X Six-Core Processor
CPU ID: AuthenticAMD Family 23 Model 1 Stepping 1
CPUs: 12
Memory: 15.54GiB
Free Memory: 9.59GiB
Threads: POSIX_THREADS
OS Version: 6.9
Has Battery: false
On Battery: false
UTC Offset: 8
PID: 4605
CWD: /var/lib/private/fah/work
************************************ OpenMM ************************************
Version: 7.7.0
********************************************************************************
Project: 12428 (Run 0, Clone 259, Gen 531)
Reading tar file core.xml
Reading tar file integrator.xml
Reading tar file state.xml
Reading tar file system.xml
Digital signatures verified
Folding@home GPU Core22 Folding@home Core
Version 0.0.20
Checkpoint write interval: 50000 steps (2%) [50 total]
JSON viewer frame write interval: 25000 steps (1%) [100 total]
XTC frame write interval: 500000 steps (20%) [5 total]
Global context and integrator variables write interval: disabled
There are 3 platforms available.
Platform 0: Reference
Platform 1: CPU
Platform 2: OpenCL
opencl-device 0 specified
Attempting to create OpenCL context:
Configuring platform OpenCL
Using OpenCL on platformId 0 and gpu 0
Completed 0 out of 2500000 steps (0%)
Checkpoint completed at step 0
Completed 25000 out of 2500000 steps (1%)
I've found more than one such instance. However, I've also found more than one machine failing with a similar configuration on v8.
I searched for WU assignments on Linux with AMD GPU on v7 for user @muziqaz but was unable to find anything in the recent logs. Have you personally run v7 with this setup?
I'm looking for a pair of log files from the same system on Linux with an AMD GPU running the same project. One log file from the v7 client and the other v8.
Here is the same problem occurring on v7:
*********************** Log Started 2022-11-03T16:51:40Z ***********************
*************************** Core22 Folding@home Core ***************************
Core: Core22
Type: 0x22
Version: 0.0.20
Author: Joseph Coffland <joseph@cauldrondevelopment.com>
Copyright: 2020 foldingathome.org
Homepage: https://foldingathome.org/
Date: Jan 20 2022
Time: 00:57:52
Revision: 3f211b8a4346514edbff34e3cb1c0e0ec951373c
Branch: HEAD
Compiler: GNU 9.4.0
Options: -faligned-new -std=c++11 -fsigned-char -ffunction-sections
-fdata-sections -O3 -funroll-loops -fno-pie
-DOPENMM_VERSION="\"7.7.0\""
Platform: linux 5.11.0-1025-azure
Bits: 64
Mode: Release
Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
<peastman@stanford.edu>
Args: -dir 01 -suffix 01 -version 706 -lifeline 29046 -checkpoint 15
-opencl-platform 0 -opencl-device 0 -gpu-vendor amd -gpu 0
-gpu-usage 100
************************************ libFAH ************************************
Date: Jan 20 2022
Time: 00:57:22
Revision: 9f4ad694e75c2350d4bb6b8b5b769ba27e483a2f
Branch: HEAD
Compiler: GNU 9.4.0
Options: -faligned-new -std=c++11 -fsigned-char -ffunction-sections
-fdata-sections -O3 -funroll-loops -fno-pie
Platform: linux 5.11.0-1025-azure
Bits: 64
Mode: Release
************************************ CBang *************************************
Date: Jan 20 2022
Time: 00:57:00
Revision: ab023d155b446906d55b0f6c9a1eedeea04f7a1a
Branch: HEAD
Compiler: GNU 9.4.0
Options: -faligned-new -std=c++11 -fsigned-char -ffunction-sections
-fdata-sections -O3 -funroll-loops -fno-pie -fPIC
Platform: linux 5.11.0-1025-azure
Bits: 64
Mode: Release
************************************ System ************************************
CPU: AMD Ryzen 7 3700X 8-Core Processor
CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
CPUs: 16
Memory: 31.27GiB
Free Memory: 26.07GiB
Threads: POSIX_THREADS
OS Version: 6.0
Has Battery: false
On Battery: false
UTC Offset: 1
PID: 29050
CWD: /var/lib/private/fah/work
************************************ OpenMM ************************************
Version: 7.7.0
********************************************************************************
Project: 18454 (Run 5, Clone 16, Gen 110)
Reading tar file core.xml
Reading tar file integrator.xml
Reading tar file state.xml
Reading tar file system.xml
Digital signatures verified
Folding@home GPU Core22 Folding@home Core
Version 0.0.20
Checkpoint write interval: 100000 steps (2%) [50 total]
JSON viewer frame write interval: 50000 steps (1%) [100 total]
XTC frame write interval: 500000 steps (10%) [10 total]
Global context and integrator variables write interval: disabled
There are 2 platforms available.
Platform 0: Reference
Platform 1: CPU
opencl-device was set but OpenCL platform could not be found.
ERROR:126: Neither CUDA nor OpenCL is available.
Saving result file ../logfile_01.txt
Note that v7 passes something like -version 706 on the command line.
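That difference is enough to sort collected logs into v7 and v8 pairs. A minimal sketch, assuming v8 passes a dotted value such as -version 8.3.5 (as in the Args lines above) and v7 a packed value such as -version 706, which is read here as major = value / 100. The function name is hypothetical.

```shell
#!/bin/sh
# Minimal sketch: guess the client major version from a core log.
# Assumption: dotted "-version 8.3.5" is v8; packed "-version 706"
# is v7, using major = value / 100.
client_major_from_log() {
  log="$1"
  # First "-version X" token anywhere in the file, since the Args
  # line may be wrapped over several lines.
  ver=$(grep -oE -- '-version [0-9.]+' "$log" | head -n 1 | cut -d' ' -f2)
  case "$ver" in
    "") echo "unknown" ;;
    *.*) echo "v${ver%%.*}" ;;   # dotted form, e.g. 8.3.5 -> v8
    *) echo "v$((ver / 100))" ;; # packed form, e.g. 706 -> v7
  esac
}
```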
I have to conclude that this is not a v8 issue. It's a core 0x22 issue on Linux as it occurs on both clients.
OK, leave it at that for a moment, until I check myself out of the madhouse :D
Just a follow-up: core22 is folding on both v7 and v8 on Kubuntu. So it is safe to say that Linux Mint might be the culprit.
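Since the distribution now looks like the variable, it helps to record it next to each log. A minimal sketch, assuming the usual /etc/os-release file, which uses shell-compatible KEY=value lines; the function name is hypothetical.

```shell
#!/bin/sh
# Minimal sketch: print the distribution ID and version from an
# os-release file (default /etc/os-release).
distro_of() {
  file="${1:-/etc/os-release}"
  # Source it in a subshell so its variables do not leak out.
  ( . "$file" && printf '%s %s\n' "${ID:-unknown}" "${VERSION_ID:-unknown}" )
}
```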
Fahcore_22 downloads a new WU and immediately crashes on it, sends it back, downloads another one, crashes again, and so on. Two things here: first, v8 should simply disable the slot if the OpenCL platform is not detected by the fahcore, instead of allowing countless WUs to be downloaded and dumped; second, fahcore_22 does not see OpenCL on Linux AMD. fahcore_22 works fine on other platforms and on v7, and Fahcore_23 works fine on Linux AMD, too. I can trace this from v8.2.3 up to the most recent version; before that, v8 was not able to fold anything on Linux AMD GPUs. clinfo shows all devices and platforms as present.
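The first suggestion amounts to a failure threshold. As an illustration only, here is a hypothetical sketch of that rule applied to a log file: count how often the core reported that no GPU platform was available, and flag the slot for pausing once a threshold is reached. The threshold, the function name, and the idea of scanning a log are assumptions; this is not how the v8 client works.

```shell
#!/bin/sh
# Hypothetical sketch: flag a slot for pausing after repeated
# "Neither CUDA nor OpenCL is available" failures in a log.
# The default threshold of 3 is an arbitrary assumption.
should_pause_slot() {
  log="$1"
  threshold="${2:-3}"
  # grep -c prints 0 (and exits non-zero) when there is no match.
  failures=$(grep -c 'Neither CUDA nor OpenCL is available' "$log")
  if [ "$failures" -ge "$threshold" ]; then
    echo "pause ($failures failures)"
  else
    echo "ok ($failures failures)"
  fi
}
```

Inside the client, the same rule would more naturally count consecutive failures per slot rather than total matches in a file.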