FoldingAtHome / fah-client-bastet

Folding@home client, code named Bastet
GNU General Public License v3.0

Core 0x22 fails to find OpenCL platform on Linux w/ AMD GPU #245

Closed (muziqaz closed 5 months ago)

muziqaz commented 6 months ago

Fahcore_22 downloads a new WU, crashes immediately, sends it back, downloads another one, crashes again, and so on. Two things here. First, v8 should simply disable the slot if the core does not detect an OpenCL platform, instead of allowing countless WUs to be downloaded and dumped. Second, fahcore_22 is not seeing OpenCL on Linux with AMD. fahcore_22 works fine on other platforms and on v7. Fahcore_23 works fine on Linux AMD, too. I can trace it from v8.2.3 up to the most recent version. Before that, v8 was not able to fold anything on Linux AMD GPUs. clinfo shows all devices and platforms present.

jcoffland commented 6 months ago

Please post the log. The v8 client does check for OpenCL support before enabling the GPU. This should be apparent in the log file.

muziqaz commented 6 months ago

clinfo.txt fahcore22 log.txt fahcore24 log.txt

I've attached the clinfo output. The fahcore22 log contains the entries logged after I start the slot to get a fahcore22 project WU; the fahcore24 log contains the entries logged after I start the slot to get a fahcore24 project WU.

jcoffland commented 6 months ago

I don't think we can definitively conclude that this is a v8 core 0x22 problem with detecting OpenCL. It could be a bad WU from a misconfigured project. Further investigation is needed.

jcoffland commented 6 months ago

Also, I see you're running v8.3.5. Please test with the latest alpha v8.3.16.

muziqaz commented 6 months ago

> I don't think we can definitively conclude that this is a v8 core 0x22 problem with detecting OpenCL. It could be a bad WU from a misconfigured project. Further investigation is needed.

Joe, trust me when I say that this is exclusive to v8 on Linux with AMD, and definitely not caused by misconfigured projects: I tried many different projects, and all the core22 ones fail in the same fashion. Was the latest alpha announced internally? I don't see it anywhere. The channel title still says v8.3.5. Also, it would be nice if you popped in there from time to time to read the discussions that show up every time you release and announce a new version. There are lots of interesting discussions and observations that never make it to GitHub ;) Like the Mesa issue that was asked about here a couple of days ago and was discussed months ago over there. Or this issue with fahcore22, which has been mentioned since at least 8.2.3, I believe, which was released last year.

jcoffland commented 6 months ago

If you want issues to be considered they must be filed on Github. Discussing them on Slack is not enough.

muziqaz commented 6 months ago

> If you want issues to be considered they must be filed on Github. Discussing them on Slack is not enough.

Discussions over there are beneficial because you get more eyes on the issues, and sometimes other testers have similar problems, which can paint a bigger picture of the issue and what is causing it.

jcoffland commented 6 months ago

The discussions are beneficial but they should lead to a Github issue.

jcoffland commented 6 months ago

I found one of your core logs on the server:

Project: 17647 (Run 0, Clone 1, Gen 7)
Reading tar file core.xml
Reading tar file integrator.xml.bz2
Reading tar file state.xml.bz2
Reading tar file system.xml.bz2
Digital signatures verified
Folding@home GPU Core22 Folding@home Core
Version 0.0.20
  Checkpoint write interval: 625000 steps (5%) [20 total]
  JSON viewer frame write interval: 125000 steps (1%) [100 total]
  XTC frame write interval: 250000 steps (2%) [50 total]
  Global context and integrator variables write interval: disabled
There are 2 platforms available.
Platform 0: Reference
Platform 1: CPU
opencl-device was set but OpenCL platform could not be found.
ERROR:126: Neither CUDA nor OpenCL is available.
Saving result file ../logfile_01.txt

You are right that 0x22 is not detecting OpenCL correctly. This may have been fixed in the alpha.
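
The check made by eye on the log above can be sketched as a small shell helper. This is only an illustration, not part of the client or the core: the function names `list_platforms` and `has_platform` are hypothetical, and the sketch assumes the core always prints its platform list as `Platform N: Name` lines, as it does in the logs in this thread.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the platform names a core log enumerates,
# one per line. Assumes lines of the form "Platform N: Name".
list_platforms() {
  # $1: path to a core log, e.g. logfile_01.txt
  sed -n -E 's/^[[:space:]]*Platform [0-9]+: ([^[:space:]]+)[[:space:]]*$/\1/p' "$1"
}

# Hypothetical helper: succeed if the log lists the named platform.
has_platform() {
  # $1: path to a core log, $2: platform name, e.g. OpenCL
  list_platforms "$1" | grep -Fxq -- "$2"
}
```

For example, `has_platform logfile_01.txt OpenCL || echo "core did not see OpenCL"` would flag a log like the one above, which lists only Reference and CPU.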

muziqaz commented 6 months ago

> You are right that 0x22 is not detecting OpenCL correctly. This may have been fixed in the alpha.

Yes, that's what I am seeing in one of my attached logs, over here https://github.com/FoldingAtHome/fah-client-bastet/issues/245#issuecomment-2131246258

P.S. issue title is incorrect. System is all AMD

jcoffland commented 6 months ago

FYI, we are now conducting alpha testing here on Github. I mentioned this discussion channel on our Slack on Feb. 5th. https://github.com/FoldingAtHome/fah-client-bastet/discussions/179

muziqaz commented 6 months ago

> FYI, we are now conducting alpha testing here on Github. I mentioned this discussion channel on our Slack on Feb. 5th. #179

That is not ideal. The latest alpha has the same issue.

jcoffland commented 6 months ago

Does it work if you delete the supplied libOpenCL.so.1?

sudo rm /var/lib/fah-client/cores/openmm-core-22/fahcore-22-linux-64bit-release-0.0.20/libOpenCL.so.1

jcoffland commented 6 months ago

This core and project are running fine on my system but with CUDA. Interestingly, the core still manages to find the OpenCL platform.

muziqaz commented 6 months ago

> Does it work if you delete the supplied libOpenCL.so.1?
>
> sudo rm /var/lib/fah-client/cores/openmm-core-22/fahcore-22-linux-64bit-release-0.0.20/libOpenCL.so.1

oooh, that's playing with fire, I like it. Trying right now, bit rusty with navigating linux. One sec

> This core and project are running fine on my system but with CUDA. Interestingly, the core still manages to find the OpenCL platform.

Yes, it seems to work fine on nVidia CUDA, and most of the time on nVidia opencl. I already can imagine the nightmare once AMD HIP (ROCm) fahcore hits the streets later this year

muziqaz commented 6 months ago

Removing libOpenCL.so.1 from the fahcore directory did not help.
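
One further thing that could be compared between a working and a failing system, although nobody in this thread checked it, is the OpenCL ICD registry. On Linux the ICD loader conventionally discovers vendor drivers through `.icd` files in `/etc/OpenCL/vendors`, each naming a driver library. The sketch below only lists those entries; the helper name `list_icds` is hypothetical, and the default directory is an assumption that may not hold on every distribution.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print each ICD file and the driver library it names.
# Assumes the conventional registry directory /etc/OpenCL/vendors unless
# another directory is passed as $1.
list_icds() {
  local dir="${1:-/etc/OpenCL/vendors}" f
  for f in "$dir"/*.icd; do
    [ -e "$f" ] || continue   # the glob matched nothing
    printf '%s: %s\n' "$(basename "$f")" "$(head -n 1 "$f")"
  done
}
```

An empty result from `list_icds` would suggest that no vendor driver is registered where the loader usually looks.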

muziqaz commented 6 months ago

ldd fahcore22.txt

Here is ldd output of fahcore22 on that system. Everything seems fine

jcoffland commented 6 months ago

I'm also able to force my NVidia GPU to use OpenCL instead of CUDA by running the core manually on the p17647 WU.

muziqaz commented 6 months ago

ldd samples

Here is the ldd output for fahcore22, 23 and 24 for comparison. Interestingly, core23 and 24 run fine even though ldd says they cannot see certain libraries that they supposedly require and that are provided in the fahcore23/24 directory. In the past (with v7), if ldd told me that fahcore22 could not find libOpenMM.so.7.7, the core would not run, stating that the openmm library could not be found. I'm not sure what type of magic Hugo (or openmm) used, but fahcore23/24 run fine even when they claim (through "ldd") that they don't see libOpenMM.so.8.1 (among other things).

muziqaz commented 6 months ago

> I'm also able to force my NVidia GPU to use OpenCL instead of CUDA by running the core manually on the p17647 WU.

fahcore22 was running fine on AMD on v7.6.21 on that system. This is v8 exclusive

jcoffland commented 6 months ago

At least for project 17647, I cannot find any instances of Linux with AMD and v7 in the WS logs. There are AMD GPUs running v7 on Windows but only v8 on Linux.

jcoffland commented 6 months ago

In fact, that WS has no records of any instance of v7 on Linux with an AMD GPU.

jcoffland commented 6 months ago

> Here is the ldd output for fahcore22, 23 and 24 for comparison. Interestingly, core23 and 24 run fine even though ldd says they cannot see certain libraries that they supposedly require and that are provided in the fahcore23/24 directory. In the past (with v7), if ldd told me that fahcore22 could not find libOpenMM.so.7.7, the core would not run, stating that the openmm library could not be found. I'm not sure what type of magic Hugo (or openmm) used, but fahcore23/24 run fine even when they claim (through "ldd") that they don't see libOpenMM.so.8.1 (among other things).

The client sets LD_LIBRARY_PATH to the directory where the core exists when running the core. You can get the same effect like this:

LD_LIBRARY_PATH=/var/lib/fah-client/cores/openmm-core-22/fahcore-22-linux-64bit-release-0.0.20 ldd /var/lib/fah-client/cores/openmm-core-22/fahcore-22-linux-64bit-release-0.0.20/FahCore_22
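
The same comparison can be wrapped in two small helpers so that only the unresolved libraries are shown. This is a sketch under assumptions, not something shipped with the client: `ldd_as_client` and `missing_libs` are hypothetical names, and the parsing relies on ldd marking unresolved libraries with `=> not found`.

```shell
#!/usr/bin/env bash

# Hypothetical wrapper: run ldd on a core binary with LD_LIBRARY_PATH set
# to the binary's own directory, mirroring what the client is described
# as doing when it launches a core.
ldd_as_client() {
  local core="$1"
  LD_LIBRARY_PATH="$(dirname "$core")" ldd "$core"
}

# Print the names of libraries that ldd reports as "not found".
# Reads ldd output on stdin, so it also works on a saved ldd log.
missing_libs() {
  awk '/=> not found/ { print $1 }'
}
```

Running `ldd_as_client` on the FahCore_22 path above and piping the result into `missing_libs` should print nothing if every bundled library resolves; running plain `ldd` through `missing_libs` instead shows which libraries only resolve thanks to LD_LIBRARY_PATH.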

muziqaz commented 6 months ago

> In fact, that WS has no records of any instance of v7 on Linux with an AMD GPU.

When I said that core22 works with v7, I meant historically. That Linux system is v8 only for testing purposes. Internally I am the only one who runs AMD. And since v8 is dumping fahcore22 WUs, I tend to not run anything on it, unless we are testing something with core23/24

jcoffland commented 6 months ago

Here is an example of core 0x22 working on Linux with an AMD GPU:

*********************** Log Started 2024-05-23T02:48:53Z ***********************
*************************** Core22 Folding@home Core ***************************
       Core: Core22
       Type: 0x22
    Version: 0.0.20
     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
  Copyright: 2020 foldingathome.org
   Homepage: https://foldingathome.org/
       Date: Jan 20 2022
       Time: 00:57:52
   Revision: 3f211b8a4346514edbff34e3cb1c0e0ec951373c
     Branch: HEAD
   Compiler: GNU 9.4.0
    Options: -faligned-new -std=c++11 -fsigned-char -ffunction-sections
             -fdata-sections -O3 -funroll-loops -fno-pie
             -DOPENMM_VERSION="\"7.7.0\""
   Platform: linux 5.11.0-1025-azure
       Bits: 64
       Mode: Release
Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
             <peastman@stanford.edu>
       Args: -dir ANx4N0CA0yLb8z1hRv2yWDbf765ibcsZ68O8khAAbRw -suffix 01
             -version 8.3.5 -lifeline 4147 -gpu-vendor amd -opencl-platform 0
             -opencl-device 0 -gpu 0
************************************ libFAH ************************************
       Date: Jan 20 2022
       Time: 00:57:22
   Revision: 9f4ad694e75c2350d4bb6b8b5b769ba27e483a2f
     Branch: HEAD
   Compiler: GNU 9.4.0
    Options: -faligned-new -std=c++11 -fsigned-char -ffunction-sections
             -fdata-sections -O3 -funroll-loops -fno-pie
   Platform: linux 5.11.0-1025-azure
       Bits: 64
       Mode: Release
************************************ CBang *************************************
       Date: Jan 20 2022
       Time: 00:57:00
   Revision: ab023d155b446906d55b0f6c9a1eedeea04f7a1a
     Branch: HEAD
   Compiler: GNU 9.4.0
    Options: -faligned-new -std=c++11 -fsigned-char -ffunction-sections
             -fdata-sections -O3 -funroll-loops -fno-pie -fPIC
   Platform: linux 5.11.0-1025-azure
       Bits: 64
       Mode: Release
************************************ System ************************************
        CPU: AMD Ryzen 5 1600X Six-Core Processor
     CPU ID: AuthenticAMD Family 23 Model 1 Stepping 1
       CPUs: 12
     Memory: 15.54GiB
Free Memory: 9.59GiB
    Threads: POSIX_THREADS
 OS Version: 6.9
Has Battery: false
 On Battery: false
 UTC Offset: 8
        PID: 4605
        CWD: /var/lib/private/fah/work
************************************ OpenMM ************************************
    Version: 7.7.0
********************************************************************************
Project: 12428 (Run 0, Clone 259, Gen 531)
Reading tar file core.xml
Reading tar file integrator.xml
Reading tar file state.xml
Reading tar file system.xml
Digital signatures verified
Folding@home GPU Core22 Folding@home Core
Version 0.0.20
  Checkpoint write interval: 50000 steps (2%) [50 total]
  JSON viewer frame write interval: 25000 steps (1%) [100 total]
  XTC frame write interval: 500000 steps (20%) [5 total]
  Global context and integrator variables write interval: disabled
There are 3 platforms available.
Platform 0: Reference
Platform 1: CPU
Platform 2: OpenCL
  opencl-device 0 specified
Attempting to create OpenCL context:
  Configuring platform OpenCL
  Using OpenCL on platformId 0 and gpu 0
Completed 0 out of 2500000 steps (0%)
Checkpoint completed at step 0
Completed 25000 out of 2500000 steps (1%)

I've found more than one such instance. However, I've also found more than one machine failing with a similar configuration on v8.

I searched for WU assignments on Linux with AMD GPU on v7 for user @muziqaz but was unable to find anything in the recent logs. Have you personally run v7 with this setup?

I'm looking for a pair of log files from the same system on Linux with an AMD GPU running the same project. One log file from the v7 client and the other v8.

jcoffland commented 6 months ago

Here is the same problem occurring on v7:

*********************** Log Started 2022-11-03T16:51:40Z ***********************
*************************** Core22 Folding@home Core ***************************
       Core: Core22
       Type: 0x22
    Version: 0.0.20
     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
  Copyright: 2020 foldingathome.org
   Homepage: https://foldingathome.org/
       Date: Jan 20 2022
       Time: 00:57:52
   Revision: 3f211b8a4346514edbff34e3cb1c0e0ec951373c
     Branch: HEAD
   Compiler: GNU 9.4.0
    Options: -faligned-new -std=c++11 -fsigned-char -ffunction-sections
             -fdata-sections -O3 -funroll-loops -fno-pie
             -DOPENMM_VERSION="\"7.7.0\""
   Platform: linux 5.11.0-1025-azure
       Bits: 64
       Mode: Release
Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
             <peastman@stanford.edu>
       Args: -dir 01 -suffix 01 -version 706 -lifeline 29046 -checkpoint 15
             -opencl-platform 0 -opencl-device 0 -gpu-vendor amd -gpu 0
             -gpu-usage 100
************************************ libFAH ************************************
       Date: Jan 20 2022
       Time: 00:57:22
   Revision: 9f4ad694e75c2350d4bb6b8b5b769ba27e483a2f
     Branch: HEAD
   Compiler: GNU 9.4.0
    Options: -faligned-new -std=c++11 -fsigned-char -ffunction-sections
             -fdata-sections -O3 -funroll-loops -fno-pie
   Platform: linux 5.11.0-1025-azure
       Bits: 64
       Mode: Release
************************************ CBang *************************************
       Date: Jan 20 2022
       Time: 00:57:00
   Revision: ab023d155b446906d55b0f6c9a1eedeea04f7a1a
     Branch: HEAD
   Compiler: GNU 9.4.0
    Options: -faligned-new -std=c++11 -fsigned-char -ffunction-sections
             -fdata-sections -O3 -funroll-loops -fno-pie -fPIC
   Platform: linux 5.11.0-1025-azure
       Bits: 64
       Mode: Release
************************************ System ************************************
        CPU: AMD Ryzen 7 3700X 8-Core Processor
     CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
       CPUs: 16
     Memory: 31.27GiB
Free Memory: 26.07GiB
    Threads: POSIX_THREADS
 OS Version: 6.0
Has Battery: false
 On Battery: false
 UTC Offset: 1
        PID: 29050
        CWD: /var/lib/private/fah/work
************************************ OpenMM ************************************
    Version: 7.7.0
********************************************************************************
Project: 18454 (Run 5, Clone 16, Gen 110)
Reading tar file core.xml
Reading tar file integrator.xml
Reading tar file state.xml
Reading tar file system.xml
Digital signatures verified
Folding@home GPU Core22 Folding@home Core
Version 0.0.20
  Checkpoint write interval: 100000 steps (2%) [50 total]
  JSON viewer frame write interval: 50000 steps (1%) [100 total]
  XTC frame write interval: 500000 steps (10%) [10 total]
  Global context and integrator variables write interval: disabled
There are 2 platforms available.
Platform 0: Reference
Platform 1: CPU
opencl-device was set but OpenCL platform could not be found.
ERROR:126: Neither CUDA nor OpenCL is available.
Saving result file ../logfile_01.txt

Note v7 passes something like -version 706 on the command line.

I have to conclude that this is not a v8 issue. It's a core 0x22 issue on Linux as it occurs on both clients.
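
The `-version` argument noted above is enough to sort core logs by the client that launched them. The helper below is a rough sketch: the name `client_generation` is hypothetical, and treating any bare three-digit value starting with 7 (such as the `706` seen above) as v7 is an assumption drawn only from the two logs in this thread.

```shell
#!/usr/bin/env bash

# Hypothetical helper: read a core log on stdin and print "v7", "v8" or
# "unknown" based on the -version value in the Args block.
client_generation() {
  local ver
  ver=$(grep -Eo -- '-version [0-9.]+' | head -n 1 | awk '{ print $2 }')
  case "$ver" in
    8.*)         echo v8 ;;       # dotted version such as 8.3.5
    7[0-9][0-9]) echo v7 ;;       # assumption: bare value such as 706
    *)           echo unknown ;;
  esac
}
```

With this, `client_generation < logfile_01.txt` could be used to group failing and working logs by client when looking for matching v7 and v8 pairs.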

muziqaz commented 6 months ago

OK, leave it at that for a moment, until I check myself out from madhouse :D

muziqaz commented 5 months ago

Just a follow-up: core22 is folding on v7 and v8 on Kubuntu, so it is safe to say Linux Mint might be the culprit.