FoldingAtHome / fah-client-bastet

Folding@home client, code named Bastet
GNU General Public License v3.0

Massive amount of WU Dumped #266

Closed jhitze closed 2 months ago

jhitze commented 3 months ago

How this started

I have a small script that pulls my current score and puts it, with the date, into a CSV so I can watch my points grow. I noticed over the past few days that the points I was gaining each day were about 1/5 of my estimated PPD. After a lot of googling, I found https://apps.foldingathome.org/bonus and saw I was not getting bonus points.
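The script itself isn't shown in this thread; here is a minimal sketch of such a tracker. The `api.foldingathome.org/user/<name>` endpoint and its `score` field are assumptions on my part, not something stated above — adjust to whatever stats source you actually poll.

```python
import csv
import datetime
import json
import urllib.request


def fetch_score(user: str) -> int:
    """Return the user's current score (stats endpoint is an assumption)."""
    url = f"https://api.foldingathome.org/user/{user}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)["score"]


def append_score(path: str, score: int) -> None:
    """Append one 'date,score' row to the CSV."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([datetime.date.today().isoformat(), score])


# Usage (run e.g. daily from cron; hits the network):
#   append_score("points.csv", fetch_score("your-username"))
```

Plotting the day-over-day deltas from that CSV is what makes a sudden PPD drop like this one obvious.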

[Screenshot 2024-08-06 at 10:21:07 PM]

I was shocked to see so many expired WUs! And the low finished percent explains why I'm not getting the bonus.

Log Hunting

I looked through my logs and found this:

14:31:58:I1:WU13570:Sending dump report
14:32:02:I1:WU13571:Sending dump report
14:32:07:I1:WU13572:Sending dump report
14:32:11:I1:WU13573:Sending dump report
14:32:16:I1:WU13574:Sending dump report
14:32:20:I1:WU13575:Sending dump report
14:32:26:I1:WU13576:Sending dump report
14:32:26:I1:WU13569:Sending dump report
14:32:30:I1:WU13577:Sending dump report
14:32:39:I1:WU13579:Sending dump report
14:32:45:I1:WU13580:Sending dump report
14:32:51:I1:WU13581:Sending dump report
14:32:58:I1:WU13582:Sending dump report
14:33:07:I1:WU13583:Sending dump report
14:33:12:I1:WU13584:Sending dump report
14:33:17:I1:WU13585:Sending dump report
14:33:22:I1:WU13586:Sending dump report
14:33:27:I1:WU13587:Sending dump report
14:33:27:I1:WU13578:Sending dump report
14:33:31:I1:WU13588:Sending dump report
14:33:37:I1:WU13590:Sending dump report
14:33:41:I1:WU13591:Sending dump report
14:33:43:I1:WU13589:Sending dump report
14:33:45:I1:WU13592:Sending dump report
14:33:51:I1:WU13594:Sending dump report
14:33:55:I1:WU13595:Sending dump report
14:34:00:I1:WU13596:Sending dump report
14:34:05:I1:WU13597:Sending dump report
14:34:10:I1:WU13598:Sending dump report
14:34:17:I1:WU13599:Sending dump report
14:34:22:I1:WU13600:Sending dump report
14:34:27:I1:WU13601:Sending dump report
14:34:32:I1:WU13602:Sending dump report
14:34:37:I1:WU13603:Sending dump report
14:34:42:I1:WU13604:Sending dump report
14:34:47:I1:WU13605:Sending dump report
14:34:52:I1:WU13606:Sending dump report
14:34:57:I1:WU13607:Sending dump report
14:35:01:I1:WU13608:Sending dump report
14:35:06:I1:WU13609:Sending dump report
14:35:11:I1:WU13610:Sending dump report
14:35:16:I1:WU13611:Sending dump report
14:35:21:I1:WU13612:Sending dump report
14:35:21:I1:WU13593:Sending dump report
14:35:25:I1:WU13613:Sending dump report
14:35:31:I1:WU13615:Sending dump report
14:35:37:I1:WU13616:Sending dump report
14:35:38:I1:WU13614:Sending dump report
14:35:45:I1:WU13617:Sending dump report
14:35:55:I1:WU13619:Sending dump report
14:36:00:I1:WU13620:Sending dump report
15:17:46:I1:WU13622:Sending dump report
15:59:06:I1:WU13624:Sending dump report
15:59:10:I1:WU13625:Sending dump report
15:59:15:I1:WU13626:Sending dump report
15:59:20:I1:WU13627:Sending dump report
15:59:26:I1:WU13628:Sending dump report
15:59:32:I1:WU13629:Sending dump report
15:59:37:I1:WU13630:Sending dump report
16:41:01:I1:WU13632:Sending dump report
17:22:23:I1:WU13634:Sending dump report
17:22:29:I1:WU13635:Sending dump report
17:22:31:I1:WU13618:Sending dump report
17:22:34:I1:WU13636:Sending dump report
17:22:41:I1:WU13638:Sending dump report
17:22:45:I1:WU13639:Sending dump report
18:04:03:I1:WU13641:Sending dump report
18:04:08:I1:WU13642:Sending dump report
18:04:13:I1:WU13643:Sending dump report
18:04:17:I1:WU13644:Sending dump report
18:04:22:I1:WU13645:Sending dump report
18:04:30:I1:WU13646:Sending dump report
18:45:53:I1:WU13648:Sending dump report
20:08:38:I1:WU13651:Sending dump report
20:50:03:I1:WU13653:Sending dump report
20:50:09:I1:WU13654:Sending dump report
20:50:13:I1:WU13655:Sending dump report
20:50:18:I1:WU13656:Sending dump report
21:31:37:I1:WU13658:Sending dump report
21:31:41:I1:WU13659:Sending dump report
21:31:46:I1:WU13660:Sending dump report
23:35:34:I1:WU13664:Sending dump report
23:35:38:I1:WU13665:Sending dump report
00:16:59:I1:WU13667:Sending dump report
00:17:05:I1:WU13668:Sending dump report
00:17:09:I1:WU13669:Sending dump report
00:58:26:I1:WU13671:Sending dump report
00:58:31:I1:WU13672:Sending dump report
00:58:36:I1:WU13673:Sending dump report
00:58:42:I1:WU13674:Sending dump report
00:58:47:I1:WU13675:Sending dump report
00:58:52:I1:WU13676:Sending dump report
00:58:59:I1:WU13677:Sending dump report
00:59:03:I1:WU13678:Sending dump report
00:59:08:I1:WU13679:Sending dump report
00:59:13:I1:WU13680:Sending dump report
00:59:17:I1:WU13681:Sending dump report
00:59:22:I1:WU13682:Sending dump report
00:59:28:I1:WU13683:Sending dump report
00:59:32:I1:WU13684:Sending dump report
00:59:37:I1:WU13685:Sending dump report
00:59:42:I1:WU13686:Sending dump report

As I'm writing this, it finished a WU, then did 13 more dumps. Here's an example of a WU that was dumped.

02:23:11:I1:WU13700:Requesting WU assignment for user {redacted} team {redacted}
02:23:11:I1:WU13700:Received WU assignment PAgNOMIoj7AF-xwOx2onO-oeMgxg-iKDQPbNhdad-tg
02:23:11:I1:WU13700:Downloading WU
02:23:14:I1:WU13700:DOWNLOAD 100% 39.52MiB of 39.52MiB
02:23:14:I1:WU13700:Received WU P18228 R147 C1 G12
02:23:15:I3:WU13700:Running FahCore: /config/cores/openmm-core-23/centos-7.9.2009-64bit/release/fahcore-23-centos-7.9.2009-64bit-release-8.0.3/FahCore_23 -dir PAgNOMIoj7AF-xwOx2onO-oeMgxg-iKDQPbNhdad-tg -suffix 01 -version 8.3.18 -lifeline 174 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-platform 0 -cuda-device 0 -gpu 0
02:23:15:I3:WU13700:Started FahCore on PID 12571
02:23:16:W :WU13700:Core returned WU_STALLED (127)
02:23:16:I1:WU13700:Sending dump report
02:23:16:I1:WU13700:Dumped

Here's another WU that was 'interrupted'; somehow the cores are being killed:

02:22:50:I1:WU13696:Requesting WU assignment for user {redacted} team {redacted}
02:22:50:I1:WU13696:Received WU assignment y1L3hc2o5tHtdU2QJkbAMnKYKIi82TqvlgFwbqxBHg8
02:22:50:I1:WU13696:Downloading WU
02:22:51:I1:WU13696:DOWNLOAD 100% 5.79MiB of 5.79MiB
02:22:51:I1:WU13696:Received WU P19227 R4611 C2 G1
02:22:51:I3:WU13696:Running FahCore: /config/cores/fahcore-a8-lin-64bit-avx2_256-0.0.12/FahCore_a8 -dir y1L3hc2o5tHtdU2QJkbAMnKYKIi82TqvlgFwbqxBHg8 -suffix 01 -version 8.3.18 -lifeline 174 -np 6
02:22:51:I3:WU13696:Started FahCore on PID 12550
02:22:51:I1:WU13696:*********************** Log Started 2024-08-07T02:22:51Z ***********************
02:22:51:I1:WU13696:************************** Gromacs Folding@home Core ***************************
02:22:51:I1:WU13696: Core: Gromacs
02:22:51:I1:WU13696: Type: 0xa8
02:22:51:I1:WU13696: Version: 0.0.12
02:22:51:I1:WU13696: Author: Joseph Coffland <joseph@cauldrondevelopment.com>
02:22:51:I1:WU13696: Copyright: 2020 foldingathome.org
02:22:51:I1:WU13696: Homepage: https://foldingathome.org/
02:22:51:I1:WU13696: Date: Jan 16 2021
02:22:51:I1:WU13696: Time: 19:24:44
02:22:51:I1:WU13696: Compiler: GNU 8.3.0
02:22:51:I1:WU13696: Options: -faligned-new -std=c++14 -fsigned-char -ffunction-sections
02:22:51:I1:WU13696: -fdata-sections -O3 -funroll-loops -fno-pie
02:22:51:I1:WU13696: Platform: linux2 4.15.0-128-generic
02:22:51:I1:WU13696: Bits: 64
02:22:51:I1:WU13696: Mode: Release
02:22:51:I1:WU13696: SIMD: avx2_256
02:22:51:I1:WU13696: OpenMP: ON
02:22:51:I1:WU13696: CUDA: OFF
02:22:51:I1:WU13696: Args: -dir y1L3hc2o5tHtdU2QJkbAMnKYKIi82TqvlgFwbqxBHg8 -suffix 01
02:22:51:I1:WU13696: -version 8.3.18 -lifeline 174 -np 6
02:22:51:I1:WU13696:************************************ libFAH ************************************
02:22:51:I1:WU13696: Date: Jan 16 2021
02:22:51:I1:WU13696: Time: 19:21:38
02:22:51:I1:WU13696: Compiler: GNU 8.3.0
02:22:51:I1:WU13696: Options: -faligned-new -std=c++14 -fsigned-char -ffunction-sections
02:22:51:I1:WU13696: -fdata-sections -O3 -funroll-loops -fno-pie
02:22:51:I1:WU13696: Platform: linux2 4.15.0-128-generic
02:22:51:I1:WU13696: Bits: 64
02:22:51:I1:WU13696: Mode: Release
02:22:51:I1:WU13696:************************************ CBang *************************************
02:22:51:I1:WU13696: Date: Jan 16 2021
02:22:51:I1:WU13696: Time: 19:21:24
02:22:51:I1:WU13696: Compiler: GNU 8.3.0
02:22:51:I1:WU13696: Options: -faligned-new -std=c++14 -fsigned-char -ffunction-sections
02:22:51:I1:WU13696: -fdata-sections -O3 -funroll-loops -fno-pie -fPIC
02:22:51:I1:WU13696: Platform: linux2 4.15.0-128-generic
02:22:51:I1:WU13696: Bits: 64
02:22:51:I1:WU13696: Mode: Release
02:22:51:I1:WU13696:************************************ System ************************************
02:22:51:I1:WU13696: CPU: AMD Ryzen 5 5600G with Radeon Graphics
02:22:51:I1:WU13696: CPU ID: AuthenticAMD Family 25 Model 80 Stepping 0
02:22:51:I1:WU13696: CPUs: 12
02:22:51:I1:WU13696: Memory: 30.74GiB
02:22:51:I1:WU13696:Free Memory: 5.25GiB
02:22:51:I1:WU13696: Threads: POSIX_THREADS
02:22:51:I1:WU13696: OS Version: 6.1
02:22:51:I1:WU13696:Has Battery: false
02:22:51:I1:WU13696: On Battery: false
02:22:51:I1:WU13696: UTC Offset: -4
02:22:51:I1:WU13696: PID: 12550
02:22:51:I1:WU13696: CWD: /config/work
02:22:51:I1:WU13696:********************************************************************************
02:22:51:I1:WU13696:Project: 19227 (Run 4611, Clone 2, Gen 1)
02:22:51:I1:WU13696:Unit: 0x00000000000000000000000000000000
02:22:51:I1:WU13696:Reading tar file core.xml
02:22:51:I1:WU13696:Reading tar file md1.tpr
02:22:52:I1:WU13696:Caught signal SIGINT(2) on PID 12550
02:22:52:I1:WU13696:Exiting, please wait. . .
02:22:52:I1:WU13696:Digital signatures verified
02:22:52:I1:WU13696:Calling: mdrun -c md1.gro -s md1.tpr -x md1.xtc -cpt 5 -nt 6 -ntmpi 1
02:22:52:I1:WU13696:Steps: first=500000 total=1000000
02:23:00:I1:WU13696:Completed 1 out of 500000 steps (0%)
02:23:01:I1:WU13696:Folding@home Core Shutdown: INTERRUPTED
02:23:01:I1:WU13696:Core returned INTERRUPTED (102)
02:23:01:I3:WU13696:Running FahCore: /config/cores/fahcore-a8-lin-64bit-avx2_256-0.0.12/FahCore_a8 -dir y1L3hc2o5tHtdU2QJkbAMnKYKIi82TqvlgFwbqxBHg8 -suffix 01 -version 8.3.18 -lifeline 174 -np 5
02:23:01:I3:WU13696:Started FahCore on PID 12560
02:23:01:I1:WU13696:*********************** Log Started 2024-08-07T02:23:01Z ***********************
02:23:01:I1:WU13696:************************** Gromacs Folding@home Core ***************************
02:23:01:I1:WU13696: Core: Gromacs
02:23:01:I1:WU13696: Type: 0xa8
02:23:01:I1:WU13696: Version: 0.0.12
02:23:01:I1:WU13696: Author: Joseph Coffland <joseph@cauldrondevelopment.com>
02:23:01:I1:WU13696: Copyright: 2020 foldingathome.org
02:23:01:I1:WU13696: Homepage: https://foldingathome.org/
02:23:01:I1:WU13696: Date: Jan 16 2021
02:23:01:I1:WU13696: Time: 19:24:44
02:23:01:I1:WU13696: Compiler: GNU 8.3.0
02:23:01:I1:WU13696: Options: -faligned-new -std=c++14 -fsigned-char -ffunction-sections
02:23:01:I1:WU13696: -fdata-sections -O3 -funroll-loops -fno-pie
02:23:01:I1:WU13696: Platform: linux2 4.15.0-128-generic
02:23:01:I1:WU13696: Bits: 64
02:23:01:I1:WU13696: Mode: Release
02:23:01:I1:WU13696: SIMD: avx2_256
02:23:01:I1:WU13696: OpenMP: ON
02:23:01:I1:WU13696: CUDA: OFF
02:23:01:I1:WU13696: Args: -dir y1L3hc2o5tHtdU2QJkbAMnKYKIi82TqvlgFwbqxBHg8 -suffix 01
02:23:01:I1:WU13696: -version 8.3.18 -lifeline 174 -np 5
02:23:01:I1:WU13696:************************************ libFAH ************************************
02:23:01:I1:WU13696: Date: Jan 16 2021
02:23:01:I1:WU13696: Time: 19:21:38
02:23:01:I1:WU13696: Compiler: GNU 8.3.0
02:23:01:I1:WU13696: Options: -faligned-new -std=c++14 -fsigned-char -ffunction-sections
02:23:01:I1:WU13696: -fdata-sections -O3 -funroll-loops -fno-pie
02:23:01:I1:WU13696: Platform: linux2 4.15.0-128-generic
02:23:01:I1:WU13696: Bits: 64
02:23:01:I1:WU13696: Mode: Release
02:23:01:I1:WU13696:************************************ CBang *************************************
02:23:01:I1:WU13696: Date: Jan 16 2021
02:23:01:I1:WU13696: Time: 19:21:24
02:23:01:I1:WU13696: Compiler: GNU 8.3.0
02:23:01:I1:WU13696: Options: -faligned-new -std=c++14 -fsigned-char -ffunction-sections
02:23:01:I1:WU13696: -fdata-sections -O3 -funroll-loops -fno-pie -fPIC
02:23:01:I1:WU13696: Platform: linux2 4.15.0-128-generic
02:23:01:I1:WU13696: Bits: 64
02:23:01:I1:WU13696: Mode: Release
02:23:01:I1:WU13696:************************************ System ************************************
02:23:01:I1:WU13696: CPU: AMD Ryzen 5 5600G with Radeon Graphics
02:23:01:I1:WU13696: CPU ID: AuthenticAMD Family 25 Model 80 Stepping 0
02:23:01:I1:WU13696: CPUs: 12
02:23:01:I1:WU13696: Memory: 30.74GiB
02:23:01:I1:WU13696:Free Memory: 5.17GiB
02:23:01:I1:WU13696: Threads: POSIX_THREADS
02:23:01:I1:WU13696: OS Version: 6.1
02:23:01:I1:WU13696:Has Battery: false
02:23:01:I1:WU13696: On Battery: false
02:23:01:I1:WU13696: UTC Offset: -4
02:23:01:I1:WU13696: PID: 12560
02:23:01:I1:WU13696: CWD: /config/work
02:23:01:I1:WU13696:********************************************************************************
02:23:01:I1:WU13696:Project: 19227 (Run 4611, Clone 2, Gen 1)
02:23:01:I1:WU13696:Unit: 0x00000000000000000000000000000000
02:23:01:I1:WU13696:Digital signatures verified
02:23:01:I1:WU13696:Calling: mdrun -c md1.gro -s md1.tpr -x md1.xtc -cpi state.cpt -cpt 5 -nt 5 -ntmpi 1
02:23:01:I1:WU13696:Steps: first=500000 total=1000000
02:23:06:I1:WU13696:Caught signal SIGINT(2) on PID 12560
02:23:06:I1:WU13696:Exiting, please wait. . .
02:23:10:I1:WU13696:Completed 12 out of 500000 steps (0%)
02:23:10:I1:WU13696:Folding@home Core Shutdown: INTERRUPTED
02:23:11:I1:WU13696:Core returned INTERRUPTED (102)
02:23:11:I3:WU13696:Running FahCore: /config/cores/fahcore-a8-lin-64bit-avx2_256-0.0.12/FahCore_a8 -dir y1L3hc2o5tHtdU2QJkbAMnKYKIi82TqvlgFwbqxBHg8 -suffix 01 -version 8.3.18 -lifeline 174 -np 5
02:23:11:I3:WU13696:Started FahCore on PID 12570
02:23:11:W :WU13696:Core was killed
02:23:11:W :WU13696:Core returned FAILED_1 (0)
02:23:11:I1:WU13696:Sending dump report
02:23:11:I1:WU13696:Dumped

Machine Info

I have it set to fold using 6 CPU cores and the GPU. The stalled WUs seem to happen only on the GPU.

[Screenshot 2024-08-06 at 10:28:25 PM]

Plea

I don't know what to do. Would someone please help?

muziqaz commented 3 months ago

Your system is possibly unstable. Your last snippet of the log shows a CPU WU being killed and reported as failed, so it's not just the GPU. Please make sure your system is stable. Also, are you sure you are folding on only the CPU and the nVidia GPU? Maybe your CPU's iGPU is picking up WUs, and since it possibly has no OpenCL platform, the client fails them immediately and dumps them. The CPU WUs are most likely failing because of instability. If you are sure you are not folding on the iGPU, then are you sure your nVidia GPU has the required drivers and can create a CUDA context to fold? Full system specs are always required in these kinds of inquiries. Full logs are also helpful, as is a screenshot of the main page of the web control UI.

jhitze commented 3 months ago

Please make sure your system is stable.

I've been running F@H for several months with no issues.

I went through my older logs.

grep -r -c "WU_STALLED" ./*
./log-20240403-200138.txt:2
./log-20240425-074155.txt:0
./log-20240425-084825.txt:0
./log-20240429-010320.txt:0
./log-20240429-122529.txt:0
./log-20240502-124030.txt:0
./log-20240502-175727.txt:0
./log-20240503-053423.txt:0
./log-20240503-121240.txt:0
./log-20240504-161733.txt:1
./log-20240504-164751.txt:0
./log-20240530-080213.txt:2
./log-20240530-080852.txt:0
./log-20240530-082822.txt:0
./log-20240531-072223.txt:0
./log-20240601-072430.txt:0
./log-20240723-023534.txt:0
./log-20240723-071846.txt:0
./log-20240723-235958.txt:0
./log-20240724-071854.txt:0
./log-20240724-235959.txt:0
./log-20240725-071912.txt:0
./log-20240726-013134.txt:0
./log-20240726-013313.txt:0
./log-20240726-014553.txt:0
./log-20240726-014859.txt:0
./log-20240726-020403.txt:0
./log-20240726-021628.txt:0
./log-20240726-022027.txt:0
./log-20240726-023104.txt:0
./log-20240726-024043.txt:0
./log-20240726-024214.txt:0
./log-20240726-034457.txt:0
./log-20240726-035118.txt:0
./log-20240726-040110.txt:0
./log-20240726-040206.txt:0
./log-20240726-041234.txt:0
./log-20240726-235959.txt:621
./log-20240727-042610.txt:227
./log-20240727-235958.txt:479
./log-20240728-042705.txt:162
./log-20240728-235958.txt:1469
./log-20240729-042758.txt:240
./log-20240729-235958.txt:668
./log-20240730-042605.txt:109
./log-20240730-235958.txt:349
./log-20240731-042803.txt:22
./log-20240731-235958.txt:1486
./log-20240801-042753.txt:387
./log-20240801-235959.txt:974
./log-20240802-042748.txt:34
./log-20240802-235958.txt:195
./log-20240803-042822.txt:60
./log-20240803-235958.txt:287
./log-20240804-042814.txt:136
./log-20240804-181048.txt:824
./log-20240804-235957.txt:76
./log-20240805-042818.txt:412
./log-20240805-235959.txt:1787
./log-20240806-000000.txt:1
./log-20240806-042810.txt:129
./log-20240806-235959.txt:1168
./log-20240807-041142.txt:43

These log entries lead me to believe it is not the CPU. On August 26, I updated to 8.3 and added a 4060ti to the system.

Also are you sure you are folding on only a CPU and nVidia GPU?

That's what the v8.3 UI is saying. The iGPU is AMD and the GPU is nvidia. I only have the nvidia driver installed; the f@h app has never considered the iGPU available for use.

Full system specs is always required in these kind of inquiries. Full logs are also helpful, as well as screenshot of the main page of web control UI

Here's another screenshot, including the iGPU that isn't supported.

[Screenshot 2024-08-07 at 1:05:08 PM]

After I made this issue, I tried upgrading the GPU driver from v550 to v560; that did not make a difference. I then disabled folding on the 4060ti. The client is still showing a bunch of WU_STALLED errors (you can see them in the grep output above).

I have plex, ghost, a minecraft server, home assistant, and mysql all running on this machine and none of them are having any CPU issues.

Later today I'm going to take out the 4060ti from the system and see if v8.3 stalls out again.

jcoffland commented 3 months ago

The core that failed with INTERRUPTED in the example log you show above is 0xa8. I think you may have actually paused this WU. It shouldn't have dumped the WU, but I think that was caused by a pausing bug in v8 that was recently fixed.

The other issue, WU_STALLED with core 0x23, normally indicates that the science code in the core is not responding, i.e. not giving any updates to the core wrapper code. But it is happening after only 1 second of runtime, so there hasn't been sufficient time for the core to actually stall. Likely it just happened to return the error code 127.

Since this started happening regularly when you installed the 4060ti, there's a good chance that is the culprit. I assume you mean you installed it on July 26th, not August.

It's interesting that core 0x23 doesn't actually print anything before it crashes. Also, it says Core was killed, which means some other process killed the core. It is likely your Linux system killing the process. Check your system logs for OOM (Out of Memory) entries.
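For anyone following along, a sketch of that OOM check (tool availability varies by distro; these are standard kernel-log queries, not something specific to the F@H client):

```shell
# Search the kernel log for OOM-killer activity; a hit names the killed
# process and its memory usage. grep exits 1 when nothing matches, hence
# the trailing "|| true".
dmesg 2>/dev/null | grep -iE 'out of memory|oom-killer|killed process' || true

# On systemd-based distros the kernel ring buffer is also in the journal:
journalctl -k --no-pager 2>/dev/null \
  | grep -iE 'out of memory|oom-killer|killed process' || true
```

No output means the kernel never OOM-killed anything since boot; a line mentioning FahCore would point straight at the culprit.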

muziqaz commented 3 months ago

Please state full system specs here. Attach a few logs that contain the stalled issue, plus OS details.

The fact that you have been folding stably for months does not mean that things can't break over time. FAH is extremely resource heavy. Poorly built or poorly cooled GPUs can die over time. Your log snippets in the original comment show your CPU crapping out. It is also possible that some OS update broke things; it is not uncommon in the Linux world for a distro update to brick the OS. Have you considered launching a Windows VM to see how that folds on the CPU?

jhitze commented 2 months ago

Things you asked for

Hardware/Software

# System Specs

| Item | Description |
| --- | --- |
| CPU | Ryzen 5600G |
| Motherboard | ASRock x570s |
| RAM | G.SKILL RipJaws V Series 32gb (2x16gb) DDR4 |
| GPU | RTX 4060TI |
| Power | EVGA 750w BQ |
| Hard drives | 5x WD NAS 6tb |
| SSD | 1tb Samsung 870 EVO |

# Software

| Item | Description |
| --- | --- |
| OS | unraid 6.12.11 |
| GPU driver | nVidia v560.31.02 |

# Temps

| Item | Description |
| --- | --- |
| CPU | 90C (95C is the thermal throttle limit for this CPU) |
| GPU | 65C |

Logs

[log-20240804-235957-sanitized.txt](https://github.com/user-attachments/files/16683079/log-20240804-235957-sanitized.txt)

My debugging

Each of these steps was run for ~24 hours.

  1. Disabled folding on GPU - problem still stayed w/CPU (5 cores)
  2. Reduced CPU cores - problem still stayed w/CPU (2 cores)
  3. Upgraded GPU driver from v550 to v560, reenabled GPU - problem still stayed with both (6 CPU, 1GPU)
  4. Removed GPU from system - problem still stayed w/CPU (5 cores)
  5. Upgraded FaH container to 8.3.18-ls138 - no problems
  6. Added GPU back in, but not enabled - no problems
  7. Enabled GPU - no problems
  8. I did not like having my CPU temps so high, so I put in a new cooler last night and it's now running about 63C with 6CPU, 1GPU. - still no problems

Research

I looked into why this one update would fix it. It turns out that the package libexpat1 is needed. I looked into bastet, and libexpat1 is not in the .deb. It is, however, required as part of the scons package used for development.

I looked through my logs and found this:

# Cores that had WU_STALLED

       5 /app/usr/bin/FAHCoreWrapper
    1215 /config/cores/fahcore-a8-lin-64bit-avx2_256-0.0.12/FahCore_a8
   11357 /config/cores/openmm-core-23/centos-7.9.2009-64bit/release/fahcore-23-centos-7.9.2009-64bit-release-8.0.3/FahCore_23
# All cores:

     209 Running FahCore: /app/usr/bin/FAHCoreWrapper /config/cores/cores.foldingathome.org/lin/64bit-avx2-256/a8-0.0.12/Core_a8.fah/
   10769 Running FahCore: /config/cores/fahcore-a8-lin-64bit-avx2_256-0.0.12/
     362 Running FahCore: /config/cores/gromacs-core-a9/debian-stable-64bit/cpu-avx2_256-release/fahcore-a9-debian-stable-64bit-cpu-avx2_256-release-0.0.12/
     251 Running FahCore: /config/cores/openmm-core-22/fahcore-22-linux-64bit-release-0.0.20/
   12575 Running FahCore: /config/cores/openmm-core-23/centos-7.9.2009-64bit/release/fahcore-23-centos-7.9.2009-64bit-release-8.0.3/

This leads me to believe that not all of the cores are having issues, which points to a software issue.
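The exact commands behind these counts aren't shown; a pipeline along these lines (a sketch, assuming the log format seen earlier in this thread) produces per-core tallies like the above:

```shell
# Extract the core binary path from every "Running FahCore:" line across
# all logs, then count occurrences per binary (uniq -c prefixes each
# distinct path with its count, like the output above).
grep -h 'Running FahCore:' log-*.txt 2>/dev/null \
  | sed -E 's|.*Running FahCore: ([^ ]+).*|\1|' \
  | sort | uniq -c | sort -n
```

Swapping the grep pattern for `WU_STALLED` and joining on WU number would give the per-core failure counts.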

Conclusion

My guess is that the problematic cores do some XML parsing for the GPU path, which cbang abstracts away. Because cbang abstracts it, the calls into the .so aren't detected as a dependency. The call then causes a segfault, which isn't caught. That would explain the sudden demise of the cores in the logs.

Since the local dev environments have libexpat1 because of scons, none of the developers would run into this issue.

Let me know if you'd like me to submit a PR for this fix. I'm also happy to help dockerize and create build/test/release pipelines; that's what I do for a day job. 🙂

Thanks for your patience!

jcoffland commented 2 months ago

Thank you for going to all this trouble.

Core 0x23 is linked to /lib/x86_64-linux-gnu/libexpat.so.1 but 0xa8 is not. It is more likely a GLIBC mismatch.
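For readers who want to check this on their own machine, `ldd` lists a binary's dynamic dependencies (the core path below is taken from the logs earlier in this thread; the grep pattern is just a convenience):

```shell
# Any "not found" line is a shared library the system is missing;
# the expat line confirms whether this core links libexpat at all.
ldd /config/cores/openmm-core-23/centos-7.9.2009-64bit/release/fahcore-23-centos-7.9.2009-64bit-release-8.0.3/FahCore_23 2>/dev/null \
  | grep -E 'expat|not found' || true
```

Running the same command against FahCore_a8 is a quick way to see that it has no expat dependency.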

jcoffland commented 2 months ago

A system lacking libexpat.so.1 will definitely fail core 0x23. I'm adding libexpat1 as an install dependency for v8 to work around this.

Core 0xa8 must have failed for some other reason. The failure rate in your logs is about 11% for 0xa8. Are the failures all in a group or spread out? I would be interested to see timestamps on the 0xa8 failures.

Also, if you see FAHCoreWrapper in the command then you are looking at v7 logs.

jcoffland commented 2 months ago

Also, if you run this:

$ cores/openmm-core-23/centos-7.9.2009-64bit/release/fahcore-23-centos-7.9.2009-64bit-release-8.0.3/FahCore_23
cores/openmm-core-23/centos-7.9.2009-64bit/release/fahcore-23-centos-7.9.2009-64bit-release-8.0.3/FahCore_23: error while loading shared libraries: libOpenMM.so.8.0: cannot open shared object file: No such file or directory
$ echo $?
127

You can see that the return code when a core fails to find a library is 127, which unfortunately is the same as our WU_STALLED return code. So the client cannot tell the difference and reports a missing library as WU_STALLED. Interestingly, this is also the same return code you get for "command not found" in Linux.
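That overloaded status is easy to reproduce in any shell (a minimal demonstration, unrelated to the client itself):

```shell
# POSIX shells return 127 for "command not found" -- the same status the
# dynamic-loader failure above produces, and the same value the client
# interprets as WU_STALLED.
bash -c 'this-command-does-not-exist' 2>/dev/null
echo "exit status: $?"   # prints: exit status: 127
```

So any failure mode that exits 127 before the core prints anything gets misreported as a stall.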

jcoffland commented 2 months ago

Addressed by several changes in v8.4.4.

jhitze commented 2 months ago

@jcoffland would it be possible to get my bonus turned back on? I'm at 56.45% "finished". I'm getting about 1.2m ppd currently, but the app is estimating near 7 million.

jcoffland commented 2 months ago

Normally no. I could make an exception if I knew your user name.