Closed: jhitze closed this issue 2 months ago
Your system is possibly unstable. Your last snip of the log shows that a CPU WU is being killed and reported as failed, not just a GPU one. Please make sure your system is stable. Also are you sure you are folding on only a CPU and nVidia GPU? Maybe your CPU's iGPU is picking up WUs, and since it possibly has no OpenCL platform, the client fails them immediately and dumps them. The CPU is failing because it is most likely unstable. If you are sure you are not folding on the iGPU, then are you sure your nVidia GPU has the required drivers and can create a CUDA context to fold? Full system specs are always required in these kinds of inquiries. Full logs are also helpful, as well as a screenshot of the main page of the Web Control UI.
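For what it's worth, a couple of quick checks can confirm both points, assuming the nvidia-smi and clinfo utilities are installed (they may not be on every distro):
$ nvidia-smi                        # confirms the NVIDIA driver is loaded and the GPU is visible
$ clinfo | grep -i "platform name"  # lists the OpenCL platforms the client can see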
Please make sure your system is stable.
I've been running F@H for several months with no issues.
I went through my older logs.
grep -r -c "WU_STALLED" ./*
./log-20240403-200138.txt:2
./log-20240425-074155.txt:0
./log-20240425-084825.txt:0
./log-20240429-010320.txt:0
./log-20240429-122529.txt:0
./log-20240502-124030.txt:0
./log-20240502-175727.txt:0
./log-20240503-053423.txt:0
./log-20240503-121240.txt:0
./log-20240504-161733.txt:1
./log-20240504-164751.txt:0
./log-20240530-080213.txt:2
./log-20240530-080852.txt:0
./log-20240530-082822.txt:0
./log-20240531-072223.txt:0
./log-20240601-072430.txt:0
./log-20240723-023534.txt:0
./log-20240723-071846.txt:0
./log-20240723-235958.txt:0
./log-20240724-071854.txt:0
./log-20240724-235959.txt:0
./log-20240725-071912.txt:0
./log-20240726-013134.txt:0
./log-20240726-013313.txt:0
./log-20240726-014553.txt:0
./log-20240726-014859.txt:0
./log-20240726-020403.txt:0
./log-20240726-021628.txt:0
./log-20240726-022027.txt:0
./log-20240726-023104.txt:0
./log-20240726-024043.txt:0
./log-20240726-024214.txt:0
./log-20240726-034457.txt:0
./log-20240726-035118.txt:0
./log-20240726-040110.txt:0
./log-20240726-040206.txt:0
./log-20240726-041234.txt:0
./log-20240726-235959.txt:621
./log-20240727-042610.txt:227
./log-20240727-235958.txt:479
./log-20240728-042705.txt:162
./log-20240728-235958.txt:1469
./log-20240729-042758.txt:240
./log-20240729-235958.txt:668
./log-20240730-042605.txt:109
./log-20240730-235958.txt:349
./log-20240731-042803.txt:22
./log-20240731-235958.txt:1486
./log-20240801-042753.txt:387
./log-20240801-235959.txt:974
./log-20240802-042748.txt:34
./log-20240802-235958.txt:195
./log-20240803-042822.txt:60
./log-20240803-235958.txt:287
./log-20240804-042814.txt:136
./log-20240804-181048.txt:824
./log-20240804-235957.txt:76
./log-20240805-042818.txt:412
./log-20240805-235959.txt:1787
./log-20240806-000000.txt:1
./log-20240806-042810.txt:129
./log-20240806-235959.txt:1168
./log-20240807-041142.txt:43
These log entries lead me to believe it is not the CPU. On August 26, I updated to 8.3 and added a 4060ti to the system.
Also are you sure you are folding on only a CPU and nVidia GPU?
That's what the v8.3 UI is saying. The iGPU is AMD and the GPU is nvidia. I only have the driver for nvidia installed; the f@h app has never considered the iGPU as available for use.
Full system specs are always required in these kinds of inquiries. Full logs are also helpful, as well as a screenshot of the main page of the Web Control UI.
Here's another screenshot, including the iGPU that isn't supported.
After I made this issue, I tried upgrading the GPU driver from v550 to v560. That did not make a difference. After that, I disabled folding on the 4060ti. It is still showing a bunch of WU_STALLED errors (you can see them in the grep output above).
I have plex, ghost, a minecraft server, home assistant, and mysql all running on this machine and they are not having any issues with the CPU.
Later today I'm going to take out the 4060ti from the system and see if v8.3 stalls out again.
The core that failed with INTERRUPTED in the example log you show above is 0xa8. I think you may have actually paused this WU. It shouldn't have dumped the WU, but I think that was caused by a pausing bug in v8 that was recently fixed.
The other issue, WU_STALLED with core 0x23, normally indicates that the science code in the core is not responding, i.e. not giving any updates to the core wrapper code. But it is happening after only running for 1 second, so there hasn't been sufficient time for the core to actually stall. Likely it just happened to return the error code 127.
Since this started happening regularly when you installed the 4060ti, there's a good chance that is the culprit. I assume you mean you installed it on July 26th, not August.
It's interesting that core 0x23 doesn't actually print anything before it crashes. Also, it says Core was killed, which means some other process killed the core. It is likely your Linux system killing the process. Check your system logs for OOM (Out of Memory) entries.
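For example, something like this should surface OOM-killer activity (assuming a systemd-based distro; plain dmesg works too):
$ journalctl -k | grep -i "out of memory"
$ dmesg | grep -i oom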
Please state your full system specs here, attach a few logs which contain the stalled issue, and include OS details.
The fact that you have been folding stably for months does not mean that things can't break over time. FAH is extremely resource heavy. Poorly built or poorly cooled GPUs might die over time. Your log snippets posted in the original comment show your CPU crapping out. It is possible that some sort of OS update might have broken things; it is not uncommon in the Linux world for a distro update to brick the OS. Have you considered launching a VM with Windows inside to see how that folds on the CPU?
Each of these steps was run for ~24 hours.
I looked into why this one update would fix it. It turns out that the package libexpat1 is needed. I looked into bastet, and libexpat1 is not in the .deb. It is, however, required as part of the scons package that is needed for development.
I looked through my logs and found this:
# Cores that had WU_STALLED
5 /app/usr/bin/FAHCoreWrapper
1215 /config/cores/fahcore-a8-lin-64bit-avx2_256-0.0.12/FahCore_a8
11357 /config/cores/openmm-core-23/centos-7.9.2009-64bit/release/fahcore-23-centos-7.9.2009-64bit-release-8.0.3/FahCore_23
# All cores:
209 Running FahCore: /app/usr/bin/FAHCoreWrapper /config/cores/cores.foldingathome.org/lin/64bit-avx2-256/a8-0.0.12/Core_a8.fah/
10769 Running FahCore: /config/cores/fahcore-a8-lin-64bit-avx2_256-0.0.12/
362 Running FahCore: /config/cores/gromacs-core-a9/debian-stable-64bit/cpu-avx2_256-release/fahcore-a9-debian-stable-64bit-cpu-avx2_256-release-0.0.12/
251 Running FahCore: /config/cores/openmm-core-22/fahcore-22-linux-64bit-release-0.0.20/
12575 Running FahCore: /config/cores/openmm-core-23/centos-7.9.2009-64bit/release/fahcore-23-centos-7.9.2009-64bit-release-8.0.3/
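For reference, counts like these can be generated with a pipeline roughly like the one below; the pattern and field extraction are guesses and may need adjusting to the actual log-line format:
$ grep -h "Running FahCore:" ./log-*.txt | sed 's/.*Running FahCore: //' | awk '{print $1}' | sort | uniq -c | sort -n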
This leads me to believe that not all of the cores are having issues, which points to a software issue.
My guess is that the problematic cores are using some XML parsing on GPUs, which cbang is abstracting away. Because cbang is abstracting it away, the calls to the .so aren't detected as a dependency. That call then causes a seg fault, which isn't caught. That would explain the sudden demise of the cores in the logs.
Since the local dev environments have libexpat1 because of scons, none of the developers would run into this issue.
Let me know if you'd like me to submit a PR for this fix. I'm also happy to help dockerize and create build/test/release pipelines, that's what I do for a day job. 🙂
Thanks for your patience!
Thank you for going to all this trouble.
Core 0x23 is linked to /lib/x86_64-linux-gnu/libexpat.so.1 but 0xa8 is not. It's more likely a GLIBC mismatch.
A system lacking libexpat.so.1 will definitely fail core 0x23. I'm adding libexpat1 as an install dependency for v8 to work around this.
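You can check the linkage yourself with ldd; the paths below are copied from the logs earlier in this thread and may differ on your install:
$ ldd /config/cores/openmm-core-23/centos-7.9.2009-64bit/release/fahcore-23-centos-7.9.2009-64bit-release-8.0.3/FahCore_23 | grep -E "expat|not found"
$ ldd /config/cores/fahcore-a8-lin-64bit-avx2_256-0.0.12/FahCore_a8 | grep -E "expat|not found"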
Core 0xa8 must have failed for some other reason. The failure rate in your logs is about 11% for 0xa8. I wonder whether the failures are all in a group or spread out. I would be interested to see timestamps on the 0xa8 failures.
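Something along these lines might pull them out; the pattern and the context window are guesses, so adjust them to whatever your 0xa8 failure lines actually look like (each log line starts with an HH:MM:SS timestamp and the file name carries the date):
$ grep -H -B20 "WU_STALLED" ./log-*.txt | grep -E "FahCore_a8|WU_STALLED"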
Also, if you see FAHCoreWrapper in the command then you are looking at v7 logs.
Also, if you run this:
$ cores/openmm-core-23/centos-7.9.2009-64bit/release/fahcore-23-centos-7.9.2009-64bit-release-8.0.3/FahCore_23
cores/openmm-core-23/centos-7.9.2009-64bit/release/fahcore-23-centos-7.9.2009-64bit-release-8.0.3/FahCore_23: error while loading shared libraries: libOpenMM.so.8.0: cannot open shared object file: No such file or directory
$ echo $?
127
You can see that the return code when a core fails to find a library is 127, which unfortunately is the same as our WU_STALLED return code. So the client cannot tell the difference and reports a missing library as WU_STALLED. Interestingly, this is also the same return code you get for "command not found" in Linux.
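For comparison, here is the same exit code from a missing command:
$ this-command-does-not-exist
bash: this-command-does-not-exist: command not found
$ echo $?
127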
Addressed by several changes in v8.4.4.
@jcoffland would it be possible to get my bonus turned back on? I'm at 56.45% "finished". I'm getting about 1.2m ppd currently, but the app is estimating near 7 million.
Normally no. I could make an exception if I knew your user name.
How this started
I have a small script that pulls my current score and puts it with the date into a CSV so I can watch my points grow. I noticed over the past few days that the points I was gaining each day were about 1/5 of what my estimated PPD was. After a lot of googling, I found https://apps.foldingathome.org/bonus and saw I was not getting bonus points.
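For context, here is a minimal sketch of that kind of script. It assumes a stats endpoint like https://api.foldingathome.org/user/<name> that returns JSON with a score field, plus curl and jq; the real endpoint and field names may differ:
#!/bin/sh
# Append "date,score" to a CSV once per run.
FAH_USER="your-donor-name"   # hypothetical placeholder, not a real account
SCORE=$(curl -s "https://api.foldingathome.org/user/$FAH_USER" | jq -r '.score')
echo "$(date +%F),$SCORE" >> points.csv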
I was shocked to see so many expired WUs! And the low finished percent explains why I'm not getting the bonus.
Log Hunting
I looked through my logs and found this:
As I'm writing this, it finished a WU, then did 13 more dumps. Here's an example of a WU that was dumped.
Here's another WU that is 'interrupted', and somehow it is being killed?
Machine Info
I have it set to work using 6 CPU cores and the GPU. The stalled WUs seem to only happen on the GPU.
Plea
I don't know what to do. Would someone please help?