FoldingAtHome / fah-client-bastet

Folding@home client, code named Bastet
GNU General Public License v3.0

CPU folding interrupted when GPU WU completes #131

Closed: jon-ault closed this issue 7 months ago

jon-ault commented 1 year ago

On my Windows 11 machine running client 8.1.16, when a GPU work unit completes, the CPU work unit gets interrupted and restarts. In the following log snippet, WU922 is the GPU WU that completes, and WU920 is the CPU WU that gets interrupted.

Log showing restart

```
07:33:30:I1::WU922:Completed 2500000 out of 2500000 steps (100%)
07:33:30:I1::WU922:Average performance: 83.8835 ns/day
07:33:30:I1::WU922:Checkpoint completed at step 2500000
07:33:33:I1::WU922:Saving result file ..\logfile_01.txt
07:33:33:I1::WU922:Saving result file checkpointIntegrator.xml
07:33:33:I1::WU922:Saving result file checkpointState.xml
07:33:33:I1::WU922:Saving result file positions.xtc
07:33:33:I1::WU922:Saving result file science.log
07:33:33:I1::WU922:Saving result file xtcAtoms.csv.bz2
07:33:33:I1::WU922:Folding@home Core Shutdown: FINISHED_UNIT
07:33:34:I1::WU922:Core returned FINISHED_UNIT (100)
07:33:35:I1::Added new work unit: cpus:0 gpus:gpu:02:00:00
07:33:35:I1::WU926:Requesting WU assignment for user Jon_Ault team 35054
07:33:35:I1:OUT23:> POST https://assign1.foldingathome.org/api/assign HTTP/1.1
07:33:35:I3:Connecting to assign1.foldingathome.org:443
07:33:35:I1::WU922:Uploading WU results
07:33:35:I1::WU920:WARNING:Console control signal 1 on PID 2940
07:33:35:I1::WU920:Exiting, please wait. . .
07:33:35:I1:OUT24:> POST https://vav19.fah.temple.edu/api/results HTTP/1.1
07:33:35:I3:Connecting to vav19.fah.temple.edu:443
07:33:35:I1:OUT23:< assign1.foldingathome.org:443 HTTP/1.1 200 HTTP_OK
07:33:35:I1::WU926:Received WU assignment EYMQte3vhl9fHqBlTEONqLu9GlAgaxbsmW6yWHtoSrw
07:33:35:I1::WU926:Downloading WU
07:33:35:I1:OUT25:> POST https://ds03.scs.illinois.edu/api/assign HTTP/1.1
07:33:35:I3:Connecting to ds03.scs.illinois.edu:443
07:33:35:I1::WU920:Folding@home Core Shutdown: INTERRUPTED
07:33:36:I1::WU920:Core returned INTERRUPTED (102)
07:33:36:I3::WU920:Running FahCore: C:\ProgramData\FAHClient\cores/fahcore-a8-win-64bit-avx2_256-0.0.12/FahCore_a8.exe -dir wmJyf7aUkCkJuXiFRGgKqUA42ncVRtkAiAnOXnmIMmY -suffix 01 -version 8.1.16 -lifeline 2404 -np 6
07:33:36:I3::WU920:Started FahCore on PID 19696
07:33:37:I1::WU920:*********************** Log Started 2023-03-10T07:33:36Z ***********************
07:33:37:I1::WU920:************************** Gromacs Folding@home Core ***************************
07:33:37:I1::WU920: Core: Gromacs
07:33:37:I1::WU920: Type: 0xa8
07:33:37:I1::WU920: Version: 0.0.12
07:33:37:I1::WU920: Author: Joseph Coffland
07:33:37:I1::WU920: Copyright: 2020 foldingathome.org
07:33:37:I1::WU920: Homepage: https://foldingathome.org/
07:33:37:I1::WU920: Date: Jan 16 2021
07:33:37:I1::WU920: Time: 12:29:40
07:33:37:I1::WU920: Revision: c5816759c404e4b65f9f364c3d1ef554a67c4225
07:33:37:I1::WU920: Branch: master
07:33:37:I1::WU920: Compiler: Visual C++ 2019 16.7
07:33:37:I1::WU920: Options: /TP /std:c++14 /nologo /EHa /wd4297 /wd4103 /O2 /Zc:throwingNew /MT
07:33:37:I1::WU920: Platform: win32 10
07:33:37:I1::WU920: Bits: 64
07:33:37:I1::WU920: Mode: Release
07:33:37:I1::WU920: SIMD: avx2_256
07:33:37:I1::WU920: OpenMP: ON
07:33:37:I1::WU920: CUDA: OFF
07:33:37:I1::WU920: Args: -dir wmJyf7aUkCkJuXiFRGgKqUA42ncVRtkAiAnOXnmIMmY -suffix 01
07:33:37:I1::WU920: -version 8.1.16 -lifeline 2404 -np 6
07:33:37:I1::WU920:************************************ libFAH ************************************
07:33:37:I1::WU920: Date: Jan 16 2021
07:33:37:I1::WU920: Time: 11:24:13
07:33:37:I1::WU920: Revision: c5816759c404e4b65f9f364c3d1ef554a67c4225
07:33:37:I1::WU920: Branch: master
07:33:37:I1::WU920: Compiler: Visual C++ 2019 16.7
07:33:37:I1::WU920: Options: /TP /std:c++14 /nologo /EHa /wd4297 /wd4103 /O2 /Zc:throwingNew /MT
07:33:37:I1::WU920: Platform: win32 10
07:33:37:I1::WU920: Bits: 64
07:33:37:I1::WU920: Mode: Release
07:33:37:I1::WU920:************************************ CBang *************************************
07:33:37:I1::WU920: Date: Jan 16 2021
07:33:37:I1::WU920: Time: 11:23:53
07:33:37:I1::WU920: Revision: c5816759c404e4b65f9f364c3d1ef554a67c4225
07:33:37:I1::WU920: Branch: master
07:33:37:I1::WU920: Compiler: Visual C++ 2019 16.7
07:33:37:I1::WU920: Options: /TP /std:c++14 /nologo /EHa /wd4297 /wd4103 /O2 /Zc:throwingNew /MT
07:33:37:I1::WU920: Platform: win32 10
07:33:37:I1::WU920: Bits: 64
07:33:37:I1::WU920: Mode: Release
07:33:37:I1::WU920:************************************ System ************************************
07:33:37:I1::WU920: CPU: Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz
07:33:37:I1::WU920: CPU ID: GenuineIntel Family 6 Model 158 Stepping 12
07:33:37:I1::WU920: CPUs: 8
07:33:37:I1::WU920: Memory: 15.91GiB
07:33:37:I1::WU920:Free Memory: 9.47GiB
07:33:37:I1::WU920: Threads: WINDOWS_THREADS
07:33:37:I1::WU920: OS Version: 6.2
07:33:37:I1::WU920:Has Battery: true
07:33:37:I1::WU920: On Battery: false
07:33:37:I1::WU920: UTC Offset: -6
07:33:37:I1::WU920: PID: 19696
07:33:37:I1::WU920: CWD: C:\ProgramData\FAHClient\work
07:33:37:I1::WU920:********************************************************************************
07:33:37:I1::WU920:Project: 16996 (Run 7, Clone 8, Gen 182)
07:33:37:I1::WU920:Unit: 0x00000000000000000000000000000000
07:33:37:I1::WU920:Digital signatures verified
07:33:37:I1::WU920:Calling: mdrun -c frame182.gro -s frame182.tpr -x frame182.xtc -cpi state.cpt -cpt 5 -nt 6 -ntmpi 1
07:33:37:I1::WU920:Steps: first=455000000 total=457500000
07:33:37:I1::WU920:Completed 825902 out of 2500000 steps (33%)
07:33:42:I1:OUT25:< ds03.scs.illinois.edu:443 HTTP/1.1 200 HTTP_OK
07:33:42:I1::WU926:Received WU
07:33:42:I1::WU926:CORE 100% 1B of 1B
07:33:42:I3::WU926:Running FahCore: C:\ProgramData\FAHClient\cores/openmm-core-22/fahcore-22-windows-64bit-release-0.0.20/FahCore_22.exe -dir EYMQte3vhl9fHqBlTEONqLu9GlAgaxbsmW6yWHtoSrw -suffix 01 -version 8.1.16 -lifeline 2404 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-platform 0 -cuda-device 0 -gpu 0
07:33:42:I3::WU926:Started FahCore on PID 17536
07:33:43:I1::WU926:*********************** Log Started 2023-03-10T07:33:42Z ***********************
07:33:43:I1::WU926:*************************** Core22 Folding@home Core ***************************
07:33:43:I1::WU926: Core: Core22
07:33:43:I1::WU926: Type: 0x22
07:33:43:I1::WU926: Version: 0.0.20
07:33:43:I1::WU926: Author: Joseph Coffland
07:33:43:I1::WU926: Copyright: 2020 foldingathome.org
07:33:43:I1::WU926: Homepage: https://foldingathome.org/
07:33:43:I1::WU926: Date: Jan 20 2022
07:33:43:I1::WU926: Time: 01:15:36
07:33:43:I1::WU926: Revision: 3f211b8a4346514edbff34e3cb1c0e0ec951373c
07:33:43:I1::WU926: Branch: HEAD
07:33:43:I1::WU926: Compiler: Visual C++
07:33:43:I1::WU926: Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Zc:throwingNew /MT
07:33:43:I1::WU926: -DOPENMM_VERSION="\"7.7.0\""
07:33:43:I1::WU926: Platform: win32 10
07:33:43:I1::WU926: Bits: 64
07:33:43:I1::WU926: Mode: Release
07:33:43:I1::WU926:Maintainers: John Chodera and Peter Eastman
07:33:43:I1::WU926:
07:33:43:I1::WU926: Args: -dir EYMQte3vhl9fHqBlTEONqLu9GlAgaxbsmW6yWHtoSrw -suffix 01
07:33:43:I1::WU926: -version 8.1.16 -lifeline 2404 -gpu-vendor nvidia -opencl-platform
07:33:43:I1::WU926: 0 -opencl-device 0 -cuda-platform 0 -cuda-device 0 -gpu 0
07:33:43:I1::WU926:************************************ libFAH ************************************
07:33:43:I1::WU926: Date: Jan 20 2022
07:33:43:I1::WU926: Time: 01:14:17
07:33:43:I1::WU926: Revision: 9f4ad694e75c2350d4bb6b8b5b769ba27e483a2f
07:33:43:I1::WU926: Branch: HEAD
07:33:43:I1::WU926: Compiler: Visual C++
07:33:43:I1::WU926: Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Zc:throwingNew /MT
07:33:43:I1::WU926: Platform: win32 10
07:33:43:I1::WU926: Bits: 64
07:33:43:I1::WU926: Mode: Release
07:33:43:I1::WU926:************************************ CBang *************************************
07:33:43:I1::WU926: Date: Jan 20 2022
07:33:43:I1::WU926: Time: 01:13:20
07:33:43:I1::WU926: Revision: ab023d155b446906d55b0f6c9a1eedeea04f7a1a
07:33:43:I1::WU926: Branch: HEAD
07:33:43:I1::WU926: Compiler: Visual C++
07:33:43:I1::WU926: Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Zc:throwingNew /MT
07:33:43:I1::WU926: Platform: win32 10
07:33:43:I1::WU926: Bits: 64
07:33:43:I1::WU926: Mode: Release
07:33:43:I1::WU926:************************************ System ************************************
07:33:43:I1::WU926: CPU: Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz
07:33:43:I1::WU926: CPU ID: GenuineIntel Family 6 Model 158 Stepping 12
07:33:43:I1::WU926: CPUs: 8
07:33:43:I1::WU926: Memory: 15.91GiB
07:33:43:I1::WU926:Free Memory: 9.38GiB
07:33:43:I1::WU926: Threads: WINDOWS_THREADS
07:33:43:I1::WU926: OS Version: 6.2
07:33:43:I1::WU926:Has Battery: true
07:33:43:I1::WU926: On Battery: false
07:33:43:I1::WU926: UTC Offset: -6
07:33:43:I1::WU926: PID: 17536
07:33:43:I1::WU926: CWD: C:\ProgramData\FAHClient\work
07:33:43:I1::WU926:************************************ OpenMM ************************************
07:33:43:I1::WU926: Version: 7.7.0
07:33:43:I1::WU926:********************************************************************************
07:33:43:I1::WU926:Project: 19218 (Run 326, Clone 3, Gen 16)
07:33:43:I1::WU926:Reading tar file core.xml
07:33:43:I1::WU926:Reading tar file integrator.xml
07:33:43:I1::WU926:Reading tar file state.xml
07:33:43:I1::WU926:Reading tar file system.xml
07:33:43:I1::WU926:Digital signatures verified
07:33:43:I1::WU926:Folding@home GPU Core22 Folding@home Core
07:33:43:I1::WU926:Version 0.0.20
07:33:43:I1::WU926: Checkpoint write interval: 62500 steps (5%) [20 total]
07:33:43:I1::WU926: JSON viewer frame write interval: 12500 steps (1%) [100 total]
07:33:43:I1::WU926: XTC frame write interval: 25000 steps (2%) [50 total]
07:33:43:I1::WU926: Global context and integrator variables write interval: disabled
07:33:43:I1::WU926:There are 4 platforms available.
07:33:43:I1::WU926:Platform 0: Reference
07:33:43:I1::WU926:Platform 1: CPU
07:33:43:I1::WU926:Platform 2: OpenCL
07:33:43:I1::WU926: opencl-device 0 specified
07:33:43:I1::WU926:Platform 3: CUDA
07:33:43:I1::WU926: cuda-device 0 specified
07:33:55:I1::WU926:Attempting to create CUDA context:
07:33:55:I1::WU926: Configuring platform CUDA
07:34:00:I1::WU926: Using CUDA and gpu 0
07:34:00:I1::WU926:Completed 0 out of 1250000 steps (0%)
07:34:01:I1::WU926:Checkpoint completed at step 0
07:34:06:I1:OUT24:< vav19.fah.temple.edu:443 HTTP/1.1 200 HTTP_OK
07:34:06:I1::WU922:Credited
07:34:21:I1::WU926:Completed 12500 out of 1250000 steps (1%)
```

jon-ault commented 1 year ago

Also, if there's a problem downloading a new GPU work unit & the client has to make multiple requests to get one, the CPU work unit gets interrupted on each server request.
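
For anyone wanting to check their own log for the same pattern, here is a minimal Python sketch. It assumes the line layout seen in the snippet above (`HH:MM:SS:I<level>::WU<n>:<message>`) and looks for a `Core returned INTERRUPTED` line that follows a `Core returned FINISHED_UNIT` line from a different WU within a few seconds. The function name `find_interruptions` and the 5-second default window are illustrative choices, not part of the client.

```python
import re

# Matches per-WU log lines such as:
#   07:33:34:I1::WU922:Core returned FINISHED_UNIT (100)
# Lines without a WU tag (e.g. "07:33:35:I3:Connecting to ...") do not match.
LINE_RE = re.compile(r"^(\d\d):(\d\d):(\d\d):I\d::(WU\d+):(.*)$")


def find_interruptions(lines, window_seconds=5):
    """Return (finished_wu, interrupted_wu, seconds_apart) tuples.

    A hit is reported when one WU logs INTERRUPTED no more than
    `window_seconds` after a different WU logged FINISHED_UNIT.
    Timestamps are time-of-day only, so a log that crosses midnight
    is not handled by this sketch.
    """
    finished = []  # (seconds_since_midnight, wu) for each FINISHED_UNIT
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        hours, minutes, seconds, wu, message = match.groups()
        now = int(hours) * 3600 + int(minutes) * 60 + int(seconds)
        if message.startswith("Core returned FINISHED_UNIT"):
            finished.append((now, wu))
        elif message.startswith("Core returned INTERRUPTED"):
            for finished_at, finished_wu in finished:
                if finished_wu != wu and 0 <= now - finished_at <= window_seconds:
                    hits.append((finished_wu, wu, now - finished_at))
    return hits


if __name__ == "__main__":
    sample = [
        "07:33:34:I1::WU922:Core returned FINISHED_UNIT (100)",
        "07:33:35:I1::WU920:WARNING:Console control signal 1 on PID 2940",
        "07:33:36:I1::WU920:Core returned INTERRUPTED (102)",
    ]
    for finished_wu, interrupted_wu, gap in find_interruptions(sample):
        print(f"{interrupted_wu} interrupted {gap}s after {finished_wu} finished")
```

Running it against a saved log file would just mean passing the open file object as `lines`. It only flags the timing correlation; it does not say why the interrupt happened.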

jcoffland commented 1 year ago

What's happening is that the CPU WU is adjusting to take up or give back the extra CPU that the GPU WU uses.
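
As a rough illustration of that explanation, here is a toy Python model. It is not the client's scheduling code: the function names, the one-CPU-per-GPU-WU reservation, and the 4-CPU machine are all assumptions made for the example. The idea is only that the CPU WU's thread count depends on how many GPU WUs are active, and any change in that count means the CPU core has to be stopped and relaunched with a new thread count.

```python
def cpu_threads_for_cpu_wu(total_cpus, active_gpu_wus, cpus_per_gpu_wu=1):
    """Threads left for the CPU WU after each active GPU WU reserves its share.

    Hypothetical rule for illustration: every running GPU WU holds back
    `cpus_per_gpu_wu` CPUs, and the CPU WU gets whatever remains.
    """
    reserved = active_gpu_wus * cpus_per_gpu_wu
    return max(0, total_cpus - reserved)


def replay(total_cpus, gpu_wu_counts):
    """Walk through a sequence of active-GPU-WU counts.

    Returns one (gpu_wus, cpu_threads, restart_needed) tuple per step.
    A restart is flagged whenever the CPU WU's thread count differs
    from the previous step.
    """
    steps = []
    previous = None
    for count in gpu_wu_counts:
        threads = cpu_threads_for_cpu_wu(total_cpus, count)
        restart_needed = previous is not None and threads != previous
        steps.append((count, threads, restart_needed))
        previous = threads
    return steps


if __name__ == "__main__":
    # GPU WU running, GPU WU finishes, replacement GPU WU starts.
    for gpu_wus, threads, restart in replay(total_cpus=4, gpu_wu_counts=[1, 0, 1]):
        print(f"gpu_wus={gpu_wus} cpu_threads={threads} restart={restart}")
```

In this model, the gap between one GPU WU finishing and the next one starting produces two thread-count changes in a row, which matches the kind of back-to-back interruption described in the report.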

jcoffland commented 7 months ago

This should be fixed.