GPU task switching causes computation errors for asteroids@home when using 2 or more different models of gpu of the same type

sr1gh commented 3 months ago

Describe the bug

If GPU computation is suspended while the computer is in use or while an exclusive application is running, BOINC sometimes swaps which task is on which GPU when computation resumes. This causes a computation error for asteroids@home tasks when using multiple different AMD GPUs, for example an RX 7600 XT and an RX 6600. This might be an application-specific issue, but it might be a good idea to have an option not to switch tasks between GPUs where possible, unless, for example, one GPU is removed, in which case all the tasks would have to run on the remaining GPU.

Steps To Reproduce

Start the asteroids@home period search application on a system with 2 different AMD GPUs.
Suspend computation partway through, observing which task is on which GPU, then resume computation.
Repeat if necessary until BOINC switches tasks between GPUs, resulting in a computation error.

Expected behavior

I would expect the task to stay on the GPU it started on if that is necessary for the task to finish. An option to disable GPU task switching is a potential solution, or tasks could specify whether or not they can be switched.

Screenshots

System Information

OS: Windows 10 (Latest)
BOINC Version: 8.0.4

Additional context

AenBleidd commented 3 months ago

I'm pretty sure this is an issue with the project application, because on start-up we assign every task the ID (0, 1, 2, etc.) of the GPU to be used. If the project application doesn't use it but instead relies on some other mechanism, then there might be a collision. @sr1gh, may I ask you for a favor? Could you please go to %BOINCDATA%\slots\%N%\, where

%BOINCDATA% is the folder where your BOINC data is located (usually C:\ProgramData\BOINC)
%N% is the number of the slot

Locate the two running tasks there in the %N% folders (two different folders), open their init_data.xml files, and check the <gpu_device_num> value. These numbers should be different for different tasks, but should stay consistent after a task is suspended and run again. If these numbers stay the same but the application crashes, then this is definitely an issue with the project application, and it should be reported to their admins. Either way, please report the results back to us, and we will check that there is no issue on our side.
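The check described above can also be scripted. The following is a minimal sketch, assuming each slot folder holds an init_data.xml containing a single <gpu_device_num> element; the helper names `read_gpu_device_nums` and `find_collisions` are hypothetical, and treating negative values as "no GPU" is an assumption rather than documented behavior.

```python
import re
from pathlib import Path

# A regex is used instead of a strict XML parser, on the assumption that
# the file may not always be well-formed XML.
_GPU_NUM = re.compile(r"<gpu_device_num>\s*(-?\d+)\s*</gpu_device_num>")


def read_gpu_device_nums(boinc_data_dir):
    """Return {slot_name: gpu_device_num} for every slot with an init_data.xml."""
    result = {}
    slots = Path(boinc_data_dir) / "slots"
    for init_file in sorted(slots.glob("*/init_data.xml")):
        match = _GPU_NUM.search(init_file.read_text(errors="replace"))
        if match:
            result[init_file.parent.name] = int(match.group(1))
    return result


def find_collisions(device_nums):
    """Return {gpu_device_num: [slots]} for device numbers shared by several slots."""
    by_device = {}
    for slot, num in device_nums.items():
        if num < 0:
            # Assumption: a negative number means the task uses no GPU.
            continue
        by_device.setdefault(num, []).append(slot)
    return {num: slots for num, slots in by_device.items() if len(slots) > 1}
```

Running `read_gpu_device_nums(r"C:\ProgramData\BOINC")` before suspending and again after resuming shows whether a task's device number changed. A non-empty result from `find_collisions` is only suspicious if the client is configured to run one task per GPU.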

davidpanderson commented 3 months ago

AFAIK the client doesn't have a mechanism for pinning a job to a GPU. I need to verify that it rewrites init_data.xml before restarting a job; otherwise a collision could happen.

RichardHaselgrove commented 3 months ago

The same issue is a significant problem at GPUGrid.

init_data contains the correct device number for a running task. But if BOINC is stopped and restarted, there is no guarantee that BOINC will assign the same GPU. If the new GPU is identical to the one used in the previous run, the task restarts normally.

But if it is not identical, the task crashes, potentially losing several hours of work. The crash is initiated by the project application, but could be prevented by the BOINC client remembering and reusing the device allocation at startup.

NB consider respecting previous OpenCL device numbers too, although I've only seen the problem for cuda apps.
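To illustrate what "remembering and reusing the device allocation" could look like, here is a minimal sketch. It is not BOINC client code: the function `assign_gpus`, the task names, and the one-task-per-GPU rule are all assumptions made for the example.

```python
def assign_gpus(tasks, previous, num_gpus):
    """Assign each task a GPU index, reusing its previous GPU when still free.

    `previous` maps task name -> GPU index from the last run (hypothetical
    state the client would have to persist across restarts).
    """
    assignment = {}
    free = set(range(num_gpus))
    # First pass: honour remembered assignments where the GPU still exists.
    for task in tasks:
        gpu = previous.get(task)
        if gpu in free:
            assignment[task] = gpu
            free.discard(gpu)
    # Second pass: give remaining tasks the lowest-numbered free GPU.
    for task in tasks:
        if task not in assignment and free:
            gpu = min(free)
            assignment[task] = gpu
            free.discard(gpu)
    return assignment
```

With `previous = {"t1": 1, "t2": 0}` and two GPUs, both tasks keep their devices regardless of the order in which they are restarted. If a GPU has been removed, the task that was on it is left unassigned in this sketch and would have to wait or restart from scratch.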

davidpanderson commented 3 months ago

The issue is whether the task crashes because it runs on a different GPU than the one where it started, or because 2 tasks are trying to use the same GPU.

The former seems odd - why would a checkpoint file be specific to a GPU instance?

RichardHaselgrove commented 3 months ago

I'm looking through my recent errors for an example of the specific failure case, but I haven't found one yet.

From memory, the problem comes from the 'just in time' GPU code compiler. At GPUGrid, this produces code which is specific to the individual GPU type used in the first run. If the second GPU is different, the by-now pre-compiled code is incompatible with the hardware.

RichardHaselgrove commented 3 months ago

Can't find an error on my own machines - I know from bitter experience that I have to avoid shutdowns when GPUGrid work is running.

But see https://www.gpugrid.net/forum_thread.php?id=5461 for a report/response on their message board.

AenBleidd commented 3 months ago

The former seems odd - why would a checkpoint file be specific to a GPU instance?

@davidpanderson, as @RichardHaselgrove already mentioned, it's very important that a task that started to run on a particular GPU sticks to it forever; otherwise it's not guaranteed that the computation can be continued, even from the checkpoint.

sr1gh commented 3 months ago

Yes, it appears that some GPU apps generate the code for the specific hardware used. Here is the error output from a failed task from asteroids at home.

BOINC client version 8.0.4
BOINC GPU type 'ATI', deviceId=1, slot=0
Application: period_search_10220_windows_x86_64__opencl_102_amd_win.exe
Version: 102.20.0.0
Platform name: AMD Accelerated Parallel Processing
Platform vendor: Advanced Micro Devices, Inc.
OpenCL device C version: OpenCL C 2.0 | OpenCL 2.0 AMD-APP (3617.0)
OpenCL device Id: 1
OpenCL device name: AMD Radeon RX 6600 7GB
Device driver version: 3617.0 (PAL,LC)
Multiprocessors: 14
Max Samplers: 16
Max work item dimensions: 3
Resident blocks per multiprocessor: 16
Grid dim: 448 = 2 * 14 * 16
Block dim: 128
Binary build log for AMD Radeon RX 6600: OK (0)
Program build log for AMD Radeon RX 6600: OK (0)
Prefered kernel work group size multiple: 32
Setting Grid Dim to 256
Platform name: AMD Accelerated Parallel Processing
Platform vendor: Advanced Micro Devices, Inc.
OpenCL device C version: OpenCL C 2.0 | OpenCL 2.0 AMD-APP (3617.0)
OpenCL device Id: 0
OpenCL device name: AMD Radeon RX 7600 XT 15GB
Device driver version: 3617.0 (PAL,LC)
Multiprocessors: 16
Max Samplers: 16
Max work item dimensions: 3
Resident blocks per multiprocessor: 16
Grid dim: 512 = 2 * 16 * 16
Block dim: 128
Build log: AMD Accelerated Parallel Processing | AMD Radeon RX 7600 XT:
Error: The program ISA amdgcn-amd-amdhsa--gfx1032 is not compatible with the device ISA amdgcn-amd-amdhsa--gfx1102
Error: create kernel metadata map using COMgr
Error: Cannot Find Global Var Sizes
Error: Cannot create kernels.
Error creating queue: build program failure(-11)

sr1gh commented 3 months ago

The values appeared the same after resuming computation. However, in BOINC Manager, the task that said "device 0" likely said "device 1" before the error. The error happens immediately after resuming, so it is hard to tell, although I have seen this swap occur with other applications from other projects. The following error from the above post would also indicate that the tasks are sometimes swapping GPUs:

Error: The program ISA amdgcn-amd-amdhsa--gfx1032 is not compatible with the device ISA amdgcn-amd-amdhsa--gfx1102

gfx1032 is RX 6600
gfx1102 is RX 7600 XT
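For anyone checking many failed tasks for this signature, the mismatch line can be picked out of the stderr output automatically. This is a minimal sketch, assuming the error text has the same wording as the line quoted above; `parse_isa_mismatch` is a hypothetical helper, and the name table contains only the two mappings given in this thread.

```python
import re

_ISA_MISMATCH = re.compile(
    r"program ISA (\S+) is not compatible with the device ISA (\S+)"
)

# Only the two targets identified in this thread.
GFX_NAMES = {"gfx1032": "RX 6600", "gfx1102": "RX 7600 XT"}


def parse_isa_mismatch(stderr_text):
    """Return (program_gfx, device_gfx) from an ISA-mismatch line, or None."""
    match = _ISA_MISMATCH.search(stderr_text)
    if match is None:
        return None
    # The gfx target is the last dash-separated component of the ISA string.
    return tuple(isa.rsplit("-", 1)[-1] for isa in match.groups())
```

Applied to the log above, the first element names the GPU the kernel binary was built for and the second names the GPU the task was resumed on. Two different values mean the task moved between devices.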

davidpanderson commented 3 months ago

One option would be for the app to compile its kernels each time it starts.
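A variant of this idea, for apps that want to keep a binary cache, is to key the cached binary on the device so that a different GPU forces a rebuild. The sketch below only illustrates the principle: `load_or_build` and `kernel_cache_path` are hypothetical names, and `build` is a stand-in for the real OpenCL or CUDA compile step, not an API of any project application.

```python
import hashlib
from pathlib import Path


def kernel_cache_path(cache_dir, device_name, driver_version, source):
    """Build a cache file name that changes with device, driver, or source."""
    key = "\n".join((device_name, driver_version, source)).encode("utf-8")
    return Path(cache_dir) / (hashlib.sha256(key).hexdigest() + ".bin")


def load_or_build(cache_dir, device_name, driver_version, source, build):
    """Return a kernel binary for this device, rebuilding when none is cached.

    `build` is a hypothetical callable (source, device_name) -> bytes.
    """
    path = kernel_cache_path(cache_dir, device_name, driver_version, source)
    if path.exists():
        return path.read_bytes()
    binary = build(source, device_name)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(binary)
    return binary
```

Because the device name is part of the key, a task resumed on an RX 7600 XT never picks up a binary compiled for an RX 6600. It pays the compile cost once per device instead of failing.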

davidpanderson commented 3 months ago

If we pin each GPU job to a GPU instance, the following could happen: jobs A and B are running on GPUs 0 and 1 respectively. Job C arrives, with an early deadline, so it preempts job A and starts running on GPU 0. Job B finishes.

We now have 2 jobs pinned to GPU 0; GPU 1 is idle. The work fetch logic (which doesn't know about GPU assignments) thinks that both GPUs are busy, so it doesn't fetch more jobs.

To avoid this, we'd have to extend the simulation done by the work fetch logic to model GPU assignments (in addition to per-project GPU exclusions, max concurrency, etc.). This would be quite difficult. It would be better if apps could recompile their kernels on startup.
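The scenario above can be replayed in a few lines. This is a toy model, not the client's work fetch code: `naive_wants_more_work` is a deliberately simplified stand-in that just counts jobs, on the assumption that this is enough to show the mismatch.

```python
def idle_gpus(num_gpus, pinned):
    """Return GPU indices with no job pinned to them (`pinned`: job -> GPU)."""
    busy = set(pinned.values())
    return [g for g in range(num_gpus) if g not in busy]


def naive_wants_more_work(num_gpus, pinned):
    """Toy work-fetch check that counts jobs and ignores where they are pinned."""
    return len(pinned) < num_gpus


pinned = {"A": 0, "B": 1}  # A and B are running on GPUs 0 and 1
pinned["C"] = 0            # C preempts A and starts on GPU 0; A stays pinned there
del pinned["B"]            # B finishes

print(idle_gpus(2, pinned))              # GPU 1 has nothing pinned to it
print(naive_wants_more_work(2, pinned))  # yet the job count says both GPUs are covered
```

After these steps `idle_gpus` reports GPU 1 as idle while the job-counting check still declines to fetch work. Avoiding that gap is why the work fetch simulation would need to model GPU assignments.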

BOINC / boinc

GPU task switching causes computation errors for asteroids@home when using 2 or more different models of gpu of the same type #5743