sr1gh opened this issue 3 months ago
I'm pretty sure this is an issue with the project application, because for every task we assign at start-up the ID (0, 1, 2, etc.) of the GPU to be used. If the project application doesn't use it but instead relies on some other mechanism, then there might be a collision. @sr1gh, may I ask you for a favor? Could you please go to the %BOINCDATA%\slots\%N%\ directory where init_data.xml is located and check the <gpu_device_num> value?
These numbers should be different for different tasks, but should stay consistent after a task is suspended and resumed.
If these numbers stay the same but the application still crashes, then this is definitely an issue with the project application, and it should be reported to their admins.
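For reference, the relevant part of init_data.xml looks roughly like this (abridged; the exact contents vary by client version, and the device numbers shown here are only illustrative):

```xml
<app_init_data>
    <!-- ... other fields ... -->
    <gpu_type>ATI</gpu_type>
    <gpu_device_num>1</gpu_device_num>
    <gpu_opencl_dev_index>1</gpu_opencl_dev_index>
    <!-- ... -->
</app_init_data>
```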
Anyway, please report these results back to us, and we will check that there is no issue on our side.

AFAIK the client doesn't have a mechanism for pinning a job to a GPU. I need to verify that it rewrites init_data.xml before restarting a job; otherwise a collision could happen.
The same issue is a significant problem at GPUGrid.
init_data.xml contains the correct <gpu_device_num>. But if the GPU the task resumes on is not identical to the one it started on, the task crashes, potentially losing several hours of work. The crash is initiated by the project application, but could be prevented by the BOINC client remembering and reusing the device allocation at startup.
NB: consider respecting previous OpenCL device numbers too, although I've only seen the problem for CUDA apps.
The issue is whether the task crashes because it runs on a different GPU than the one it started on, or because two tasks are trying to use the same GPU.
The former seems odd - why would a checkpoint file be specific to a GPU instance?
I'm looking through my recent errors for an example of the specific failure case, but I haven't found one yet.
From memory, the problem comes from the 'just in time' GPU code compiler. At GPUGrid, this produces code which is specific to the individual GPU type used in the first run. If the second GPU is different, the now pre-compiled code is incompatible with the hardware.
Can't find an error on my own machines - I know from bitter experience that I have to avoid shutdowns when GPUGrid work is running.
But see https://www.gpugrid.net/forum_thread.php?id=5461 for a report/response on their message board.
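To make that failure mode concrete, here is a minimal OpenCL-flavoured sketch (not GPUGrid's actual code, which is CUDA-based) of the pattern being described: the kernel is JIT-built for the first device and the resulting binary is cached, so reloading that cache on a different GPU fails.

```cpp
// Minimal OpenCL sketch (illustrative only) of why a kernel binary cached on
// the first run is tied to the device it was built for.
#include <CL/cl.h>
#include <vector>

// First run: build from source for *this* device, then cache the binary.
cl_program build_and_cache(cl_context ctx, cl_device_id dev, const char* src,
                           std::vector<unsigned char>& cache) {
    cl_int err;
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, &err);
    clBuildProgram(prog, 1, &dev, "", nullptr, nullptr);   // JIT for this device only
    size_t size = 0;
    clGetProgramInfo(prog, CL_PROGRAM_BINARY_SIZES, sizeof(size), &size, nullptr);
    cache.resize(size);
    unsigned char* ptr = cache.data();
    clGetProgramInfo(prog, CL_PROGRAM_BINARIES, sizeof(ptr), &ptr, nullptr);
    return prog;
}

// Restart: the cached binary is only valid on the same kind of GPU.
// Loading it on a different device typically fails (e.g. CL_INVALID_BINARY),
// which is what surfaces as a compute error after the client swaps GPUs.
cl_program load_cached(cl_context ctx, cl_device_id dev,
                       const std::vector<unsigned char>& cache, cl_int* status) {
    const unsigned char* ptr = cache.data();
    size_t size = cache.size();
    cl_int err;
    return clCreateProgramWithBinary(ctx, 1, &dev, &size, &ptr, status, &err);
}
```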
> The former seems odd - why would a checkpoint file be specific to a GPU instance?
@davidpanderson, as @RichardHaselgrove already mentioned, it's very important that a task that started to run on a particular GPU sticks to it; otherwise it's not guaranteed that the computation can be continued, even from a checkpoint.
Yes, it appears that some GPU apps generate code for the specific hardware used. Here is the error output from a failed task from Asteroids@home:

Error: The program ISA amdgcn-amd-amdhsa--gfx1032 is not compatible with the device ISA amdgcn-amd-amdhsa--gfx1102

gfx1032 is the RX 6600; gfx1102 is the RX 7600 XT.
One option would be for the app to compile its kernels each time it starts.
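A sketch of one way an app could handle that (purely illustrative; the build_and_cache / load_cached helpers come from the sketch earlier in this thread, not from any real project): reuse a cached binary only when the device matches, and otherwise recompile from source, which also covers the "recompile every start" option as the simplest case.

```cpp
// Illustrative mitigation: fall back to rebuilding from source whenever the
// cached binary was built for a different device than the one we got.
#include <CL/cl.h>
#include <string>
#include <vector>

std::string device_name(cl_device_id dev) {
    char buf[256] = {0};
    clGetDeviceInfo(dev, CL_DEVICE_NAME, sizeof(buf), buf, nullptr);
    return buf;
}

cl_program get_program(cl_context ctx, cl_device_id dev, const char* src,
                       std::vector<unsigned char>& cache,
                       std::string& cached_dev_name) {
    cl_int status = CL_SUCCESS;
    // Keying the cache on the device name is a simplification; a real cache
    // would also consider driver version, build options, etc.
    if (!cache.empty() && cached_dev_name == device_name(dev)) {
        cl_program p = load_cached(ctx, dev, cache, &status);
        if (status == CL_SUCCESS) return p;   // still needs clBuildProgram in a real app
    }
    // Different GPU (or no cache yet): rebuild from source instead of crashing.
    cached_dev_name = device_name(dev);
    return build_and_cache(ctx, dev, src, cache);
}
```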
If we pin each GPU job to a GPU instance, the following could happen: jobs A and B are running on GPUs 0 and 1 respectively. Job C arrives, with an early deadline, so it preempts job A and starts running on GPU 0. Job B finishes.
We now have 2 jobs pinned to GPU 0; GPU 1 is idle. The work fetch logic (which doesn't know about GPU assignments) thinks that both GPUs are busy, so it doesn't fetch more jobs.
To avoid this, we'd have to extend the simulation done by the work fetch logic to model GPU assignments (in addition to per-project GPU exclusions, max concurrency, etc.). This would be quite difficult. It would be better if apps could recompile their kernels on startup.
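A toy model of the scenario above (this is not BOINC's scheduler code, just the counting argument made executable): after C preempts A and B finishes, a work-fetch model that only counts runnable GPU jobs sees two busy GPUs, while under pinning only one device is actually usable.

```cpp
// Toy model of the pinning pitfall; names and structure are invented for
// illustration and do not correspond to BOINC internals.
#include <algorithm>
#include <cstdio>
#include <set>
#include <vector>

struct Job { const char* name; int pinned_device; };

int main() {
    // State after: A,B running on GPUs 0,1 -> C preempts A on GPU 0 -> B finishes.
    std::vector<Job> runnable = { {"A", 0}, {"C", 0} };
    int num_gpus = 2;

    // Today's work fetch assumption: any runnable job can use any GPU.
    int busy_naive = std::min((int)runnable.size(), num_gpus);

    // With pinning, only the distinct pinned devices are actually in use.
    std::set<int> used;
    for (const auto& j : runnable) used.insert(j.pinned_device);
    int busy_pinned = (int)used.size();

    printf("GPUs the work-fetch model thinks are busy: %d\n", busy_naive);    // 2
    printf("GPUs actually usable under pinning:        %d\n", busy_pinned);   // 1
    printf("Idle GPUs that get no new work:            %d\n", num_gpus - busy_pinned);
}
```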
Describe the bug
If GPU computation is suspended (during use, or when an exclusive application is running) and then resumes, BOINC sometimes swaps which task is on which GPU. This causes a computation error for Asteroids@home tasks when using multiple AMD GPUs, for example an RX 7600 XT and an RX 6600. This might be an application-specific issue, but it might be a good idea to have an option to not switch tasks between GPUs if possible, unless, for example, one GPU is removed, in which case all the tasks would have to run on the remaining GPU.
Steps To Reproduce
Expected behavior
I would expect the task to stay on the GPU it started on if that is necessary for the task to finish. An option to disable GPU task switching is a potential solution, or tasks could specify whether or not they can be switched.
Screenshots
System Information
Additional context