ICLDisco / parsec

PaRSEC is a generic framework for architecture aware scheduling and management of micro-tasks on distributed, GPU accelerated, many-core heterogeneous architectures. PaRSEC assigns computation threads to the cores, GPU accelerators, overlaps communications and computations and uses a dynamic, fully-distributed scheduler based on architectural features such as NUMA nodes and algorithmic features such as data reuse.
Other
48 stars 17 forks source link

Fix the lack of direct GPU to GPU communications in multi-device runs. #642

Closed therault closed 6 months ago

therault commented 7 months ago

In https://github.com/ICLDisco/parsec/pull/570, we moved from using cuda_index to using device_index in the 'nvlink' mask that decides if we can directly communicate to another GPU. However, this bitmask was initialized at query time, before devices get assigned a device_index. As a consequence, the bitmask was wrong and no direct device to device communication was happening.

In this PR, we add a step, after all devices have been registered, to complete this initialization.

An alternative is to come back to using cuda_index-based bitmasks, but then the decision if two GPUs can directly communicate is device-type specific and needs to be moved from device_gpu.c to device_cuda_module.c, which means we add another function call to device_cuda_module.c inside the stage_in() function of device_gpu.c

bosilca commented 7 months ago

Why did we moved from using cuda_index to using the device_index ? We cannot use d2d on devices of different types, and we check for that in parsec_default_gpu_stage_in and parsec_default_gpu_stage_out, so using the device_index in the mask makes little sense. Going back and rebuilding this mask based on cuda_index seems like a simpler and most resilient solution (we could change the device_index without having to rebuild all the info).

therault commented 7 months ago

device_gpu.c doesn't know the cuda_index (or hip_index or level_zero_index), that's the reason why we moved that mask to global device_index based.

I don't think we want to replicate the full parsec_device_gpu_stage_in() function in all device implementations, so if we want to go back to cuda_index based bitmask, we can replace the test

if( gpu_device->peer_access_mask & (1 << candidate_dev->super.device_index) ) (and similar tests in device_gpu.c) by a call to the device-specific implementation, something like if( gpu_device->peer_access(candidate_dev)

I'm fine with that, but I thought the global device_index mask would save these function calls that cannot be inlined in stage_in, which is taking a lot of time already.

On Fri, Mar 8, 2024 at 3:11 PM bosilca @.***> wrote:

Why did we moved from using cuda_index to using the device_index ? We cannot use d2d on devices of different types, and we check for that in parsec_default_gpu_stage_in and parsec_default_gpu_stage_out, so using the device_index in the mask makes little sense. Going back and rebuilding this mask based on cuda_index seems like a simpler and most resilient solution (we could change the device_index without having to rebuild all the info).

— Reply to this email directly, view it on GitHub https://github.com/ICLDisco/parsec/pull/642#issuecomment-1986352426, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABFEZNVU76DEPMU6TOVWMODYXILN5AVCNFSM6AAAAABENLG62WVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSOBWGM2TENBSGY . You are receiving this because you authored the thread.Message ID: @.***>

abouteiller commented 6 months ago

Passes all ctests and works with large size POTRF -g 8 with nvlink active on Leconte.

bosilca commented 6 months ago

I can understand how the need to add debugging messages can be justified as part of this PR, but there are other things (such as changing the coherency state or moving data copy version manipulation code across function calls) that do not fit into the description of this PR.

Also, there are several instances where the indentation of the new code is incorrect, 8 commits (half of them not signed) for a seemingly minor issue.

abouteiller commented 6 months ago

I can understand how the need to add debugging messages can be justified as part of this PR, but there are other things (such as changing the coherency state or moving data copy version manipulation code across function calls) that do not fit into the description of this PR.

Also, there are several instances where the indentation of the new code is incorrect, 8 commits (half of them not signed) for a seemingly minor issue.

The issue is not minor at all. Turning back on D2D management unearthed a whole bunch of bugs that would produce wrong results (in particular in TRSM where we would exercise CPU->GPU1,2,3->CPU data motions). While I agree that the description of the PR does not match the real scope of what is achieved here, the changes are not random and serve the greater case of making D2D work at a basic level.

abouteiller commented 6 months ago

I will be extracting the fix for #641 in a separate PR, @therault that means I will be rewriting history in your branch to achieve this sorry.