Introduces another dependent CMake option, DLAF_WITH_MPI_GPU_SUPPORT_FORCE_CONTIGUOUS, which I propose to default to ON when DLAF_WITH_MPI_GPU_SUPPORT is enabled, given that forcing contiguous buffers seems to always perform better than not forcing them (small improvement with HIP, huge improvement with CUDA). Long-term we can consider whether it should just always be used without a CMake option, but at this early stage, when we're still figuring out what works where, I think it's nice to have an option for it.
Independent of the option, we could also consider making both DLAF_WITH_MPI_GPU_SUPPORT and DLAF_WITH_MPI_GPU_SUPPORT_FORCE_CONTIGUOUS runtime options, but that would require larger changes and is a separate discussion.
In this PR I'm only forcing GPU buffers to be contiguous if the option is enabled. CPU buffers are left as before.
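For readers less familiar with the term, here is a minimal CPU-only sketch of what "forcing a contiguous buffer" means for a column-major tile. It is not the code in this PR: `TileView`, `is_contiguous` and `pack_contiguous` are hypothetical names made up for illustration, and the real change operates on GPU tiles through withTemporaryTile. The sketch assumes a tile is non-contiguous when its leading dimension is larger than its number of rows, in which case it is copied into a temporary buffer whose leading dimension equals the number of rows.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical read-only view of a column-major tile (not a DLA-Future type).
struct TileView {
  const double* data;  // pointer to element (0, 0)
  std::size_t rows;    // number of rows in the tile
  std::size_t cols;    // number of columns in the tile
  std::size_t ld;      // leading dimension: distance between columns, ld >= rows
};

// Columns follow each other without gaps only if ld == rows
// (a tile with at most one column is trivially contiguous).
bool is_contiguous(const TileView& t) {
  return t.ld == t.rows || t.cols <= 1;
}

// Copy the tile column by column into a temporary buffer with ld == rows,
// so that it can be handed to the communication layer as one flat buffer.
std::vector<double> pack_contiguous(const TileView& t) {
  assert(t.ld >= t.rows);
  std::vector<double> packed(t.rows * t.cols);
  for (std::size_t j = 0; j < t.cols; ++j) {
    for (std::size_t i = 0; i < t.rows; ++i) {
      packed[j * t.rows + i] = t.data[j * t.ld + i];
    }
  }
  return packed;
}
```

With a strided tile, communication would otherwise need a strided datatype or one message per column; the packed copy trades one extra copy for a single flat send or receive.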
I've currently forced the use of contiguous buffers at the call site of withTemporaryTile. This could also be done inside withTemporaryTile, but at least in theory withTemporaryTile is not meant only for communication, so I don't know if that makes sense. Communication is, however, the only use case for it right now. I'm open to thoughts on which you think is cleaner.
Based on #1088.