fwyzard opened 2 months ago
assign core,heterogeneous
New categories assigned: core,heterogeneous
@Dr15Jones,@fwyzard,@makortel,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks
cms-bot internal usage
A new Issue was created by @fwyzard.
@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
@fwyzard , yes we should be able to implement this via scram b .... How about, for local development (e.g. where the user only wants to test things on the local host), we just add scram build enable-alpaka-native, which on the host would use
- cudaComputeCapabilities to get the actual GPU(s) and only build for those GPU types
- rocmComputeCapabilities to get the actual GPU(s) and only build for those GPU types

We can also add scram b {enable|disable}-alpaka-{rocm|cuda} to explicitly enable/disable the ROCm/CUDA backend build.
If needed, we can discuss this in the core sw meeting tomorrow.
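For illustration, a rough sketch of how the native architectures could be detected on the local host. This is only a sketch, assuming the cudaComputeCapabilities and rocmComputeCapabilities helpers are in the PATH and print the output formats shown later in this thread; it is not the actual scram implementation:

```bash
# Hypothetical sketch (not the actual scram rule): list the GPU architectures
# present on the local host, assuming the CMSSW helpers are available.
# cudaComputeCapabilities prints e.g. "0 8.9 NVIDIA L4",
# rocmComputeCapabilities prints e.g. "0 gfx1100 AMD Radeon Pro W7800".

if command -v cudaComputeCapabilities >/dev/null 2>&1; then
  # "8.9" -> "sm_89"
  cudaComputeCapabilities | awk '{gsub(/\./, "", $2); print "sm_" $2}' | sort -u
fi

if command -v rocmComputeCapabilities >/dev/null 2>&1; then
  rocmComputeCapabilities | awk '{print $2}' | sort -u
fi
```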
Sounds good.
About updating the flags in the cuda.xml and rocm.xml tools:
cuda.xml
The syntax for enabling sm_## is -gencode arch=compute_##,code=[sm_##,compute_##].
So, calling e.g. scram b enable-backend cuda=sm_89 should remove all the CUDA_FLAGS of the form -gencode arch=compute_[0-9]+,code=[sm_[0-9]+,compute_[0-9]+], and add -gencode arch=compute_89,code=[sm_89,compute_89].
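A minimal sketch of that substitution, assuming for illustration that the flags are held in a CUDA_FLAGS shell variable (in practice the change would be applied to the cuda.xml tool file by scram):

```bash
# Hypothetical sketch of what "enable-backend cuda=sm_89" would do to the flags:
# drop every existing -gencode entry and append one for sm_89 only.
CUDA_FLAGS='-O3 -gencode arch=compute_75,code=[sm_75,compute_75] -gencode arch=compute_86,code=[sm_86,compute_86]'

arch=89
CUDA_FLAGS=$(echo "$CUDA_FLAGS" | sed -E 's/-gencode arch=compute_[0-9]+,code=\[sm_[0-9]+,compute_[0-9]+\]//g')
CUDA_FLAGS="$CUDA_FLAGS -gencode arch=compute_${arch},code=[sm_${arch},compute_${arch}]"
echo "$CUDA_FLAGS"
```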
The "native" CUDA architectures used by the NVIDIA GPUs in the local machine can be extracted from cudaComputeCapabilities
:
$ cudaComputeCapabilities
0 8.9 NVIDIA L4
1 7.5 Tesla T4
For example, a machine with only a Tesla T4 should use the architecture sm_75.
Currently there is a script cmsCudaSetup.sh that does part of what scram b enable-backend cuda=native should do.
rocm.xml
The syntax for enabling gfx#### is --offload-arch=gfx####, so scram b enable-backend rocm=gfx1100 should remove all the ROCM_FLAGS of the form --offload-arch=gfx[0-9a-f]+, and add --offload-arch=gfx1100.
Note that the value after gfx can have 3 or 4 hexadecimal digits.
The "native" ROCm architectures used by the AMD GPUs in the local machine can be extracted from rocmComputeCapabilities
:
$ rocmComputeCapabilities
0 gfx1100 AMD Radeon Pro W7800 (unsupported)
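As a small illustration of the corresponding ROCm flag handling, including the 3-or-4-hexadecimal-digit check mentioned above; this is a hypothetical helper, not part of scram:

```bash
# Hypothetical sketch: validate a requested ROCm architecture and print the flag.
# The value could come from e.g. "enable-backend rocm=gfx1100" or from rocmComputeCapabilities.
arch="gfx1100"
if echo "$arch" | grep -Eq '^gfx[0-9a-f]{3,4}$'; then
  echo "--offload-arch=${arch}"
else
  echo "error: '${arch}' does not look like a ROCm gfx architecture" >&2
fi
```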
@fwyzard , thanks for the hints in https://github.com/cms-sw/cmssw/issues/45859#issuecomment-2327024351.
As scram build ... passes everything to gmake as build targets, it is not easy to implement scram build enable-backend cuda: in that case cuda becomes a build target and gmake will try to run it. With scram build enable-backend cuda=sm_89, cuda instead becomes a variable, overriding the value set by the cuda tool. Instead, how about
- scram build {en,dis}able-backend-{cuda,rocm} : to enable/disable the cuda/rocm alpaka backends
- scram build enable-backend-{cuda,rocm}-[comma-separated-compute-capabilities] : e.g. scram build enable-backend-cuda-sm_75 or scram build enable-backend-cuda-sm_75,sm_89 , scram build enable-backend-rocm-gfx1100 or scram build enable-backend-rocm-gfx1100,gfx90a
- scram build enable-backend-cuda-native : to find the native compute capabilities and use those
- scram build enable-backend-cuda-reset : to reset the compute capabilities to their original value (from the release area)
- scram build enable-backend-native : to disable the backend that is not available and call enable-backend-cuda-native for the backend which is available
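A toy illustration of the gmake behaviour described above (not the scram build system): a NAME=VALUE argument is taken as a variable override, while a bare word is taken as a target to build. Makefile.demo is a hypothetical throwaway file:

```bash
# Toy example only: show how GNU make treats extra command-line arguments.
printf 'cuda ?= default\nall:\n\t@echo "building with cuda=$(cuda)"\n' > Makefile.demo

make -f Makefile.demo all cuda=sm_89   # variable override: prints "building with cuda=sm_89"
make -f Makefile.demo all cuda         # "cuda" is treated as a target: "No rule to make target 'cuda'"
```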
I see.
Maybe we could shorten the commands, like
scram build {en,dis}able-{cuda,rocm}
scram build enable-cuda-sm_75
scram build enable-rocm-gfx1100,gfx90a
etc?
And it might be clearer if we split the backend and the individual targets with a ':'
scram build enable-cuda:sm_75
scram build enable-rocm:gfx1100,gfx90a
(I would suggest using '=' but Make would interpret it as setting a variable)
What do you think ?
Sounds good, so I will drop -backend from the target and use ':' for the compute capabilities.
@fwyzard , for now I have enable-alpaka:native to automatically enable/disable the cuda/rocm backends and set the native compute capabilities. Is this a good target name or should I change it to enable-alpaka-native? (enable-native sounds very generic)
Maybe enable-gpus:native ?
But it affects only Alpaka modules, not other modules that may use process.options.accelerators, right? Then enable-alpaka:native may be more correct.
Yes, it only affects the alpaka modules. OK, so I will go with enable-alpaka:native then.
@fwyzard , {en,dis}able-{cuda,rocm} also affect only alpaka; should we change these to {en,dis}able-alpaka:{cuda,rocm} ?
I'm undecided, because then calls like scram b enable-alpaka:cuda:sm_75 start to become complicated. So I'm leaning more towards scram b enable-gpus:native .
Could you implement that, and later today we ask @makortel his opinion ?
As enable-{cuda,rocm}:capabilities only affects cuda/rocm directly, those calls can remain enable-{cuda,rocm}:capability .
What about disable-cuda ?
Currently disable-cuda only disables the alpaka-cuda backend. It does not disable the cuda build rules, so scram will still compile .cu files for non-alpaka packages.
But if we want disable-cuda to disable both the alpaka-cuda backend and also stop building .cu files, then I can do it, but I think for now that will break builds (there are packages which have GPU code dependencies).
OK, let me try to summarise:
- scram b disable-cuda : disables the CUDA alpaka backend; the regular .cu files are still built
- scram b disable-rocm : disables the ROCm alpaka backend; the regular .hip.cc files are still built
- scram b enable-cuda : enables the CUDA alpaka backend; the regular .cu files are still built
- scram b enable-cuda:sm_90 : updates the cuda.xml tool file to support (only) the sm_90 architecture, which also applies to the regular .cu files
- scram b enable-cuda:native : runs cudaComputeCapabilities to determine the architecture of the NVIDIA GPUs in the system and updates the cuda.xml tool file to support (only) these architectures, which also apply to the regular .cu files
- scram b enable-rocm , enable-rocm:gfx1100 , enable-rocm:native : same as above, but for the .hip.cc files and the ROCm alpaka backend
- scram b enable-alpaka:native : enables only the alpaka backends matching the GPUs found in the system; the regular .cu and .hip.cc files are still built

Is it correct ?
Basically, it would never affect whether the regular .cu and .hip.cc files are built (other than which architecture is built), only whether the alpaka backends are built or not.
So I think I would prefer scram b enable-gpus:native :-)
And, once https://github.com/cms-sw/cmssw/issues/45844 is complete, we could revisit this

    "currently disable-cuda only disables the alpaka-cuda backend. It does not disable the cuda build rules so scram will still compile .cu files for non-alpaka packages"

and try to disable the CUDA or ROCm backends completely.
Is it correct ?
Yes, this is correct.
    "So I think I would prefer scram b enable-gpus:native"

OK
https://github.com/cms-sw/cmssw-config/pull/110 should implement these new rules. scram build help in a dev area should show these new build rules.
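For reference, a possible usage sequence in a developer area; the target names follow the discussion above, and the exact spellings are defined by the cms-sw/cmssw-config implementation:

```bash
# Possible usage in a CMSSW developer area (target names as discussed above).
cd "$CMSSW_BASE/src"
scram build help               # list the available build rules
scram b disable-rocm           # skip the ROCm alpaka backend to speed up local builds
scram b enable-cuda:sm_75      # build the CUDA backend only for sm_75 (e.g. a Tesla T4)
scram b enable-gpus:native     # keep only the backends/architectures of the local GPUs
scram b -j 8                   # rebuild with the new settings
```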
I'd find it clearest if the {enable,disable}-{cuda,rocm} and enable-gpus:native would apply equally to the compilation of .cu and .hip.cc files as well. But to be practical, I'm ok with leaving that to the time #45844 becomes complete.
The ROCm (and to some extent CUDA) alpaka backends add a noticeable amount to the time it takes to build some packages.
For users that do not care about running on (AMD) GPUs, we could speed up the compilation process by disabling the ROCm (or CUDA) alpaka backend(s).
Also note that it could be much worse if we manage to add the SYCL/oneAPI backend...
This could be implemented in scram, with a syntax like ... ?
Another way to speed up the compilation would be to target only one actual GPU type, like an NVIDIA T4 or an AMD MI250.
This could be implemented with a syntax like ...
We could also get the hardware type from cudaComputeCapabilities or rocmComputeCapabilities, with a syntax like ...
@smuzaffar do you think this could be implemented in scram ?
If you think so, we can discuss the implementation details here or in person.