cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

update handling of GPU workflows in runTheMatrix.py #46069

Open fwyzard opened 3 hours ago

fwyzard commented 3 hours ago

runTheMatrix.py has some GPU-related options:

GPU-related options:
  These options are only meaningful when --gpu is used, and is not set to forbidden.

  --gpu [{forbidden,optional,required}], --requires-gpu [{forbidden,optional,required}]
                        Enable GPU workflows. Possible options are "forbidden" (default), "required" (implied if no argument is given), or "optional". (default: forbidden)
  --gpu-memory GPUMEMORYMB
                        Specify the minimum amount of GPU memory required by the job, in MB. (default: 8000)
  --cuda-capabilities CUDACAPABILITIES
                        Specify a comma-separated list of CUDA "compute capabilities", or GPU hardware architectures, that the job can use. (default: 6.0,6.1,6.2,7.0,7.2,7.5,8.0,8.6)
  --cuda-runtime CUDARUNTIME
                        Specify major and minor version of the CUDA runtime used to build the application. (default: 12.4)
  --force-gpu-name GPUNAME
                        Request a specific GPU model, e.g. "Tesla T4" or "NVIDIA GeForce RTX 2080". The default behaviour is to accept any supported GPU. (default: )
  --force-cuda-driver-version CUDADRIVERVERSION
                        Request a specific CUDA driver version, e.g. 470.57.02. The default behaviour is to accept any supported CUDA driver version. (default: )
  --force-cuda-runtime-version CUDARUNTIMEVERSION
                        Request a specific CUDA runtime version, e.g. 11.4. The default behaviour is to accept any supported CUDA runtime version. (default: )

However, they affect only the creation of WMAgent (?) workflows, not the actual content of the workflow generated by cmsDriver.py and executed by cmsRun.
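For readers less familiar with how an option can have both a default and an "implied if no argument is given" value, here is a minimal argparse sketch of the `--gpu` option alone. It is reconstructed only from the help text quoted above. `build_parser` and the `runTheMatrix-sketch` program name are hypothetical, and this is not the actual runTheMatrix.py code.

```python
import argparse


def build_parser(default_gpu="forbidden"):
    """Hypothetical parser that mimics only the --gpu option from the help text."""
    parser = argparse.ArgumentParser(prog="runTheMatrix-sketch")
    parser.add_argument(
        "--gpu", "--requires-gpu",
        dest="gpu",
        choices=["forbidden", "optional", "required"],
        nargs="?",            # the argument to --gpu is itself optional
        const="required",     # value used when --gpu is given without an argument
        default=default_gpu,  # value used when --gpu is not given at all
        help="Enable GPU workflows.",
    )
    return parser


parser = build_parser()
no_flag = parser.parse_args([]).gpu                      # option absent
bare_flag = parser.parse_args(["--gpu"]).gpu             # option without argument
explicit = parser.parse_args(["--gpu", "optional"]).gpu  # explicit value
```

In this sketch, changing the default as proposed below would amount to calling `build_parser(default_gpu="optional")`.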


I would like to propose two changes:

  1. change the default for the --gpu option from forbidden to optional;
  2. propagate the meaning of the --gpu option to cmsDriver, via the --accelerators option.

The first change is IMHO something we should do in its own right, but here it is motivated by minimising the impact of the second change on the cmsDriver workflows.


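The second change proposes to map each value of the --gpu option to a corresponding cmsDriver --accelerators setting. For reference, cmsDriver behaves as follows: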

By default cmsDriver does not impose any restrictions on the usage of GPUs. Passing --accelerators cpu sets the job's process.options.accelerators to [ 'cpu' ], which prevents the use of GPUs in a CUDA or Alpaka workflow. Passing --accelerators gpu-* sets the job's process.options.accelerators to [ 'gpu-*' ], which requires the use of GPUs in a CUDA or Alpaka workflow.
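One way the propagation could look is sketched below. The exact mapping is not spelled out in the text above, so this assumes that `forbidden` maps to `--accelerators cpu`, `required` maps to `--accelerators gpu-*`, and `optional` passes no `--accelerators` option at all, relying on the cmsDriver default. The function name `accelerators_args` is hypothetical.

```python
def accelerators_args(gpu_mode):
    """Return the extra cmsDriver.py arguments for a given --gpu value.

    Assumed mapping (not confirmed by the issue text):
      forbidden -> --accelerators cpu
      optional  -> no extra option (cmsDriver default, no restriction)
      required  -> --accelerators gpu-*
    """
    mapping = {
        "forbidden": ["--accelerators", "cpu"],
        "optional": [],
        "required": ["--accelerators", "gpu-*"],
    }
    if gpu_mode not in mapping:
        raise ValueError(f"unknown --gpu value: {gpu_mode!r}")
    # Return a copy so callers cannot modify the table by accident.
    return list(mapping[gpu_mode])


# Example: extend a placeholder cmsDriver.py command line for each mode.
base_command = ["cmsDriver.py", "step3"]  # placeholder arguments, for illustration only
commands = {mode: base_command + accelerators_args(mode)
            for mode in ("forbidden", "optional", "required")}
```

With a helper of this kind, a single workflow definition could yield the CPU-only, any-backend, and GPU-only variants from the value of `--gpu` alone.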

The advantage of this approach is that we no longer need to triplicate all Alpaka-related workflows: one version to run on any backend, one version to run only on CPU, one version to run only on GPUs.


As this change would affect O&C and PPD operations, what is their opinion?

fwyzard commented 3 hours ago

assign core,pdmv

cmsbuild commented 3 hours ago

New categories assigned: core,pdmv

@Dr15Jones,@AdrianoDee,@sunilUIET,@miquork,@makortel,@smuzaffar,@kskovpen you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild commented 3 hours ago

A new Issue was created by @fwyzard.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

fwyzard commented 3 hours ago

@sextonkennedy @srimanob @AdrianoDee FYI

fwyzard commented 3 hours ago

@vlimant @malbouis FYI

fwyzard commented 3 hours ago

@ckoraka FYI