Closed amaltaro closed 3 years ago
Hi,
Small technical question about GPUParams
string/dict - does ReqMgr2 already expects a string from us or it should be a dict that you can validate parameter by parameter and then stringify it somewhere on your side?
Types: all values in GPUParams
are strings except for memory which is int and CUDACapabilities
is list of strings, right? I am double checking because your example has floats.
As a complement to Alan's comment, I would like to say that CUDACapabilities
should be string.
Hi @justinasr, it's good to make sure we are talking about the same things.
On the McM side, you would provide a string (JSON object) to the GPUParams
parameter, so McM could do something like:
In [15]: GPUParams = {"GPUMemory": 8000, "CUDARuntime": "11.2", "CUDACapabilities": ["7.5", "8.0"]}
In [16]: GPUParams = json.dumps(GPUParams)
and inject it into ReqMgr2.
On the ReqMgr2 side, we would have to load that string (JSON object) as a python dictionary, so:
In [18]: gpuParams = json.loads(GPUParams)
and then the whole validation would happen on top of this python dictionary. As we discussed in the past, such approach is less memory hungry than a python dictionary, so we better store it as such in central CouchDB.
Types: all values in GPUParams are strings except for memory which is int and CUDACapabilities is list of strings, right?
That's correct. I have fixed the data structure example in the initial description. I also added an example with the full set of parameters.
As a complement to Alan's comment, I would like to say that CUDACapabilities should be string.
Thanks Ceyhun, I have also fixed the CUDACapability vs CUDACapabilities inconsistency in the initial description.
And here is a ticket to address the job creation and submission to HTCondor: https://github.com/dmwm/WMCore/issues/10393
@amaltaro are these parameters
GPUMemory
CUDACapabilities
CUDARuntime
GPUName
CUDADriverVersion
CUDARuntimeVersion
the ones that will be published by the nodes, or the ones that will be requested by the jobs ?
@fwyzard Andrea, these are the parameters to be published by the jobs, so the actual GPU job requirements. The job matchmaking expression - containing the node parameters - is defined by the Submission Infrastructure team, they might clarify what those are in the other issue #10393
Hi Alan - OK, thanks, then the list looks correct, as far as I can say.
By the way, I've posted a brief description of the fields here: https://github.com/cms-sw/cmssw/pull/33057#issuecomment-810914270 .
Thanks for providing a description for all those parameters, Andrea. I have linked them at the top description of this GH issue, but it might be better to expand those in a wiki under this project. Thanks again!
As an outcome of the GPU discussion we had today, I have just updated the initial description of this PR, mostly concerned with the RequiresGPU
parameter with a different set of possible values (and data type different than the initial description).
PR for runTheMatrix is updated, Master: https://github.com/cms-sw/cmssw/pull/33538 11_3: https://github.com/cms-sw/cmssw/pull/33057
Here is how it looks like to step/task that runs GPU.
'Task3': {'AcquisitionEra': 'CMSSW_11_3_X_2021-04-26-2300',
'ConfigCacheID': 1042,
'EventStreams': 0,
'GPUParams': {'CUDACapabilities': ['7.5'],
'CUDADriverVersion': '',
'CUDARuntime': '11.2',
'CUDARuntimeVersion': '',
'GPUMemory': '8000',
'GPUName': ''},
'GlobalTag': u'113X_mcRun3_2021_realistic_v10',
'InputFromOutputModule': u'FEVTDEBUGHLToutput',
'InputTask': 'Digi_2021',
'KeepOutput': True,
'LumisPerJob': 10,
'Memory': 3000,
'Multicore': 1,
'ProcessingString': u'113X_mcRun3_2021_realistic_v10',
'RequiresGPU': 'required',
'SplittingAlgo': 'LumiBased',
'TaskName': 'Reco_Patatrack_PixelOnlyGPU_2021'},
In addition, there is a comment on CUDARuntime
in
https://github.com/cms-sw/cmssw/pull/33057#discussion_r621161574
Since it matches with node's CUDACompatibleRuntimes
, should we just change it to be the same? Or it will make the meaning off from what we consider.
I don't think we should change it:
CUDARuntime
is the CUDA runtime version that a job uses;CUDACompatibleRuntimes
is a list of runtimes supported by the node;CUDARuntime
, which is the native version of the CUDA runtime installed on the node.@srimanob the GPUParams
has to be provided as a json object (json.dumps). Please have a look at the discussion Justinas and I had in this issue.
Unfortunately we have not yet started working on this issue on the WMCore side, so we still do not have anything for you to test on. However, if your CMSSW pull requests can wait for a week more (maybe two...), we might be able to get the basic changes in ReqMgr2 such that you could test in testbed.
@justinasr @fwyzard @srimanob @mrceyhun Hi everyone, I just wanted to let you know that I'm starting to work on this issue, based on the initial specification in the first comment of this GH issue.
In short, new workflow/spec parameters are going to be:
RequiresGPU
: with these allowed values forbidden
, or optional
or required
GPUParams
: mandatory if RequiresGPU is optional | required
, otherwise optional (default to empty string)
Please let us know in case you see any inconsistency, or if there are new updates relevant to GPU support.
Thanks @amaltaro I will re-add them to runTheMatrix.
@srimanob Phat, as you can see in the PR providing these features (#10799), the GPUMemory
parameter has been renamed to GPUMemoryMB
, such that it is very explicitly and clear which unit we are supposed to have there. If you see this change to be too problematic, we can discuss it during the day.
@amaltaro GPUMemoryMB is fine for me. I've added the PR at https://github.com/cms-sw/cmssw/pull/35263. Could you please have a look? Thanks.
Impact of the new feature ReqMgr2
Is your feature request related to a problem? Please describe. It's a new project to support GPU processing in central production workflows, where WM system will be the bridge between Offline/CMSSW and the grid resources made available to CMS (through the glideinWMS layer).
There will be a series of such tickets, such that we can try to break work down in even parts.
Describe the solution you'd like In short, we need to support new workflow spec attributes for the ReReco/DataProcessing spec modules.
Now expanding on the important details:
RequiresGPU
, with one of the possible 3 string values:forbidden
: must not use GPUs (default)optional
: can use GPU, if possiblerequired
: must use GPUsGPUParams
, which will contain all the necessary information to match/leverage GPU resources. It's meant to be JSON encoded python dictionary (thus ofstring
type). There are 3 required and 3 optional parameters. The required parameters are:GPUMemory
: integer with the amount of memory, in Megabytes (MB). Validate as> 0
. E.g.: 8000CUDACapabilities
: a list of short strings (could be real though). Validation should ensure at least one item in the list and matching this regex:r"^\d+.\d$"
. E.g.: ["7.5", "8.0"]CUDARuntime
: a short string with the runtime version (could be real though). Validated against this regex:r"^\d+.\d+$"
. E.g.: "11.2" The list of optional parameters are:GPUName
: a string with the GPU name. Validate against<= 100 chars
. E.g. "Tesla T4", "Quadro RTX 6000";CUDADriverVersion
: a string with the CUDA driver version. Validated against this regex:r"^\d+.\d+\d+$"
E.g. "460.32.03"CUDARuntimeVersion
: a string with the CUDA runtime version. Validated against this regex:r"^\d+.\d+\d+$"
E.g. "11.2.152"RequiresGPU
isoptional
orrequired
, then we should enforceGPUParams
to contain the 3 required parameters with correct values;RequiresGPU
isforbidden
, thenGPUParams
should be an empty string (FIXME: not sure we should enforce it(?))Just as an example, here is how
GPUParams
required parameters would look like:and this is an example with the full set of parameters:
Describe alternatives you've considered Not to validate the optional parameters.
Additional context Major discussion happened here: https://github.com/cms-sw/cmssw/pull/33057 and description of these parameters: https://github.com/cms-sw/cmssw/pull/33057#issuecomment-810914270