dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0
46 stars 107 forks source link

Add support for GPU parameters at ReReco spec level #10388

Closed amaltaro closed 3 years ago

amaltaro commented 3 years ago

Impact of the new feature ReqMgr2

Is your feature request related to a problem? Please describe. It's a new project to support GPU processing in central production workflows, where WM system will be the bridge between Offline/CMSSW and the grid resources made available to CMS (through the glideinWMS layer).

There will be a series of such tickets, such that we can try to break work down in even parts.

Describe the solution you'd like In short, we need to support new workflow spec attributes for the ReReco/DataProcessing spec modules.

Now expanding on the important details:

Just as an example, here is how GPUParams required parameters would look like:

'{"GPUMemory": 8000, "CUDARuntime": "11.2", "CUDACapabilities": ["7.5", "8.0"]}'

and this is an example with the full set of parameters:

'{"GPUMemory": 8000, "CUDARuntime": "11.2", "CUDACapabilities": ["7.5", "8.0"], "GPUName": "Tesla T4", "CUDADriverVersion": "460.32.03", "CUDARuntimeVersion": "11.2.152"}'

Describe alternatives you've considered Not to validate the optional parameters.

Additional context Major discussion happened here: https://github.com/cms-sw/cmssw/pull/33057 and description of these parameters: https://github.com/cms-sw/cmssw/pull/33057#issuecomment-810914270

justinasr commented 3 years ago

Hi, Small technical question about GPUParams string/dict - does ReqMgr2 already expects a string from us or it should be a dict that you can validate parameter by parameter and then stringify it somewhere on your side? Types: all values in GPUParams are strings except for memory which is int and CUDACapabilities is list of strings, right? I am double checking because your example has floats.

mrceyhun commented 3 years ago

As a complement to Alan's comment, I would like to say that CUDACapabilities should be string.

amaltaro commented 3 years ago

Hi @justinasr, it's good to make sure we are talking about the same things. On the McM side, you would provide a string (JSON object) to the GPUParams parameter, so McM could do something like:

In [15]: GPUParams = {"GPUMemory": 8000, "CUDARuntime": "11.2", "CUDACapabilities": ["7.5", "8.0"]}
In [16]: GPUParams = json.dumps(GPUParams)

and inject it into ReqMgr2.

On the ReqMgr2 side, we would have to load that string (JSON object) as a python dictionary, so:

In [18]: gpuParams = json.loads(GPUParams)

and then the whole validation would happen on top of this python dictionary. As we discussed in the past, such approach is less memory hungry than a python dictionary, so we better store it as such in central CouchDB.

Types: all values in GPUParams are strings except for memory which is int and CUDACapabilities is list of strings, right?

That's correct. I have fixed the data structure example in the initial description. I also added an example with the full set of parameters.

As a complement to Alan's comment, I would like to say that CUDACapabilities should be string.

Thanks Ceyhun, I have also fixed the CUDACapability vs CUDACapabilities inconsistency in the initial description.

amaltaro commented 3 years ago

And here is a ticket to address the job creation and submission to HTCondor: https://github.com/dmwm/WMCore/issues/10393

fwyzard commented 3 years ago

@amaltaro are these parameters

GPUMemory
CUDACapabilities
CUDARuntime
GPUName
CUDADriverVersion
CUDARuntimeVersion

the ones that will be published by the nodes, or the ones that will be requested by the jobs ?

amaltaro commented 3 years ago

@fwyzard Andrea, these are the parameters to be published by the jobs, so the actual GPU job requirements. The job matchmaking expression - containing the node parameters - is defined by the Submission Infrastructure team, they might clarify what those are in the other issue #10393

fwyzard commented 3 years ago

Hi Alan - OK, thanks, then the list looks correct, as far as I can say.

By the way, I've posted a brief description of the fields here: https://github.com/cms-sw/cmssw/pull/33057#issuecomment-810914270 .

amaltaro commented 3 years ago

Thanks for providing a description for all those parameters, Andrea. I have linked them at the top description of this GH issue, but it might be better to expand those in a wiki under this project. Thanks again!

amaltaro commented 3 years ago

As an outcome of the GPU discussion we had today, I have just updated the initial description of this PR, mostly concerned with the RequiresGPU parameter with a different set of possible values (and data type different than the initial description).

srimanob commented 3 years ago

PR for runTheMatrix is updated, Master: https://github.com/cms-sw/cmssw/pull/33538 11_3: https://github.com/cms-sw/cmssw/pull/33057

Here is how it looks like to step/task that runs GPU.

'Task3': {'AcquisitionEra': 'CMSSW_11_3_X_2021-04-26-2300',
           'ConfigCacheID': 1042,
           'EventStreams': 0,
           'GPUParams': {'CUDACapabilities': ['7.5'],
                         'CUDADriverVersion': '',
                         'CUDARuntime': '11.2',
                         'CUDARuntimeVersion': '',
                         'GPUMemory': '8000',
                         'GPUName': ''},
           'GlobalTag': u'113X_mcRun3_2021_realistic_v10',
           'InputFromOutputModule': u'FEVTDEBUGHLToutput',
           'InputTask': 'Digi_2021',
           'KeepOutput': True,
           'LumisPerJob': 10,
           'Memory': 3000,
           'Multicore': 1,
           'ProcessingString': u'113X_mcRun3_2021_realistic_v10',
           'RequiresGPU': 'required',
           'SplittingAlgo': 'LumiBased',
           'TaskName': 'Reco_Patatrack_PixelOnlyGPU_2021'},
srimanob commented 3 years ago

In addition, there is a comment on CUDARuntime in https://github.com/cms-sw/cmssw/pull/33057#discussion_r621161574

Since it matches with node's CUDACompatibleRuntimes, should we just change it to be the same? Or it will make the meaning off from what we consider.

fwyzard commented 3 years ago

I don't think we should change it:

amaltaro commented 3 years ago

@srimanob the GPUParams has to be provided as a json object (json.dumps). Please have a look at the discussion Justinas and I had in this issue.

Unfortunately we have not yet started working on this issue on the WMCore side, so we still do not have anything for you to test on. However, if your CMSSW pull requests can wait for a week more (maybe two...), we might be able to get the basic changes in ReqMgr2 such that you could test in testbed.

amaltaro commented 3 years ago

@justinasr @fwyzard @srimanob @mrceyhun Hi everyone, I just wanted to let you know that I'm starting to work on this issue, based on the initial specification in the first comment of this GH issue.

In short, new workflow/spec parameters are going to be:

Please let us know in case you see any inconsistency, or if there are new updates relevant to GPU support.

srimanob commented 3 years ago

Thanks @amaltaro I will re-add them to runTheMatrix.

amaltaro commented 3 years ago

@srimanob Phat, as you can see in the PR providing these features (#10799), the GPUMemory parameter has been renamed to GPUMemoryMB, such that it is very explicitly and clear which unit we are supposed to have there. If you see this change to be too problematic, we can discuss it during the day.

srimanob commented 3 years ago

@amaltaro GPUMemoryMB is fine for me. I've added the PR at https://github.com/cms-sw/cmssw/pull/35263. Could you please have a look? Thanks.