cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

Automatic switching between CPU and GPU for DQMEDAnalyzer #35879

Open sroychow opened 3 years ago

sroychow commented 3 years ago

From the Tracker DQM side, we are developing DQM modules to monitor HLT products (e.g. pixel tracks and vertices in SoA format), which can be produced either on a GPU or on a CPU. Right now in our tests, the same module is modified with the gpu modifier to use the correct product in a GPU workflow. An example:

monitorpixelTrackSoA = DQMEDAnalyzer('SiPixelPhase1MonitorTrackSoA',
                                        pixelTrackSrc = cms.InputTag("pixelTracksSoA@cpu"),
                                        TopFolderName = cms.string("SiPixelHeterogeneous/PixelTrackSoA"),
)

gpu.toModify(monitorpixelTrackSoA,
    pixelTrackSrc = cms.InputTag("pixelTracksSoA@cuda")
)

Given that we want to run this at HLT, I wanted to understand whether we can have a SwitchProducer mechanism for DQMEDAnalyzer, so that we can do something like this:

monitorpixelTrackSoA = SwitchProducerCUDA(
    cpu = DQMEDAnalyzer(.....,
        pixelTrackSrc = cms.InputTag("pixelTracksSoA@cpu"),
    ),
    cuda = DQMEDAnalyzer(.....,
        pixelTrackSrc = cms.InputTag("pixelTracksSoA@cuda"),
    ),
)

Can Framework experts give some guidance on this? @arossi83 @mmusich @tsusa @connorpa

cmsbuild commented 3 years ago

A new Issue was created by @sroychow Suvankar Roy Chowdhury.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

mmusich commented 3 years ago

assign dqm, heterogeneous

cmsbuild commented 3 years ago

New categories assigned: heterogeneous,dqm

@jfernan2,@ahmad3213,@rvenditti,@fwyzard,@emanueleusai,@makortel,@pbo0,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

fwyzard commented 3 years ago

@mmusich currently I don't think that is possible.

The reason is that the SwitchProducer does not run anything by itself: it simply "aliases" one of its branches to its name; the "alias" is then run when its products are requested by some other module.

Since an EDAnalyzer does not produce anything, there would be nothing triggering its execution.

You should be able to achieve the same effect simply with

monitorpixelTrackSoA = DQMEDAnalyzer('SiPixelPhase1MonitorTrackSoA',
                                        pixelTrackSrc = cms.InputTag("pixelTracksSoA"),
                                        TopFolderName = cms.string("SiPixelHeterogeneous/PixelTrackSoA"),
)

The pixelTracksSoA SwitchProducer will pick pixelTracksSoA@cpu or pixelTracksSoA@cuda automatically.

However, with this approach how do you disentangle what has been running on the CPU and on the GPU ?

mmusich commented 3 years ago

@fwyzard aren't all DQM modules EDProducers: https://github.com/cms-sw/cmssw/blob/master/DQMServices/Core/README.md ?

fwyzard commented 3 years ago

Ah, good point, I'm still stuck on the pre-transition approach based on EDAnalyzers.

mmusich commented 3 years ago

However, with this approach how do you disentangle what has been running on the CPU and on the GPU ?

It shouldn't, but perhaps we still haven't got the gist of what is requested. I thought we would be submitting relvals with and without the gpu modifier and comparing the products, to validate that the CPU SoA-based reco gives the same results as the GPU one.

fwyzard commented 3 years ago

Then this

monitorpixelTrackSoA = SwitchProducerCUDA(
    cpu = DQMEDAnalyzer('SiPixelPhase1MonitorTrackSoA',
                                        pixelTrackSrc = cms.InputTag("pixelTracksSoA@cpu"),
                                        TopFolderName = cms.string("SiPixelHeterogeneous/PixelTrackSoA")
    ),
    cuda = DQMEDAnalyzer('SiPixelPhase1MonitorTrackSoA',
                                        pixelTrackSrc = cms.InputTag("pixelTracksSoA@cuda"),
                                        TopFolderName = cms.string("SiPixelHeterogeneous/PixelTrackSoA")
    )
)

should work in principle.

Still

fwyzard commented 3 years ago

It shouldn't but perhaps we still haven't got the gist of what is requested.

Sorry, you wrote

Given that we want to run this at HLT

so I assumed you meant inside the HLT running online.

mmusich commented 3 years ago

Sorry, you wrote

first, I didn't write it, I am not the author of the issue :)

so I assumed you meant inside the HLT running online.

No, we're trying to address requests from PPD/ TSG about CPU/GPU validation.

fwyzard commented 3 years ago

Sorry, you wrote

first, I didn't write it, I am not the author of the issue :)

Whoops, sorry ...

so I assumed you meant inside the HLT running online.

No, we're trying to address requests from PPD/ TSG about CPU/GPU validation.

Ah, OK. I'm afraid I don't know what the request is; I'll have to understand that first.

makortel commented 3 years ago

No, we're trying to address requests from PPD/ TSG about CPU/GPU validation.

Ah, OK. I'm afraid I don't know what the request is; I'll have to understand that first.

I, too, think that the details of what exactly is being requested would be crucial to figuring out the best course of action. At leading order,

  • why would you prefer this to just reading the pixelTracksSoA collection ?

just reading the pixelTracksSoA should be the way to go. That is an EDAlias to either pixelTracksSoA@cpu or pixelTracksSoA@cuda, depending on which one was triggered to be run by the SwitchProducerCUDA.
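
For reference, a minimal sketch of how such a SwitchProducerCUDA is typically declared on the producer side (the module type and the cuda-side labels below are placeholders for illustration, not the actual CMSSW configuration):

import FWCore.ParameterSet.Config as cms
from HeterogeneousCore.CUDACore.SwitchProducerCUDA import SwitchProducerCUDA

pixelTracksSoA = SwitchProducerCUDA(
    # case chosen when no GPU is available: an ordinary EDProducer running on the host
    cpu = cms.EDProducer('PixelTracksSoAProducerCPU'),   # placeholder module type
    # case chosen when a GPU is available: an EDAlias to the product copied back from the device
    cuda = cms.EDAlias(pixelTracksSoAFromCUDA = cms.VPSet(cms.PSet(type = cms.string('*'))))
)

Consumers then simply use cms.InputTag("pixelTracksSoA") and transparently get whichever case the framework selected for the job.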

VinInn commented 3 years ago

SiPixelPhase1MonitorTrackSoA will run on the host anyhow, so it shall just consume pixelTracksSoA (or hltPixelTracksSoA) and will get whatever was produced for that event. (And given that this is done in the same process, and that the decision to run on GPU or CPU is at the moment taken at process level, it should not be difficult to flag it as cpu or gpu.)

VinInn commented 3 years ago

I suspect that this should work

monitorpixelTrackSoA = SwitchProducerCUDA(
    cpu = DQMEDAnalyzer('SiPixelPhase1MonitorTrackSoA',
                                        pixelTrackSrc = cms.InputTag("pixelTracksSoA"),
                                        TopFolderName = cms.string("SiPixelHeterogeneousOnCPU/PixelTrackSoA")
    ),
    cuda = DQMEDAnalyzer('SiPixelPhase1MonitorTrackSoA',
                                        pixelTrackSrc = cms.InputTag("pixelTracksSoA"),
                                        TopFolderName = cms.string("SiPixelHeterogeneousOnGPU/PixelTrackSoA")
    )
)

cuda does not mean: run on cuda

thomreis commented 2 years ago

In ECAL we have been looking at similar things recently and tried the SwitchProducerCUDA as well to change DQM configurations. What we had issues with was when a downstream config file wanted to change a parameter regardless of whether the cpu or cuda branch is used. For example, suppose some config file modifies pixelTrackSrc to a new input tag: in that case monitorpixelTrackSoA.pixelTrackSrc = cms.InputTag("otherPixelTracksSoA") does not work, since monitorpixelTrackSoA is a SwitchProducerCUDA and not the DQMEDAnalyzer('SiPixelPhase1MonitorTrackSoA') that the downstream configuration might expect it to be. Is there a way to achieve this without explicitly pointing to the cpu or cuda cases?

sroychow commented 2 years ago

@thomreis I am not sure I understand your issue very well. Commenting from my experience: for modules aimed at monitoring collections, you should use the same tag for a GPU and a CPU workflow (e.g. pixelTracksSoA); please have a look at the comment above from Matti. For the modules which will do comparisons between collections produced on the GPU or the CPU, you should ask for two collections (e.g. pixelTracksSoA@cpu and pixelTracksSoA@cuda). The workflow should be set up in such a way that both collections are available.

thomreis commented 2 years ago

Hi @sroychow, that was in fact what I meant when I said I want to change a parameter regardless of whether the CPU or GPU case of the SwitchProducer is used. In my example the SwitchProducer would be defined in some cff file, with the InputTag for both the cpu and cuda cases being pixelTracksSoA. That cff is then loaded in a cfg file, where the InputTag for monitorpixelTrackSoA is changed to a different one (e.g. otherPixelTracksSoA). The change in the cfg file should happen for whichever of the two cases in the SwitchProducer is used, without explicitly naming one or both.

makortel commented 2 years ago

Is there a way to achieve this without explicitly pointing to the cpu or cuda cases?

Any later customizations of SwitchProducers must use the cases explicitly. But you can make a loop along the lines of

s = SwitchProducerCUDA(cpu=..., cuda=...)
for case in s.parameterNames_():
    getattr(s, case).parameter = "new value"
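
Applied to the example of this thread, such a loop could look like the following (a sketch only; it assumes monitorpixelTrackSoA is the SwitchProducerCUDA discussed earlier, and otherPixelTracksSoA is just an illustrative label):

import FWCore.ParameterSet.Config as cms

# Retarget the input collection in every case ("cpu", "cuda") of the switch,
# without naming the individual cases in the downstream configuration.
for case in monitorpixelTrackSoA.parameterNames_():
    getattr(monitorpixelTrackSoA, case).pixelTrackSrc = cms.InputTag("otherPixelTracksSoA")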

makortel commented 2 years ago

I'd actually like to understand better the use cases for different code/configuration for CPU and CUDA in DQM. The SwitchProducer-based infrastructure assumes that the downstream consumers do not care where exactly their input data products were produced. For example, if the producer and consumer were in different jobs, this SwitchProducer-based approach would not work in general.

Could you tell us more about what exactly you intend to be different for the CPU- vs GPU-produced input data? (e.g. https://github.com/cms-sw/cmssw/issues/35879#issuecomment-954847378 suggests different folders for the histograms)

jfernan2 commented 2 years ago

Could you tell us more about what exactly you intend to be different for the CPU- vs GPU-produced input data? (e.g. #35879 (comment) suggests different folders for the histograms)

@makortel if my understanding is correct, the different folders are for the output histograms in the DQM root file coming from the DQM modules, in order to distinguish the monitoring of the CPU vs GPU collections. But nothing different is expected from the input collections.

thomreis commented 2 years ago

For ECAL we want to make event-by-event CPU vs. GPU comparison plots. That requires both input collections, but that part of the DQM module should only run on GPU machines (and only on a subset of events, obviously, because otherwise there would be no point in reconstructing on GPUs in the first place). So the current idea is that on a CPU-only machine the default ECAL DQM would run, and on a GPU machine the default ECAL DQM plus the CPU vs. GPU comparison task. The ECAL DQM uses workers to do the different tasks, and for the GPU comparison an additional worker would be added to the list of workers to be run (https://github.com/cms-sw/cmssw/pull/35946/files#diff-a56670e09d76281c92bd7bd09a0316c5db75f7e23ea20c7a67b1ddabb2bd4dd8R33). So in the end, on a GPU machine we would need to modify the configuration of the ecalMonitorTask module.

One thing we tried was a cff file with the following but with that we had the issue I described earlier:

import FWCore.ParameterSet.Config as cms                                                                              

from HeterogeneousCore.CUDACore.SwitchProducerCUDA import SwitchProducerCUDA                                          
from DQM.EcalMonitorTasks.EcalMonitorTask_cfi import ecalMonitorTask as _ecalMonitorTask                              
ecalMonitorTask = SwitchProducerCUDA(                                                                                 
    cpu = _ecalMonitorTask.clone()                                                                                    
)

# Customization to run the CPU vs GPU comparison task if the job runs on a GPU enabled machine                        
from Configuration.ProcessModifiers.gpu_cff import gpu                                                                
from DQM.EcalMonitorTasks.GpuTask_cfi import ecalGpuTask                                                              

gpu.toModify(ecalMonitorTask,                                                                                         
    cuda = _ecalMonitorTask.clone(workerParameters = dict(GpuTask = ecalGpuTask)))                                    
gpu.toModify(ecalMonitorTask.cuda.workers, func = lambda workers: workers.append("GpuTask"))

jfernan2 commented 2 years ago

I would also like to point out this nice talk by A. Bocci about GPU in DQM listing all the possibilities: https://indico.cern.ch/event/975162/contributions/4106441/

fwyzard commented 2 years ago

disclaimer n. 1: I've typed all this directly on GitHub without testing any of it - hopefully I didn't make many mistakes, but don't expect this to be 100% error-free

disclaimer n. 2: names are not my forte; if you find better names for what I suggest, please, go ahead with them !


Background

I think that the complexity here comes from the fact that we want to have a single workflow configuration that does different things (two different sets of validation plots) depending on whether a GPU is available or not:

IMHO this is not something that should be handled "automatically" by the presence or absence of a GPU, but at the level of the definition of the workflow. So, we should have two workflows

Then we could (try to) run each workflow on a different machine:

Then

(*) depending on what we think the behaviour of the workflow should be

The bottom line is, I would not try to find a technical solution for this problem, because it should have a different definition altogether. So I would suggest to disentangle the two things: running on GPU, and doing the GPU-vs-CPU validation.

The current behaviour in the cmsDriver workflows is

So, running without the gpu modifier and running with the gpu modifier on a machine without GPUs should run the exact same modules and configuration. I would suggest to keep things like that.

Then we can add a second modifier (e.g. gpu_validation) to ask the DQM modules to read both CPU-built and GPU-built collections explicitly, and make any relevant comparisons.
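
For illustration, such a modifier would be declared like any other process Modifier (a sketch; the gpu_validation name follows the suggestion above, and where exactly it would live in the release is left open):

import FWCore.ParameterSet.Config as cms

# The modifier is inert unless a workflow activates it, e.g. by passing it to
# cms.Process("DQM", gpu_validation); only then do the gpu_validation.toModify()
# customisations in the examples below take effect.
gpu_validation = cms.Modifier()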

Let me try to give some made-up examples...

Make DQM plots of some reconstructed quantities

Let's say the original configuration was

monitorStuff = DQMEDAnalyzer('MonitorStuff',
    src = cms.InputTag('someStuff'),
    folder = cms.string('DQM/Folder/Stuff')
)

stuffValidationTask = cms.Task(monitorStuff)

Once GPUs are involved, we have three options

  1. we want the results of the CPU or GPU reconstruction to be monitored in the same folder, so they can be compared across different workflows; if we merge the results of jobs running on CPU and GPU we will have a single folder, and the monitoring will not tell us anything about GPU-vs-CPU;
  2. we want the results of the CPU or GPU reconstruction to be monitored in different folders, so we know immediately what we are looking at; if we merge the results of jobs running on CPU and GPU, we will have two separate folders;
  3. like 2., but we want to monitor both the CPU and the GPU products in a single job.

Assuming that someStuff is the result of a SwitchProducer, 1. is easy: we don't have to do anything, just use the monitorStuff module as is.

To achieve 2. we have two options. If someStuff is the result of a SwitchProducer, this should already do the right thing:

monitorStuff = SwitchProducerCUDA(
    cpu = DQMEDAnalyzer('MonitorStuff',
        src = cms.InputTag('someStuff'),
        folder = cms.string('DQM/Folder/Stuff')
    ),
    cuda = DQMEDAnalyzer('MonitorStuff',
        src = cms.InputTag('someStuff'),
        folder = cms.string('DQM/Folder/StuffOnGPU')
    )
)

It would be great if somebody could actually test it and let us know if it works :-)

If the collections being monitored are not from a SwitchProducer, or if we just want to make things more explicit, different InputTags can be used:

monitorStuff = SwitchProducerCUDA(
    cpu = DQMEDAnalyzer('MonitorStuff',
        src = cms.InputTag('someStuffOnCPU'),  # or 'someStuff@cpu'
        folder = cms.string('DQM/Folder/Stuff')
    ),
    cuda = DQMEDAnalyzer('MonitorStuff',
        src = cms.InputTag('someStuffOnGPU'),  # or 'someStuff@cuda'
        folder = cms.string('DQM/Folder/StuffOnGPU')
    )
)

Finally, 3. is just a variation of the last option:

monitorStuff = DQMEDAnalyzer('MonitorStuff',
    src = cms.InputTag('someStuff@cpu'),
    folder = cms.string('DQM/Folder/Stuff')
)
monitorStuffOnGPU = DQMEDAnalyzer('MonitorStuff',
    src = cms.InputTag('someStuff@cuda'),
    folder = cms.string('DQM/Folder/StuffOnGPU')
)

The configuration for 2. (either option, though the first one is simpler) or 3. can be generated from the configuration of 1. with an appropriate modifier.

For 2. (first):

_monitorStuff = monitorStuff.clone()

monitorStuff = SwitchProducerCUDA(
    cpu = _monitorStuff.clone()
)

gpu_validation.toModify(monitorStuff, 
    cuda = _monitorStuff.clone(
        src = 'someStuff',
        folder = 'DQM/Folder/StuffOnGPU'
    )
)

While for 3. a new module needs to be added to a Task or Sequence:

gpu_validation.toModify(monitorStuff, 
    src = 'someStuff@cpu'
)

monitorStuffOnGPU = monitorStuff.clone(
    src = 'someStuff@cuda',
    folder = 'DQM/Folder/StuffOnGPU'
)

_stuffValidationTask_gpu = stuffValidationTask.copy()
_stuffValidationTask_gpu.add(monitorStuffOnGPU)
gpu_validation.toReplaceWith(stuffValidationTask, _stuffValidationTask_gpu)

Make DQM plots of GPU-vs-CPU reconstructed quantities

If we have a single DQM module that can do both the traditional validation and the GPU-vs-CPU comparison, we have a few options.

The configuration for performing only the traditional validation could be:

monitorAndCompareStuff = DQMEDAnalyzer("MonitorAndCompareStuff",
    reference = cms.InputTag('someStuff'),
    target = cms.InputTag('')  # leave empty not to do any comparison
)

As in the previous example, if someStuff is the result of a SwitchProducer, this will validate the CPU or the GPU version of the reconstruction and put the results in a single folder. To use different folders we can adapt the previous solutions.

The configuration for performing the traditional validation and the GPU-vs-CPU comparison could be

monitorAndCompareStuff = DQMEDAnalyzer("MonitorAndCompareStuff",
    reference = cms.InputTag('someStuff@cpu'),
    target = cms.InputTag('someStuff@cuda')
)

Whether the target collection is used only for the comparison, or also for (a subset of) the traditional validation, is up to the DQM module itself. The same applies to the folder being used for the plots; for example, additional folders could be configured via python, or they could be hardcoded in C++, etc.

Also in this case, the second configuration could be generated starting from the first by an appropriate modifier:

gpu_validation.toModify(monitorAndCompareStuff,
    reference = 'someStuff@cpu',
    target = 'someStuff@cuda'
)

thomreis commented 2 years ago

I have tried to implement something along the lines of Andrea's comments (https://github.com/cms-sw/cmssw/compare/master...thomreis:ecal-dqm-addGpuTask?expand=1), based on the GPU vs. CPU comparison code from @alejands in PR #35946, but the matrix tests mostly fail with an exception (running on lxplus without a GPU):

----- Begin Fatal Exception 25-Nov-2021 20:39:43 CET-----------------------
An exception of category 'UnimplementedFeature' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing module: class=EcalDQMonitorTask label='ecalMonitorTask@cpu'
Exception Message:
SwitchProducer does not support non-event branches. Got Run for SwitchProducer with label ecalMonitorTask whose chosen case is ecalMonitorTask@cpu.
----- End Fatal Exception -------------------------------------------------

I am not quite sure what that means, and whether there would be a way to change EcalDQMonitorTask to be compliant (if it is actually EcalDQMonitorTask that causes this).

Note that if the gpu_validation modifier is given only on GPU machines, what we want could probably be done without using a SwitchProducerCUDA.

fwyzard commented 2 years ago

Hi Thomas, thanks for the test.

Can I ask what behaviour you are trying to achieve ?

Anyway, it looks like we (currently) cannot use the SwitchProducer for a DQMEDAnalyzer.

Matti, is this something you think should be added ? Or do we look for a different solution ?

thomreis commented 2 years ago

If the gpu-validation modifier is not given, only the normal ECAL DQM tasks should run. If gpu-validation is given and there is no GPU then also run the normal ECAL DQM tasks. If gpu-validation is given and there is a GPU, then run the normal ECAL DQM tasks and also the ECAL GPU vs. CPU comparison. Since the comparison consumes the @cpu and the @cuda InputTags, it should force the framework to execute both the CPU and the GPU algorithms.

Of course there is in principle no need to give the gpu-validation modifier if there is no GPU for which the results need to be validated, so as mentioned before I think the SwitchProducer is not really needed in this case if gpu-validation is only given on GPU machines (manually or by some other mechanism of the DQM deployment).

Could you elaborate a bit more what the DQMEDAnalyzer does that prevents its use within a SwitchProducer?

fwyzard commented 2 years ago

If gpu-validation is given and there is no GPU then also run the normal ECAL DQM tasks.

IMHO this should actually crash: you are explicitly asking to run something on GPU when one is not present.

Could you elaborate a bit more what the DQMEDAnalyzer does that prevents its use within a SwitchProducer?

It looks like the SwitchProducer works for event data products, and not for lumisection or run data products. It looks like the DQMEDAnalyzer is an EDProducer of the latter type, producing only lumisection or run data products.

In principle - with the current approach, where the "branch" chosen by the SwitchProducer is decided once and for all at the beginning of the job - it should be possible to make the SwitchProducer work also for lumisection and run data products. However, if in the future we plan to make it possible to switch event-by-event, this would probably break.

thomreis commented 2 years ago

I see. So the edm::Transition::EndRun for the produce() is a problem in this case. https://github.com/cms-sw/cmssw/blob/master/DQMServices/Core/interface/DQMEDAnalyzer.h#L57

IMHO this should actually crash: you are explicitly asking to run something on GPU when one is not present.

If gpu-validation is only given on GPU machines then I would actually drop the SwitchProducer and just use toModify to add the additional task to the module configuration. So I guess in that case it would crash if the modifier is given when there is no GPU.
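
A sketch of that simpler approach, reusing the module names and the toModify pattern from the earlier ECAL snippet (whether to hang it off the existing gpu modifier or a dedicated gpu_validation modifier is left open):

from Configuration.ProcessModifiers.gpu_cff import gpu   # or a dedicated gpu_validation modifier
from DQM.EcalMonitorTasks.EcalMonitorTask_cfi import ecalMonitorTask
from DQM.EcalMonitorTasks.GpuTask_cfi import ecalGpuTask

# ecalMonitorTask stays a plain DQM module; the GPU-vs-CPU comparison worker is
# only attached when the modifier is active (i.e. on GPU machines).
gpu.toModify(ecalMonitorTask,
    workerParameters = dict(GpuTask = ecalGpuTask))
gpu.toModify(ecalMonitorTask.workers,
    func = lambda workers: workers.append("GpuTask"))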

fwyzard commented 2 years ago

I would actually drop the SwitchProducer and just use toModify to add the additional task to the module configuration. So I guess in that case it would crash if the modifier is given when there is no GPU.

👍

makortel commented 2 years ago

So I would suggest to disentangle the two things: running on GPU, and doing the GPU-vs-CPU validation.

I fully agree.

Matti, is this something you think should be added ? Or do we look for a different solution ?

It seems to me that all the use cases presented so far are really about knowing whether the data product was produced on the CPU or the GPU. The SwitchProducer feels like a suboptimal solution for that (e.g. it won't work in general across jobs). So I would think of a different solution (possibly based on provenance).

rappoccio commented 2 years ago

Hi, @makortel I'm not sure what the action plan is to get this fixed, should we have a discussion or is it not necessary?

makortel commented 2 years ago

I understood @thomreis found a different solution for his use case ("Make DQM plots of GPU-vs-CPU reconstructed quantities" in Andrea's https://github.com/cms-sw/cmssw/issues/35879#issuecomment-979200772).

The exact use case of the issue description (Andrea's option 1, "same folder for CPU and GPU quantities", in https://github.com/cms-sw/cmssw/issues/35879#issuecomment-979200772) works out of the box.

For the use case of Vincenzo in https://github.com/cms-sw/cmssw/issues/35879#issuecomment-954847378 (Andrea's option 2, "different folder for CPU and GPU quantities, fill only one of those in a job", in https://github.com/cms-sw/cmssw/issues/35879#issuecomment-979200772), we are going to implement something like Provenance telling if a data product was produced on a CPU or a GPU (actual implementation will likely be different, but I hope this gives the idea).

Andrea's option 3, "different folder for CPU and GPU quantities, fill both in a job", in https://github.com/cms-sw/cmssw/issues/35879#issuecomment-979200772 would be best implemented with a specific Modifier (as Andrea wrote).

makortel commented 2 years ago

For the use case of Vincenzo in #35879 (comment) (Andrea's option 2, "different folder for CPU and GPU quantities, fill only one of those in a job", in #35879 (comment)), we are going to implement something like Provenance telling if a data product was produced on a CPU or a GPU (actual implementation will likely be different, but I hope this gives the idea).

Just to add that this use case can technically be implemented already today by using the information stored in the event provenance. For example, for an event product pixelTracksSoA "produced" by SwitchProducerCUDA, the provenance shows that the parent of the product has a label of either pixelTracksSoA@cpu or pixelTracksSoA@cuda, which could be used to distinguish where it was produced. However, this works only if the cpu/cuda case in the SwitchProducer is an EDProducer; if it is an EDAlias, the parent of the pixelTracksSoA product points to the actual EDProducer that produced the aliased-for product, which, in general, can have any module label. The same is true if one wants to inspect any further ancestor of the pixelTracksSoA product: one basically has to know the EDProducer C++ types to know what happened.

While it can be done, this model doesn't scale well for many uses or an evolving configuration. Therefore we're planning to introduce a simpler record at process level, along the lines of "whether GPU offloading was enabled or not". Some more details are in #30044 (where any feedback on that approach would be welcome).