cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

How to ensure consistency in the configuration of Heterogeneous Producers? #35516

Closed VinInn closed 2 months ago

VinInn commented 2 years ago

Heterogeneous Producers (aka Switch Producers) may be composed of device-dependent producers that share little common code. Besides the missing "Heterogeneous Framework", there can be several reasons for this: 1) the algorithms used on different devices may be very different because of device-specific optimization; 2) in the Switch Producer, the host version is matched with the "SoA" importer.

Some notable examples of the kind of confusion/mess this may lead to are: 1) a recent (aborted) attempt to change the configuration of the Pixel Clusterizer: #35506; 2) the copy-paste required to configure a modifier: https://github.com/mmasciov/cmssw/blob/cbbe7b2f13e0529868a632b44811de0e753e0562/HLTrigger/Configuration/python/customizeHLTforRun3Tracking.py#L29; 3) the inconsistency in "ChannelThreshold" in the Pixel Clusterizer in HLT between CPU and GPU: https://hypernews.cern.ch/HyperNews/CMS/get/pixelOfflineSW/1587/1/2.html

In my opinion, we should find a way to configure Heterogeneous Producers from a single set of values.

cmsbuild commented 2 years ago

A new Issue was created by @VinInn Vincenzo Innocente.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented 2 years ago

assign core, heterogeneous

cmsbuild commented 2 years ago

New categories assigned: heterogeneous,core

@Dr15Jones,@smuzaffar,@fwyzard,@makortel,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

fwyzard commented 2 years ago

Ciao Vincenzo, I see two kinds of problems here:

I've given some thought to solving the first case, but I'm afraid the second case needs to be handled ad hoc, case by case.

I'll add some comments about the first case.

makortel commented 2 years ago

As of today, additional PSets can be used to hold and propagate common parameters. However, with the way HLT is configured in ConfDB (or what I recall of it), a refToPSet_ would need to be used, which would likely imply changes in how the parameters are specified in the relevant modules.
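As a rough illustration of the shared-parameter idea, the sketch below uses plain Python dictionaries as stand-ins for PSets; it does not use the CMSSW configuration API. The helper `make_module`, the plugin names, the `src` labels, and the threshold value are all hypothetical placeholders; only the parameter name "ChannelThreshold" comes from this thread.

```python
import copy

# Hypothetical stand-in for a shared PSet: one place that owns the values
# that must agree between the CPU and GPU flavours of a producer.
# The numeric value is a placeholder, not a real detector setting.
common_clusterizer_params = {
    "ChannelThreshold": 10,
}

def make_module(plugin_type, common, **specific):
    """Build a module description from shared plus device-specific parameters."""
    overlap = set(common) & set(specific)
    if overlap:
        # Refuse silent overrides: that is how CPU and GPU values drift apart.
        raise ValueError(f"shared parameters overridden: {sorted(overlap)}")
    params = copy.deepcopy(common)
    params.update(specific)
    return {"type": plugin_type, "params": params}

# Hypothetical plugin names and input labels.
cpu_module = make_module("FooClusterizerCPU", common_clusterizer_params, src="rawDigis")
gpu_module = make_module("FooClusterizerGPU", common_clusterizer_params, src="rawDigisOnDevice")
```

Both module descriptions take "ChannelThreshold" from the same object, and an attempt to override it per device fails loudly instead of producing two diverging configurations.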

In the longer term, with a portability technology where the behavior of an EDModule is supposed to be exactly the same for each backend, my vision is along these lines (names need to be improved, etc.):

# maps to a "set" of EDProducers along the lines of
# actual module label <-> actual C++ type
# fooDevice@cpu <-> cpu::FooProducerDevice
# fooDevice@cuda <-> cuda::FooProducerDevice
# etc
# PortableInputTag is transformed according to the backend, i.e.
# for @cpu : someInputOnDevice@cpu
# for @cuda : someInputOnDevice@cuda
fooDevice = PortableEDProducer("FooProducerDevice",
   parameter = ...,
   srcFromHost = cms.InputTag("someInput"),
   srcFromDevice = PortableInputTag("someInputOnDevice"),
   ...
)

# Maps to a SwitchProducer where actual device
# cases use a <device>::FooDeviceToSoA EDProducer
# (e.g. cuda::FooDeviceToSoA), and host cases set up an EDAlias
# If FooDeviceToSoA transfers only a subset of the EDProducts produced by fooDevice,
# would need some additional syntax to restrict the EDAlias
fooSoA = PortableToHost("FooDeviceToSoA",
   src = PortableInputTag("fooDevice")
)

In principle, something like that could be devised for the current use of CUDA. Given that this would require some changes in the structure of the CUDA EDProducers and their host counterparts, I'm not sure it would be worth the effort, assuming we could get to deploy a portability technology during next year. On the other hand, reorganizing the CUDA code in that direction would reduce some of the reorganization work (and risks) of the switch to a portability technology.
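The label and type mapping described in the comments of the configuration sketch above could be mimicked in plain Python as follows. This is only a toy model, not CMSSW code: `PortableInputTag` is re-implemented here as a hypothetical dataclass, `expand` is an invented helper, and the backend list is limited to the two backends named in the comments.

```python
from dataclasses import dataclass

BACKENDS = ("cpu", "cuda")  # the two backends named in the comments above

@dataclass(frozen=True)
class PortableInputTag:
    """Hypothetical tag whose label is resolved separately for each backend."""
    label: str

    def resolve(self, backend):
        return f"{self.label}@{backend}"

def expand(module_label, cpp_type, params):
    """Expand one portable description into one concrete module per backend."""
    modules = {}
    for backend in BACKENDS:
        resolved = {
            name: value.resolve(backend) if isinstance(value, PortableInputTag) else value
            for name, value in params.items()
        }
        modules[f"{module_label}@{backend}"] = {
            "type": f"{backend}::{cpp_type}",
            "params": resolved,
        }
    return modules

foo_device = expand(
    "fooDevice",
    "FooProducerDevice",
    {"srcFromHost": "someInput", "srcFromDevice": PortableInputTag("someInputOnDevice")},
)
```

Here `foo_device` contains the entries `fooDevice@cpu` and `fooDevice@cuda`; ordinary input labels are copied unchanged, while portable ones get the backend suffix.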

fwyzard commented 2 years ago

Hi Matti, taking it a step further, we could simplify the configuration even more if we used a single EDProduct to hold multiple optional copies of a SoA-based data structure:

With this approach, we could

What do you think ?

VinInn commented 2 years ago

A sort of portable data type already exists: https://cmssdt.cern.ch/dxr/CMSSW/source/CUDADataFormats/Common/interface/HeterogeneousSoA.h#12. It knows how to copy itself; adding isHost, isDevice, isGPU, or a whoAmI that returns an enum would be trivial... but OK, this is for oversimplified data objects.

VinInn commented 2 years ago

I think that "getting rid of the explicit "from CUDA" and "to CUDA" modules" is something we should move to a higher priority. I agree that if the CPU module is Legacy, the only way to guarantee a single configuration is to properly modify the code to have a single "driver-producer" (so the Legacy algo is, strictly speaking, no longer "Legacy") that can fit in an Andrea-like solution.

fwyzard commented 2 years ago

Yes, that is what I'm considering as a good starting point.

I would like to figure out whether to

And I would like to

makortel commented 2 years ago

Hi Andrea,

we could simplify the configuration even more if we used a single EDProduct to hold multiple optional copies of a SoA-based data structure:

We tried that in the early days and the attempt at that time failed https://indico.cern.ch/event/746161/#18-evolution-of-the-heterogene (slide 4 in particular). I'm not saying it could not be done (I hope we're wiser now anyway), but at least we should not repeat the same mistakes.

  • each "heterogeneous" or "portable" (I do like the name!) variant (cpu, cuda, etc) produces the same type
  • this type has a map (or other similar solution) to keep track of multiple copies of the output data: on the host, and each back-end, and maybe also on each device of a given back-end;

I'm concerned about coupling all device technologies (CUDA, ROCm, etc.) in a single place. If done, such coupling should be very weak. E.g. a downstream consumer should be able to access only the "host memory" product without the package of the consuming module having any dependence on any of the possible device technologies.
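To make the coupling concern concrete, here is a minimal Python sketch of a product that tracks several copies while keeping the host accessor free of any device dependency. The class `PortableProduct` and its methods are hypothetical and only model the idea; memory spaces are plain strings precisely so that the class itself needs no device library.

```python
class PortableProduct:
    """Hypothetical product holding optional copies of the same data per memory space."""

    HOST = "host"

    def __init__(self):
        # Keys are plain strings, so this class depends on no device technology.
        self._copies = {}

    def put(self, memory_space, data):
        self._copies[memory_space] = data

    def has(self, memory_space):
        return memory_space in self._copies

    def host(self):
        """Host-only accessor: a consumer calling this never touches a device backend."""
        if self.HOST not in self._copies:
            raise LookupError("no host copy available")
        return self._copies[self.HOST]

product = PortableProduct()
product.put("cuda", object())  # opaque handle standing in for device memory
product.put(PortableProduct.HOST, [1.0, 2.0, 3.0])
```

A consumer that only calls `product.host()` would compile and run without knowing that a "cuda" copy exists, which is the weak coupling asked for above.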

I'd expect persistence to be another challenge. I'd think the copy in "host memory" is the only part that makes sense to persist (so the product would need some special treatment). Any process should be able to read a persisted product regardless of which devices it has compared to the devices the producing process had.

  • the framework knows that if a module "consumes" this type on a different device, it should schedule an asynchronous copy;
  • either the framework, or a central facility, or the type itself, knows how to copy the data from the host to each back-end, and vice versa; optionally also how to copy across different devices.

This would imply that the core framework would need to gain a concept of "memory space", which at minimum would require an EDModule to declare the "memory space" in its consumes declaration.

A somewhat natural place to know how to transfer the data of a product from one memory space to another would be the product itself (similar to "post-insert action" or "merging of Run products"). If done in that way one should be again very careful with coupling.

The core framework should stay independent of the technology (and product) specific details. A "central facility" would also have to deal with the concern of coupling, and with the question of how this knowledge of "copying a specific product from one memory space to another" would be registered to it.
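One conceivable shape for such a registration, sketched in plain Python under the assumption that copies are keyed by product type, source memory space, and destination memory space. All names here (`register_copier`, `copy_product`, `FooSoA`) are hypothetical; none of this is an existing framework API.

```python
_copiers = {}

def register_copier(product_type, source, destination):
    """Decorator registering how to copy product_type from source to destination."""
    def decorator(func):
        _copiers[(product_type, source, destination)] = func
        return func
    return decorator

def copy_product(product, source, destination):
    """Look up and apply the registered copy; the facility knows no product details."""
    if source == destination:
        return product
    key = (type(product), source, destination)
    if key not in _copiers:
        raise LookupError(f"no copy registered for {key}")
    return _copiers[key](product)

class FooSoA:
    """Hypothetical product type, defined in the product's own package."""
    def __init__(self, values):
        self.values = list(values)

@register_copier(FooSoA, "cuda", "host")
def _foo_cuda_to_host(product):
    # A real implementation would issue an asynchronous device-to-host copy here.
    return FooSoA(product.values)
```

In this arrangement the central facility only holds a lookup table, and the technology-specific code lives next to the product that registers it.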

In principle, all of these are probably doable, and maybe some of them are the direction we should eventually go in the long term, but I'm concerned that extending the functionality of the core framework at this point will take time, and we'd need to be careful to identify the general patterns that would make sense to implement there. So far, I think, the approach of working on top of the core framework has given us the agility to try out different designs at the price of some inconvenience. I'm not sure now would be the right time to abandon that approach.

One implication of the "single heterogeneous product" approach is that each EDModule with multiple versions would need to be declared as a SwitchProducer in the configuration. In my example in https://github.com/cms-sw/cmssw/issues/35516#issuecomment-933477923 this would mean something along the lines of

# Maps to a SwitchProducer where actual device
# cases use a <device>::FooProducerDevice
fooDevice = PortableEDProducer("FooProducerDevice",
   parameter = ...,
   srcFromHost = cms.InputTag("someInput"),
   srcFromDevice = cms.InputTag("someInputOnDevice"),
)

consumerOfFoo = PortableEDProducer("FooConsumerDevice",
   src = cms.InputTag("fooDevice")
)

A further implication that comes to mind is that consuming a product with the label fooDevice@cuda or fooDevice@cpu would no longer guarantee that the whole chain of EDProducers up to fooDevice runs on CUDA or CPU, respectively. E.g. in the example above, a consumer asking for fooDevice@cpu would cause cpu::FooProducerDevice to run regardless of what the SwitchProducer says, but consuming someInputOnDevice would let the SwitchProducer decide which of the versions to run, and the product would then be transferred to CPU if a non-CPU version of the producer was run.

makortel commented 2 years ago

After thinking this a bit more, I think a fundamental question here is "what does it mean if a consumer asks for a product foo on device D?" Does it mean that

  1. a producer to produce foo in D is run (as currently), or
  2. a producer to produce foo in the "best available device" is run, and then foo is transferred from that device to D (if D is not that device)

In a system with 1., transfers between devices need to be declared explicitly in the configuration (in one way or another), and the transferred-to product has a different module label. But the user knows for sure that asking for a product on a specific device runs that specific producer, and that there are no implicit transfers (i.e. if the DAG up to that producer has transfers, they appear explicitly in the DAG). In short, setting up the transfers is inconvenient, but it is very easy to run both CPU and device sub-DAGs in the same process.

In a system with 2., transfers between devices are implicit (ignoring the details of how the transfers are actually done), and the transferred-to products have the same module label. If the user asks for a product on a specific device, then depending on the circumstances either the producer on that device is run, or the producer on another device is run and the product is transferred. In short, the transfers do not need any additional setup, but running sub-DAGs on CPU and device in the same process requires setting up these sub-DAGs explicitly.
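The difference between the two options can be written down as two small resolution functions. This is a toy model in plain Python with invented names (`resolve_explicit`, `resolve_implicit`, and the `preference` ranking of devices are all assumptions); it only lists which steps would be scheduled, and does not model the real framework scheduler.

```python
def resolve_explicit(label, device, producers):
    """Option 1: run exactly the producer configured for the requested device."""
    if device not in producers[label]:
        raise LookupError(f"{label} has no producer for {device}")
    return [("run", label, device)]

def resolve_implicit(label, device, producers, preference):
    """Option 2: run on the best available device, then transfer if needed."""
    best = next((d for d in preference if d in producers[label]), None)
    if best is None:
        raise LookupError(f"{label} has no producer on any preferred device")
    steps = [("run", label, best)]
    if best != device:
        # The transfer is implicit: it was never declared in the configuration.
        steps.append(("transfer", label, best, device))
    return steps

producers = {"foo": {"cpu", "cuda"}}
preference = ["cuda", "cpu"]  # hypothetical ranking of the available devices
```

With these inputs, asking for foo on "cpu" runs the CPU producer under option 1, while under option 2 it runs the CUDA producer and appends a CUDA-to-CPU transfer.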

makortel commented 2 years ago

@Dr15Jones brought up the option of simply using foo = cms.EDProducer("FooProducerDevice", ...) and using the plugin manager to load the EDModule from the intended namespace (there would need to be configuration options to set that globally and per module, ref https://github.com/cms-sw/cmssw/issues/31760). In this option the framework would implicitly transfer the data products from device to host (effectively by calling a user-provided function to do that). We're still at an early stage of figuring out the details and implications, though.
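A minimal sketch of how a global setting plus per-module overrides might select the namespace, in plain Python. The function `resolve_plugin`, the `overrides` dictionary, and the module label "fooValidation" are hypothetical, and the `cpu::`/`cuda::` prefixes simply reuse the naming from the earlier comments; this is not the actual plugin manager logic.

```python
def resolve_plugin(cpp_type, module_label, global_backend, per_module=None):
    """Return the namespaced plugin name for a module.

    per_module maps module labels to a backend and overrides the global choice.
    """
    backend = (per_module or {}).get(module_label, global_backend)
    return f"{backend}::{cpp_type}"

# Hypothetical per-module setting: force one module onto the CPU
# while everything else follows the global choice.
overrides = {"fooValidation": "cpu"}

default_plugin = resolve_plugin("FooProducerDevice", "foo", "cuda", overrides)
pinned_plugin = resolve_plugin("FooProducerDevice", "fooValidation", "cuda", overrides)
```

The configuration itself would mention only "FooProducerDevice"; which namespaced plugin gets loaded is decided at resolution time.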

fwyzard commented 3 months ago

I think this is addressed by the process.foo = cms.EDProducer("FooProducerDevice@alpaka", ...) approach.

fwyzard commented 3 months ago

+heterogeneous

cmsbuild commented 3 months ago

cms-bot internal usage

makortel commented 2 months ago

I think this is addressed by the process.foo = cms.EDProducer("FooProducerDevice@alpaka", ...) approach.

I agree.

makortel commented 2 months ago

+core

makortel commented 2 months ago

@cmsbuild, please close

cmsbuild commented 2 months ago

This issue is fully signed and ready to be closed.