[alpakatest][RFC] Prototype evolution of EDModule API

makortel commented 2 years ago

This PR prototypes the Alpaka EDModule API, taking inspiration from https://github.com/cms-patatrack/pixeltrack-standalone/pull/224 and https://github.com/cms-patatrack/pixeltrack-standalone/pull/256. A major tested idea was to see how far the system could be implemented with just forward-declared Alpaka device, queue, and event types in order to minimize the set of source files that need to be compiled with the device compiler (I first crafted this prototype before the ALPAKA_HOST_ONLY macro).

The first commit extends the build rules by adding a new category of source files that need to be compiled for each Alpaka backend, but can be compiled with the host compiler. This functionality might be beneficial also on wider scope than this PR alone (so I could open a separate PR with only it). Here I took the approach of using a new file extension, .acc ("a" for e.g. "accelerated"), for the files that need to be compiled with the device compiler. The .cc files can be compiled with the host compiler. I'm not advocating for this particular choice as I'm not very fond of it, but I needed something to get on with the prototype.

I don't think we should apply this PR as is, but identify the constructs that would be useful, and pick those (and improve the rest).

One idea here was to hide the cms::alpakatools::Product<T> from users (having to explicitly interact with the ScopedContext to get the T is annoying). In addition, for CPU serial backend (synchronous, operates in regular host memory) the Product<T> wrapper is not used (because it is not really needed). In this way the downstream code could use the data products from Serial backend directly. For developers the setup would look like

The data products in the memory space of the current backend are consumed with edm::EDGetTokenT<T> and produced with edm::EDPutTokenT<T> (i.e. they look like normal products)
The data products from the "non-portable memory space" are consumed with edm::EDGetTokenT<edm::Host<T>> and produced to with edm::EDPutTokenT<edm::Host<T>>
- The edm::Host<T> is just a "tag", not an actual product type.

Internally this setup works such that for CPU Serial backend the edm::Host<...> part is ignored, and for other backends

The edm::EDGetTokenT<T> is mapped to edm::EDGetTokenT<edm::Product<T>>
The edm::EDGetTokenT<edm::Host<T>> is mapped to edm::EDGetTokenT<T>

For this setup to work an ALPAKA_ACCELERATOR_NAMESPACE::Event class is defined to be used in the EDModules instead of edm::Event. It wraps the edm::Event, and implements the aforementioned mapping logic (for getting and putting side) with a set of helper classes that are specialized for the backends. The ALPAKA_ACCELERATOR_NAMESPACE::EDProducer(ExternalWork) class implements the (reverse) mapping logic for the consumes() and produces() side.

The cms::alpakatools::Product<TQueue, T> is transformed into edm::Product<T> that can hold arbitrary metadata via type erasure (currently std::any for demonstration purposes). For Alpaka EDModules a ALPAKA_ACCELERATOR_NAMESPACE::ProductMetadata class is defined for the metadata purpose. This class(es) took also some of the functionality of ScopedContext that seems to work there better in this abstraction model (actually the kokkos version has similar structure here).

The ScopedContext class structure is completely reorganized, and is now completely hidden from the developers. There is now an ALPAKA_ACCELERATOR_NAMESPACE::impl::FwkContextBase base class for the common functionality between ED modules and ES modules (although the latter is not exercised in this prototype, so this is what I believe to be the common functionality). The ALPAKA_ACCELERATOR_NAMESPACE::EDContext class derives the FwkContextBase and adds ED specific functionality. I guess the FwkContextBase and EDContext could be implemented also as templates instead of placing them into ALPAKA_ACCELERATOR_NAMESPACE (they are hidden from developers anyway).

A third context class, ALPAKA_ACCELERATOR_NAMESPACE::Context, is defined to be given to the developers (via EDModule::produce() argument). It gives access to the Queue object. Internally it also signals to the FwkContextBase when the Queue has been asked by the developer, so that if the EDModule accesses its input products for the first time after that point, it won't try to re-use the Queue from the input product (because the initially assigned Queue is already being used). This Context class can be later extended e.g. along https://github.com/cms-patatrack/pixeltrack-standalone/pull/256.

One additional piece that would reduce the number of places where the edm::Host<T> would appear in user code, but is not prototyped here, would be automating the (mainly device-to-host) transfers. As long as the type T can be arbitrary, framework needs to be told how to transfer that type between two memory spaces (e.g. something along a plugin factory for functions), but at least these transfers would not have to expressed in the configuration anymore.

makortel commented 2 years ago

@fwyzard This is the prototype I mentioned earlier (and apparently failed to open in a draft mode...).

makortel commented 2 years ago

Rebased on top of master to fix conflicts in src/alpakatest/Makefile.

fwyzard commented 2 years ago

Could this be extended to better handle multiple backends with the same memory space ?

Currently we define a backend with

the memory space of the "accelerator" (host vs cuda vs rocm)
how the accelerator runs a kernel (e.g. CPU serial vs TBB)
how the host enqueues the work (blocking/sync vs non-blocking/async)

In principle we should have different execution options for the same memory space: cpu sync vs tbb sync, cuda sync vs cuda async, etc.

Do you think the approach researched here could be used to have a single data product (both in terms of dataformat type, and of underlying memory buffer/soa) shared among different execution cases ?

One concrete example would be having the CPU serial implementation for every module, and the TBB (serial) only for some modules where the extra parallelism makes sense.

makortel commented 2 years ago

Could this be extended to better handle multiple backends with the same memory space ? ... Do you think the approach researched here could be used to have a single data product (both in terms of dataformat type, and of underlying memory buffer/soa) shared among different execution cases ?

I think this approach would allow such an extension. There would certainly be many details to be worked out (like how to make the framework enough aware of memory and execution spaces, including supporting multiple devices of the same type, but in a generic way). But I'd expect the user-facing interfaces would stay mostly the same.

I have also the CUDA managed memory / SYCL shared memory in mind (for platforms that have a truly unified memory), in which case it would be nice if the downstream, alpaka-independent consumers could use directly the data product wrapped in edm::Product (as it is called here) after a proper synchronization. With edm::Product<T> class template being part of the framework we could peek in there (like with edm::View).

Of course, for any of this "using data products of one memory space in many backends" to work at all, the data product the EDProducer perceives to produce should be exactly the same type in all the backends for which this "sharing" is done (but IIUC you also wrote that).

For Serial/TBB backends using the same product types should, in principle, be trivial (and therefore the setup should be straightforward if the TBB backend uses a synchronous queue).

fwyzard commented 2 years ago

OK, so we are thinking about:

unified memory / shared memory: different "accelerators" (cpu vs gpu), in the same memory space (unified addressing space accessible from all devices), with different queue types (sync vs async);
serial execution vs internal parallelism with TBB: different "accelerators" (cpu serial vs cpu parallel), in the same memory space (host memory), with the same queue type (sync).

At lease for debugging, it might be useful to support also:

sync GPU queues, async CPU queues: a given "accelerator", with its given memory space, but with both queue types (sync and async).

I'm starting to see why alpaka keeps the three concepts almost orthogonal...

makortel commented 2 years ago

Made effectively obsolete by https://github.com/cms-sw/cmssw/pull/39428

cms-patatrack / pixeltrack-standalone

[alpakatest][RFC] Prototype evolution of EDModule API #314