cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

An alpaka module with an explicit cpu backend fails to run in a job with a list of accelerators that does not include the cpu #43780

Open fwyzard opened 8 months ago

fwyzard commented 8 months ago

An @alpaka module with an explicit CPU backend such as

process.testProducerSerial = cms.EDProducer('TestAlpakaProducer@alpaka',
    size = cms.int32(99),
    alpaka = cms.untracked.PSet(
        backend = cms.untracked.string("serial_sync")
    )
)

will fail to run if the process is configured to exclude the CPU from the accelerators:

process.options.accelerators = [ 'gpu-nvidia' ]

with the message:

An exception of category 'UnavailableAccelerator' occurred while
   [0] Processing the python configuration file named writer.py
Exception Message:
Module testProducerSerial has the Alpaka backend set explicitly, but its accelerator is not available for the job because of the combination of the job configuration and accelerator availability on the machine. The following Alpaka backends are available for the job cuda_async.

Currently, the workaround is to use the alpaka_serial_sync:: variant explicitly:

process.testProducerSerial = cms.EDProducer('alpaka_serial_sync::TestAlpakaProducer',
    size = cms.int32(99)
)

cmsbuild commented 8 months ago

A new Issue was created by @fwyzard Andrea Bocci.

@Dr15Jones, @makortel, @rappoccio, @smuzaffar, @antoniovilela, @sextonkennedy can you please review it and eventually sign/assign? Thanks.

fwyzard commented 8 months ago

assign core, heterogeneous

cmsbuild commented 8 months ago

New categories assigned: core,heterogeneous

@Dr15Jones,@fwyzard,@makortel,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented 8 months ago

This comment is mostly me thinking out loud. I want to find out whether module.alpaka.backend = 'serial_sync' could be made to work for this case, or whether that could cause any issues.

process.options.accelerators specifies the set of accelerators that the job may use. I.e. with accelerators = ['gpu-nvidia', 'cpu'] the job can run on a machine without a GPU, whereas accelerators = ['gpu-nvidia'] would lead to a failure on such a machine.
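The semantics described above can be sketched in plain Python, without any CMSSW imports. The function name `usable_accelerators` and the set-based model of the machine are assumptions made for illustration only; this is not how the framework implements the check.

```python
def usable_accelerators(configured, present_on_machine):
    """Return the configured accelerators that the machine provides.

    Hypothetical helper: raises RuntimeError when nothing is usable,
    standing in for the early failure of the job.
    """
    usable = [acc for acc in configured if acc in present_on_machine]
    if not usable:
        raise RuntimeError(
            "none of the configured accelerators %s is present on this machine"
            % list(configured)
        )
    return usable


# A machine without a GPU: only the CPU is present.
cpu_only_machine = {"cpu"}

# The job may fall back to the CPU, so it can run here.
print(usable_accelerators(["gpu-nvidia", "cpu"], cpu_only_machine))  # ['cpu']

# The job is restricted to NVIDIA GPUs, so it fails on this machine.
try:
    usable_accelerators(["gpu-nvidia"], cpu_only_machine)
except RuntimeError as err:
    print("job fails:", err)
```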

process.options.accelerators should drive the behavior of an @alpaka module when the Alpaka backend is not explicitly specified.

Currently, ProcessAcceleratorAlpaka (which plays the Python-side role in how the @alpaka modules are handled) requires that explicitly set backends also be compatible with process.options.accelerators.

In a way, the CPU is a special "accelerator" as it is always (assumed to be) present, and non-Alpaka code will use the CPU anyway. So perhaps just allowing explicitly set host backends irrespective of the contents of process.options.accelerators would be OK.

If the previous case were allowed, what about setting the backend explicitly to anything? For example, with module.alpaka.backend = 'cuda_async' and accelerators = ['cpu'], should that work on a machine that has a GPU, or lead to an early failure? On first thought, I'm leaning towards "should continue to lead to failure".
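To make the two policies concrete, here is a minimal plain-Python sketch of the check being discussed. Everything in it is hypothetical: the function `explicit_backend_allowed`, the `allow_host_always` switch, and the two-entry mapping (built only from the backend and accelerator names that appear in this thread) are illustrations, not the actual ProcessAcceleratorAlpaka code. It also assumes that accelerators is a plain list of names.

```python
# Hypothetical mapping, restricted to the names used in this issue.
BACKEND_TO_ACCELERATOR = {
    "serial_sync": "cpu",
    "cuda_async": "gpu-nvidia",
}
# Backends that run on the host CPU.
HOST_BACKENDS = {"serial_sync"}


def explicit_backend_allowed(backend, accelerators, allow_host_always=True):
    """Decide whether an explicitly set backend is accepted for the job.

    allow_host_always=False models the current behavior: every explicit
    backend must match an entry in `accelerators`.
    allow_host_always=True models the idea floated above: host backends
    are accepted regardless of `accelerators`, device backends are not.
    """
    if backend not in BACKEND_TO_ACCELERATOR:
        raise ValueError("unknown backend: %r" % backend)
    if allow_host_always and backend in HOST_BACKENDS:
        return True
    return BACKEND_TO_ACCELERATOR[backend] in accelerators


# Current behavior: serial_sync is rejected when the CPU is not listed.
print(explicit_backend_allowed("serial_sync", ["gpu-nvidia"], allow_host_always=False))  # False
# Relaxed policy: the host backend is accepted anyway.
print(explicit_backend_allowed("serial_sync", ["gpu-nvidia"]))  # True
# Under either policy, cuda_async with accelerators = ['cpu'] is rejected.
print(explicit_backend_allowed("cuda_async", ["cpu"]))  # False
```

Under this sketch the only behavioral difference between the two policies is the treatment of host backends, which matches the "should continue to lead to failure" preference for device backends.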

fwyzard commented 8 months ago

what about setting the backend explicitly to anything? For example, with module.alpaka.backend = 'cuda_async' and accelerators = ['cpu'], should that work on a machine that has a GPU, or lead to an early failure? On first thought, I'm leaning towards "should continue to lead to failure".

I agree, I think this should fail.

What about the case where a job uses alpaka_cuda_async::producer explicitly? Should that fail as well?

fwyzard commented 8 months ago

What about the case where a job uses alpaka_cuda_async::producer explicitly? Should that fail as well?

Actually, that fails because

----- Begin Fatal Exception 25-Jan-2024 01:25:05 CET-----------------------
An exception of category 'NotFound' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 1 stream: 0
   [1] Running path 'process_path'
   [2] Calling method for module alpaka_cuda_async::TestAlpakaProducer/'testProducer'
Exception Message:
Service Request unable to find requested service with compiler type name ' alpaka_cuda_async::AlpakaService'.
----- End Fatal Exception -------------------------------------------------

makortel commented 8 months ago

What about the case where a job uses alpaka_cuda_async::producer explicitly? Should that fail as well?

Actually, that fails because

I'm glad alpaka_cuda_async::producer alone fails. Theoretically, a user could still hack it to work with an explicit process.add_(cms.Service('alpaka_cuda_async::AlpakaService')) and by somehow removing ProcessAcceleratorAlpaka from the process. But I hope this level of hacking is something that would be caught in code review (plus, removing ProcessAcceleratorAlpaka would break all modules relying on the @alpaka suffix).