Open fwyzard opened 9 months ago
cms-bot internal usage
A new Issue was created by @fwyzard Andrea Bocci.
@Dr15Jones, @makortel, @rappoccio, @smuzaffar, @antoniovilela, @sextonkennedy can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
assign core, heterogeneous
New categories assigned: core,heterogeneous
@Dr15Jones, @fwyzard, @makortel, @smuzaffar you have been requested to review this Pull request/Issue and eventually sign. Thanks
This comment is mostly just thinking out loud. I want to find out whether `module.alpaka.backend = 'serial_sync'` could be made to work for this case, or whether that could cause any issues.

`process.options.accelerators` specifies the set of accelerators that the job may use. I.e. with `accelerators = ['gpu-nvidia', 'cpu']` the job can run on a machine without a GPU, whereas `accelerators = ['gpu-nvidia']` would lead to a failure.
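For concreteness, a minimal sketch of such a configuration fragment (the process name is a placeholder, not taken from this issue):

```python
# Hedged sketch: illustrates process.options.accelerators in a
# CMSSW-style configuration. The process name "TEST" is hypothetical.
import FWCore.ParameterSet.Config as cms

process = cms.Process("TEST")

# Allow the job to fall back to the CPU when no NVIDIA GPU is present;
# with ['gpu-nvidia'] alone, the same job would fail on a GPU-less machine.
process.options.accelerators = ['gpu-nvidia', 'cpu']
```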
`process.options.accelerators` should drive the behavior of `@alpaka` modules when the Alpaka backend is not explicitly specified. Currently, `ProcessAcceleratorAlpaka` (which plays the python-side role in how the `@alpaka` modules are handled) requires that also the explicitly set backends be compatible with `process.options.accelerators`.
In a way the CPU is a special "accelerator", as it is always (assumed to be) present, and non-Alpaka code will use the CPU anyway. So perhaps just allowing explicitly-set host backends, irrespective of the contents of `process.options.accelerators`, would be ok.
If the previous case were allowed, what about setting the backend explicitly to anything? For example, with `module.alpaka.backend = 'cuda_async'` and `accelerators = ['cpu']`, should that work on a machine that has a GPU, or lead to an early failure? On first thought, I'm leaning towards "should continue to lead to failure".
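The conflicting combination under discussion would look roughly like this (a sketch; the module name `testProducer` is hypothetical):

```python
# Hedged sketch of the conflicting combination discussed above: the
# process is restricted to the CPU, but one module explicitly requests
# the CUDA backend. The expectation discussed is an early failure.
import FWCore.ParameterSet.Config as cms

process = cms.Process("TEST")
process.options.accelerators = ['cpu']

# "testProducer" is a placeholder for an @alpaka module in the process.
# process.testProducer.alpaka.backend = 'cuda_async'  # expected to fail early
```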
> what about setting the backend explicitly to anything? For example, `module.alpaka.backend = 'cuda_async'` when `accelerators = ['cpu']`, should that work when on a machine that has a GPU or lead to an early failure? On a first thought, I'm leaning towards "should continue to lead to failure".
I agree; I think this should fail.
What about the case where a job uses `alpaka_cuda_async::producer` explicitly? Should that fail as well?
> What about the case where a job uses `alpaka_cuda_async::producer` explicitly? Should that fail as well?
Actually, that fails because

```
----- Begin Fatal Exception 25-Jan-2024 01:25:05 CET-----------------------
An exception of category 'NotFound' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 1 stream: 0
   [1] Running path 'process_path'
   [2] Calling method for module alpaka_cuda_async::TestAlpakaProducer/'testProducer'
Exception Message:
Service Request unable to find requested service with compiler type name ' alpaka_cuda_async::AlpakaService'.
----- End Fatal Exception -------------------------------------------------
```
> What about the case where a job uses `alpaka_cuda_async::producer` explicitly? Should that fail as well?
>
> Actually, that fails because
I'm glad `alpaka_cuda_async::producer` alone fails. Theoretically a user could still hack it to work with an explicit `process.add_(cms.Service('alpaka_cuda_async::AlpakaService'))` and by somehow removing `ProcessAcceleratorAlpaka` from the `process`. But I hope that level of hacking is something that would be caught in code review (plus the removal of `ProcessAcceleratorAlpaka` would break all modules relying on the `@alpaka` suffix).
An `@alpaka` module with an explicit CPU backend will fail to run if the `process` is configured to exclude the CPU from the accelerators.
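The original snippets are not preserved in this thread; a hedged sketch of the failing combination described above (the module name `testProducer` is a placeholder, not from the original report):

```python
# Hedged sketch: an @alpaka module pinned to the host (CPU) backend
# while the process excludes the CPU from the allowed accelerators.
import FWCore.ParameterSet.Config as cms

process = cms.Process("TEST")
process.options.accelerators = ['gpu-nvidia']  # CPU excluded

# "testProducer" is a placeholder @alpaka module; pinning it to the
# serial_sync (CPU) backend conflicts with the accelerator list above,
# so the job fails to run.
# process.testProducer.alpaka.backend = 'serial_sync'
```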
Currently, the workaround is to use the `alpaka_serial_sync::` variant explicitly:
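The original example was lost in extraction; a hedged sketch of what such a configuration might look like ("TestAlpakaProducer" mirrors the module type seen in the exception message above, but its use here is illustrative):

```python
# Hedged sketch: instantiating the explicit alpaka_serial_sync:: module
# variant instead of relying on the @alpaka suffix, so the module always
# runs on the CPU regardless of process.options.accelerators.
import FWCore.ParameterSet.Config as cms

process = cms.Process("TEST")
process.testProducer = cms.EDProducer("alpaka_serial_sync::TestAlpakaProducer")
```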