When executing workflows with `--jobs`, CPU resources are used in parallel (up to the requested number of jobs, or as determined automatically via the load factor). (This applies when recursing into workspaces, whereas each individual workspace is still built sequentially.)
But this fails when some processors in the workflow require GPU resources that cannot be shared, at least not with the same number of parallel jobs. These processors then sporadically fail with out-of-memory errors like this...
```
CUDA runtime implicit initialization on GPU:0 failed. Status: out of memory
```
...or that...
```
OOM when allocating tensor with shape[1475200] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
```
Therefore, we need a mechanism to

- know which processors allocate GPU resources,
- know how many such processors can be run in parallel, and then
- either reduce the parallel execution to that number in general, or
- queue only those processors specifically (creating a local bottleneck for each workspace), as sketched below.
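For the last option, one possible implementation is an inter-process semaphore around the GPU-bound processor calls, so that the parallel jobs queue up for a limited number of GPU "slots". The following is only a minimal sketch under assumptions that are not part of this proposal: the GPU-bound processors are identified by name, a lock directory is shared between all jobs, and the wrapper itself, the processor names, the lock path and the `MAX_GPU_JOBS` limit are all hypothetical.

```python
# Sketch only: serialize GPU-bound processors across parallel jobs via file locks.
import fcntl
import os
import subprocess
import sys
import time

# Hypothetical example of processors known to allocate GPU memory:
GPU_PROCESSORS = {"ocrd-calamari-recognize"}
LOCK_DIR = "/tmp/gpu-locks"   # assumed to be shared by all parallel jobs
MAX_GPU_JOBS = 1              # how many GPU processors may run concurrently


def acquire_gpu_slot():
    """Block until one of MAX_GPU_JOBS lock files can be locked exclusively."""
    os.makedirs(LOCK_DIR, exist_ok=True)
    while True:
        for i in range(MAX_GPU_JOBS):
            fd = os.open(os.path.join(LOCK_DIR, f"slot{i}.lock"),
                         os.O_CREAT | os.O_RDWR)
            try:
                fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
                return fd  # keep the fd open; closing it releases the lock
            except BlockingIOError:
                os.close(fd)
        time.sleep(1)  # all slots busy: wait and retry


def run_processor(argv):
    """Run a processor command line, queuing it first if it needs the GPU."""
    fd = acquire_gpu_slot() if argv and argv[0] in GPU_PROCESSORS else None
    try:
        return subprocess.call(argv)
    finally:
        if fd is not None:
            os.close(fd)  # release the GPU slot


if __name__ == "__main__":
    sys.exit(run_processor(sys.argv[1:]))
```

With `MAX_GPU_JOBS=1` this degenerates to a single exclusive lock; the same effect could presumably also be achieved on the shell level, e.g. by wrapping only the GPU-bound calls in `flock` or GNU parallel's `sem`.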