bertsky / workflow-configuration

a makefilization for OCR-D workflows, with configuration examples
Apache License 2.0

parallel execution: GPU resources #1

Closed: bertsky closed this issue 4 years ago

bertsky commented 5 years ago

When executing workflows with --jobs, CPU resources are employed in parallel, up to the requested number of jobs (or as many as the load-average limit allows). (Parallelism applies when recursing into workspaces, whereas the steps within each individual workspace are built sequentially.)
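For illustration, here is a minimal sketch of that recursion pattern (the directory layout, the `workspace-*` glob and the `all` target are my own assumptions, not taken from this repo):

```makefile
# Sketch: recurse into each workspace in parallel under make --jobs;
# the processor steps inside each workspace still run sequentially,
# because each step depends on its predecessor's output fileGrp.
WORKSPACES := $(wildcard workspace-*)

all: $(WORKSPACES)

$(WORKSPACES):
	$(MAKE) -C $@   # sub-make inherits the jobserver from the top level

.PHONY: all $(WORKSPACES)
```

Since the sub-makes share the top-level jobserver, e.g. `make -j 4 all` builds up to four workspaces concurrently.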

But this fails as soon as some processors in the workflow require GPU resources, which cannot be shared, at least not by the same number of parallel jobs. Such processors will then sporadically fail with out-of-memory errors like this...

```
CUDA runtime implicit initialization on GPU:0 failed. Status: out of memory
```

...or that...

```
OOM when allocating tensor with shape[1475200] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
```

Therefore, we need a mechanism to serialise GPU-bound processors (or cap their degree of parallelism) independently of the overall job count.
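One conceivable mechanism, sketched below under assumptions of my own (the lock file path, fileGrp names and processor choices are illustrative, not taken from this repo), is to wrap each GPU-bound processor call in an exclusive file lock:

```makefile
# Sketch: serialise GPU-bound steps with an exclusive file lock,
# while CPU-only steps keep running in parallel under make --jobs.
GPU_LOCK ?= /tmp/ocrd-gpu.lock

# CPU-only step: safe to run concurrently across workspaces.
OCR-D-BIN: OCR-D-IMG
	ocrd-olena-binarize -I $< -O $@

# GPU-bound step: flock(1) blocks until the lock is free, so at
# most one such step touches the GPU at any time.
OCR-D-OCR: OCR-D-BIN
	flock $(GPU_LOCK) ocrd-calamari-recognize -I $< -O $@
```

A counted semaphore (e.g. `sem --id gpu -j 2` from GNU parallel in place of `flock`) would generalise this from mutual exclusion to allowing N concurrent GPU jobs.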