VisionSystemsInc / terra

Terra - Run your algorithm anywhere on earth
MIT License

ProcessPoolExecutor fork vs. spawn #153

Open drewgilliam opened 1 year ago

drewgilliam commented 1 year ago

Terra resource management is not compatible with the spawn start method for ProcessPoolExecutor. Current workaround is to use a ThreadPoolExecutor.

torch appears to require the spawn start method for ProcessPoolExecutor when CUDA is used in child processes: https://pytorch.org/docs/stable/notes/multiprocessing.html#cuda-in-multiprocessing

This can be set using the following torch code:

torch.multiprocessing.set_start_method("spawn", force=True)

Or by setting mp_context when initializing the ProcessPoolExecutor:

from multiprocessing import get_context
from terra.executor.process import ProcessPoolExecutor
mp_context = get_context('spawn')
Executor = ProcessPoolExecutor(max_workers=3, mp_context=mp_context)

Unfortunately, a spawned ProcessPoolExecutor re-imports Python modules in each child process. Because the resource lock directory depends on os.getpid(), each child process ends up with a different lock directory: https://github.com/VisionSystemsInc/terra/blob/e24792b8d0ec91f7c054c21930564ab3c586115e/terra/executor/resources.py#L126-L129

Because each child process uses a different lock directory, no child process is aware of the resource locks held by the others. Every child process is thus able to claim the first resource, which results in processing failure.
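One possible direction, sketched here under assumptions and not taken from terra itself, is to have the parent choose the lock directory once and publish it through an environment variable. Child processes inherit the parent's environment under both fork and spawn, so a re-imported module would resolve the same path. The names LOCK_DIR_ENV, pid_based_lock_dir, and shared_lock_dir are hypothetical.

```python
import os
import tempfile

# Hypothetical environment variable name; not part of terra.
LOCK_DIR_ENV = "DEMO_RESOURCE_LOCK_DIR"


def pid_based_lock_dir():
  # Mimics the problem: the path depends on the pid of the calling process,
  # so every spawned child that re-imports the module computes its own path.
  return os.path.join(tempfile.gettempdir(), f"demo_locks_{os.getpid()}")


def shared_lock_dir():
  # The first caller (normally the parent) picks the path and exports it.
  # Later callers, including children that inherit the environment,
  # read the exported value instead of recomputing it from their own pid.
  path = os.environ.get(LOCK_DIR_ENV)
  if path is None:
    path = pid_based_lock_dir()
    os.environ[LOCK_DIR_ENV] = path
  return path
```

For this to work, the parent would have to call shared_lock_dir() before creating the executor, so that the variable is already set when the children start.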

Testing the spawn start method is possible by adding the following to test_executor_resources.py after TestResourceProcess. However, this change currently produces a different error: the data dictionary is empty because each spawned child re-imports the test module (e.g., simple_acquire is unable to find data[name]).

https://github.com/VisionSystemsInc/terra/blob/e24792b8d0ec91f7c054c21930564ab3c586115e/terra/tests/test_executor_resources.py

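from multiprocessing import get_context  # needed if not already imported

class ProcessPoolExecutorSpawn(ProcessPoolExecutor):
  # Same executor, but always created with the spawn start method
  def __init__(self, *args, **kwargs):
    kwargs['mp_context'] = get_context('spawn')
    super().__init__(*args, **kwargs)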

class TestResourceProcessSpawn(TestResourceProcess):
  # Test for multiprocess spawn case
  def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    self.Executor = ProcessPoolExecutorSpawn
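The empty data dictionary described above comes from module-level state that exists only in the parent. A minimal sketch of one way around it, using the standard library concurrent.futures.ProcessPoolExecutor (not terra's executor) and its initializer argument, is shown below. The names data, init_worker, lookup, and run_demo are hypothetical stand-ins for the test module's own objects, and whether terra's ProcessPoolExecutor forwards initializer in the same way is an assumption.

```python
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import get_context

# Module-level state, analogous to the test module's data dictionary.
# A spawned child re-imports this module and starts with an empty dict.
data = {}


def init_worker(initial):
  # Runs once in each worker process and repopulates the module global.
  data.update(initial)


def lookup(name):
  # Stand-in for a task that reads the module-level dictionary.
  return data[name]


def run_demo():
  # Set in the parent only; spawned workers receive a copy via initargs.
  data["a"] = 1
  with ProcessPoolExecutor(max_workers=2,
                           mp_context=get_context("spawn"),
                           initializer=init_worker,
                           initargs=(dict(data),)) as executor:
    return executor.submit(lookup, "a").result()
```

With the spawn start method, run_demo() should be called from under an `if __name__ == "__main__":` guard in a script, since each child re-imports the main module.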

Issue discovered by @decrispell during terra_real3d development, while attempting to run multiple torch tasks, each with a single assigned GPU.