OSGeo / grass

GRASS GIS - free and open-source geospatial processing engine
https://grass.osgeo.org

[Feat] ParallelModuleQueue (python multiprocessing): don't wait for entire block to finish before pulling new processes #2478

Open griembauer opened 2 years ago

griembauer commented 2 years ago

The option to run GRASS modules in parallel (in Python) is implemented via the ParallelModuleQueue class. The standard way (?) is to define a processing queue with an nprocs parameter, add the GRASS modules to be executed in parallel via the put() method, and finally start the parallel processing using the wait() method. As implemented now, the queue seems to run nprocs processes at a time and wait for all of them to finish before starting the next "block" of processes. This means that the longest process determines the duration of its entire "block", while the other slots sit idle once their own processes are done. Ideally, a free slot would be filled immediately with the next pending process from the queue instead.
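To make the cost concrete, here is a minimal sketch that compares the two scheduling strategies on made-up task durations. It does not use GRASS or ParallelModuleQueue: the function names and durations are hypothetical, and the block-wise model only encodes the behaviour described above, not the actual implementation.

```python
import heapq


def blockwise_makespan(durations, nprocs):
    """Total time if tasks run in blocks of nprocs and each block
    waits for its slowest task (the behaviour described above)."""
    return sum(
        max(durations[i:i + nprocs])
        for i in range(0, len(durations), nprocs)
    )


def slot_filling_makespan(durations, nprocs):
    """Total time if each pending task starts as soon as any slot is free."""
    # Each heap entry is the time at which one slot becomes free.
    free_at = [0] * nprocs
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)  # earliest free slot
        heapq.heappush(free_at, start + d)
    return max(free_at)


# Hypothetical durations (arbitrary time units): long and short tasks alternate.
durations = [10, 1, 10, 1, 10, 1]
print(blockwise_makespan(durations, nprocs=2))     # 30
print(slot_filling_makespan(durations, nprocs=2))  # 21
```

With two slots, the block-wise model pays for the long task in every block, while slot filling lets the short tasks finish alongside the long ones.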

petrasovaa commented 2 years ago

I agree that this is a problem, which is partly why I usually just use the standard Python multiprocessing.Pool methods (like map_async) with run_command. Just curious, do you prefer ParallelModuleQueue for some specific reason?
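A rough sketch of that pattern, under some assumptions: run_one is a hypothetical stand-in that launches a small Python subprocess so the example runs without a GRASS session, and ThreadPool (which offers the same map_async interface as multiprocessing.Pool) is swapped in to avoid pickling and start-method issues. In a real session the worker body would call run_command from grass.script, and Pool could be used instead.

```python
import subprocess
import sys
from multiprocessing.pool import ThreadPool


def run_one(params):
    """Hypothetical stand-in for one module call.

    In a GRASS session this would be something like
    gs.run_command("<module>", **params) (assumption, not shown here).
    """
    cmd = [sys.executable, "-c", f"print({params['value']} * 2)"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return int(result.stdout.strip())


def run_all(param_list, nprocs):
    """Run all tasks with at most nprocs running at the same time."""
    with ThreadPool(processes=nprocs) as pool:
        # chunksize=1 hands out tasks one by one, so a worker that
        # finishes early picks up the next pending task right away.
        async_result = pool.map_async(run_one, param_list, chunksize=1)
        return async_result.get()  # blocks until every task is done


if __name__ == "__main__":
    print(run_all([{"value": v} for v in (1, 2, 3)], nprocs=2))
```

Results come back in input order, regardless of which task finished first.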

griembauer commented 2 years ago

No, not at all, I am just used to using it since it is the pygrass way ;) Also, some GRASS modules from the temporal framework use ParallelModuleQueue, e.g. for aggregation: https://github.com/OSGeo/grass/blob/1961472afeb7633c9b744b0a60c923fb9b1d4411/python/grass/temporal/aggregation.py#L267