With the sub-process-enabled local cluster we can prepare the next computation graph while the current well is being computed.
For large acquisitions (3D with multiple fields) this is still very slow. The reason is the size of a single computation graph, which can reach 10-20 MB. A dask warning suggests splitting the problem into smaller sub-problems. In our case this would probably mean doing the stitching and binning in one step and writing that result to disk without the resolution pyramid levels. The pyramid levels are then computed afterwards with `da.from_zarr`, `da.coarsen` and `da.to_zarr`. This breaks the one large computation graph up into multiple smaller ones.
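A minimal sketch of this two-stage approach, with a random placeholder array standing in for the real stitched and binned well; the `well.zarr` store name, component layout and number of levels are assumptions:

```python
import dask.array as da
import numpy as np

# Placeholder for the real stitched + binned well (one dask array per well).
stitched = da.random.random((64, 512, 512), chunks=(16, 256, 256))

# Step 1: compute stitching + binning and write only the full-resolution
# level; this graph contains no pyramid tasks and stays much smaller.
stitched.to_zarr("well.zarr", component="0", overwrite=True)

# Step 2: build each pyramid level in its own small graph, reading the
# previous level back from disk instead of carrying it all in one graph.
n_levels = 4  # assumed number of resolution levels
for level in range(1, n_levels):
    previous = da.from_zarr("well.zarr", component=str(level - 1))
    # 2x mean-downsampling along every axis (2x2x2 for a 3D stack);
    # trim_excess drops edge voxels when a size is not divisible by 2.
    coarse = da.coarsen(np.mean, previous,
                        {ax: 2 for ax in range(previous.ndim)},
                        trim_excess=True)
    coarse.to_zarr("well.zarr", component=str(level), overwrite=True)
```

Each `to_zarr` call runs its own, much smaller graph, so the scheduler never has to hold the full pyramid at once.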
Each dask computation graph sent to the scheduler requires the scheduler to do some bookkeeping. Although this work is "small" per task, it does not scale well once a graph contains around 10'000 tasks. While the scheduler is integrating a new computation graph into the already existing set of tasks, scheduling slows down significantly. Since we know that each computation graph belongs to a single well, we can wait until one well is fully processed before we submit the next one. This keeps the number of tasks the scheduler has to manage at a minimum.
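As a sketch, the per-well submission could look like the following, with a hypothetical `process_well` standing in for the real graph construction of one well and a placeholder well list:

```python
import dask
from dask.distributed import Client


@dask.delayed
def process_well(well):
    # Placeholder for the real stitching/binning/writing work of one well.
    return f"processed {well}"


if __name__ == "__main__":
    client = Client(processes=True)  # the sub-process enabled local cluster
    wells = ["A01", "A02", "B01"]    # placeholder well list

    pending = None
    for well in wells:
        graph = process_well(well)       # prepare the next graph client-side
        if pending is not None:
            pending.result()             # block: previous well must finish first
        pending = client.compute(graph)  # only one well's tasks on the scheduler
    if pending is not None:
        pending.result()                 # wait for the last well
```

Building the next graph before calling `result()` on the previous future overlaps graph construction with the current well's computation, as described above, while the scheduler only ever holds one well's tasks.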