dask / dask

Parallel computing with task scheduling
https://dask.org
BSD 3-Clause "New" or "Revised" License

Compile Blockwise array sub graphs #6123

Open mrocklin opened 4 years ago

mrocklin commented 4 years ago

We currently collect a broad class of operations into high level graphs.

In [1]: import dask.array as da                                                                          

In [2]: from dask.blockwise import optimize_blockwise                                                    

In [3]: x = da.ones((10, 10), chunks=(5, 5))                                                             

In [4]: y = da.cos(x) ** 2 + da.sin(x) ** 2                                                              

In [5]: graph = optimize_blockwise(y.dask)                                                               

In [6]: {k: type(v) for k, v in graph.layers.items()}                                                    
Out[6]: 
{'add-a2d88408268290d7edfe4ede916b05b4': dask.blockwise.Blockwise,
 'ones-b8c5986474b86c6e3a94246ecba1c6e4': dict}

In [7]: layer = graph.layers["add-a2d88408268290d7edfe4ede916b05b4"]                                     

In [8]: layer                                                                                            
Out[8]: Blockwise<((2, None), ('ones-b8c5986474b86c6e3a94246ecba1c6e4', ('.1', '.0'))) -> add-a2d88408268290d7edfe4ede916b05b4>

In [9]: layer.dsk                                                                                        
Out[9]: 
{'add-a2d88408268290d7edfe4ede916b05b4': (<function _operator.add(a, b, /)>,
  'pow-235ad71f84352db184f344e57d4ac82d',
  'pow-4f17fda523d5e314cbcdb23b6972e013'),
 'pow-235ad71f84352db184f344e57d4ac82d': (<function _operator.pow(a, b, /)>,
  'cos-58716fab65fe91a0e56f8d12baabec04',
  '_0'),
 'cos-58716fab65fe91a0e56f8d12baabec04': (<ufunc 'cos'>, '_1'),
 'pow-4f17fda523d5e314cbcdb23b6972e013': (<function _operator.pow(a, b, /)>,
  'sin-968a3d0e9b788b4f0cc55ac5c3e938f7',
  '_0'),
 'sin-968a3d0e9b788b4f0cc55ac5c3e938f7': (<ufunc 'sin'>, '_1')}

In [10]: layer.__dict__                                                                                  
Out[10]: 
{'output': 'add-a2d88408268290d7edfe4ede916b05b4',
 'output_indices': ('.1', '.0'),
 'dsk': {'add-a2d88408268290d7edfe4ede916b05b4': (<function _operator.add(a, b, /)>,
   'pow-235ad71f84352db184f344e57d4ac82d',
   'pow-4f17fda523d5e314cbcdb23b6972e013'),
  'pow-235ad71f84352db184f344e57d4ac82d': (<function _operator.pow(a, b, /)>,
   'cos-58716fab65fe91a0e56f8d12baabec04',
   '_0'),
  'cos-58716fab65fe91a0e56f8d12baabec04': (<ufunc 'cos'>, '_1'),
  'pow-4f17fda523d5e314cbcdb23b6972e013': (<function _operator.pow(a, b, /)>,
   'sin-968a3d0e9b788b4f0cc55ac5c3e938f7',
   '_0'),
  'sin-968a3d0e9b788b4f0cc55ac5c3e938f7': (<ufunc 'sin'>, '_1')},
 'indices': ((2, None),
  ('ones-b8c5986474b86c6e3a94246ecba1c6e4', ('.1', '.0'))),
 'numblocks': {'ones-b8c5986474b86c6e3a94246ecba1c6e4': (2, 2)},
 'concatenate': None,
 'new_axes': {}}

This includes elementwise operations (as above) as well as transpose, the first stages of reductions, tensordot, and so on. The object layer.dsk is a computation to run within a single task. Normally it gets evaluated by dask.get (I think), a simple single-threaded scheduler, which calls these functions in a sensible order on the input chunks.
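
To make that per-block evaluation concrete, here is a rough sketch (continuing the session above) that runs this sub-graph on a single chunk with dask.core.get. It assumes that the '_0'/'_1' placeholders correspond, in order, to the entries of layer.indices, i.e. the constant exponent 2 and one block of the ones array:

import numpy as np
import dask.core

# Copy the Blockwise sub-graph and fill in the block-level inputs.
dsk = dict(layer.dsk)
dsk["_0"] = 2                 # the constant exponent from `indices`
dsk["_1"] = np.ones((5, 5))   # one input chunk of the `ones` array

# Evaluate the tasks in a sensible order, as a simple scheduler would.
result = dask.core.get(dsk, layer.output)
assert np.allclose(result, 1.0)  # cos(x)**2 + sin(x)**2 == 1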

However, we could also choose to be more intelligent here and modify this sequence of functions. This is a good time to perform optimizations because we know that this one sub-graph is likely to be small, and likely to be run many times across all of our chunks. A few optimizations have come up in the past:

  1. Using inplace operations, like rewriting a = a + 1 to a += 1 when safe to do so
  2. Handing everything to Numba, which may perform optimizations like the above, and hopefully others
  3. Handing everything to Numba and optionally using some of the fancier flags within Numba, like fastmath=True (see the sketch below)

This came up briefly in https://github.com/dask/dask/issues/1964. @jcrist may also have an experiment lying around.
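
As a rough, hand-written illustration of options 2 and 3 (not an actual implementation; a real version would have to generate this function from layer.dsk), the whole sub-graph above could be collapsed into a single Numba-compiled kernel:

import numba
import numpy as np

# Hypothetical fused kernel equivalent to the cos(x)**2 + sin(x)**2 sub-graph.
# fastmath=True relaxes strict IEEE semantics (allowing e.g. reassociation and
# FMA), which can change results slightly and so should be opt-in.
@numba.njit(fastmath=True)
def fused_block(x):
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            v = x[i, j]
            out[i, j] = np.cos(v) ** 2 + np.sin(v) ** 2
    return out

fused_block(np.ones((5, 5)))  # would replace the five separate tasks per block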

GPUs

Avoiding memory copies becomes more important when the underlying chunks are cupy rather than numpy arrays. An optimization of this sort would also benefit from knowing the metadata of the input chunks.
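
For example (a sketch only, not something the Blockwise layer does today): rewriting the generated block function to use ufunc out= arguments avoids allocating a new array for every intermediate, and for chunk types that dispatch through NumPy's __array_ufunc__ protocol (as recent cupy does) the same function should run unchanged on GPU chunks:

import numpy as np

# Sketch of an inplace-rewritten block function: `out=` arguments reuse
# buffers instead of allocating a new array for every intermediate.
def fused_block_inplace(x):
    c = np.cos(x)
    np.multiply(c, c, out=c)   # c = cos(x) ** 2
    s = np.sin(x)
    np.multiply(s, s, out=s)   # s = sin(x) ** 2
    np.add(c, s, out=c)        # c = cos(x)**2 + sin(x)**2, no extra allocation
    return c

fused_block_inplace(np.ones((5, 5)))  # intended to also work on cupy.ones((5, 5))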

jakirkham commented 4 years ago

cc @quasiben @madsbk @pentschev

martindurant commented 3 years ago

While looking through old issues, I noticed that this one has gone without activity for some time. Is it still of interest given the changes in high-level graphs since then? If yes, this sounds like a cool self-contained project that we should specifically task someone with.