meiertgrootes commented 4 years ago

As Team Atlas, working on the phenology co-clustering notebooks, we would like to analyse the notebooks in detail, focusing on the implementation of the algorithm and on its deployment and usage with Dask. This will enable us to improve the co-clustering analysis and run it robustly with Dask.

fnattino commented 4 years ago

Co-clustering notebook

Implementation:

The co-clustering implementation entails two loops:

- An inner ‘while’ loop, which implements the actual co-clustering optimisation. It is an iterative procedure with multiple steps that runs until convergence.
- An outer ‘for’ loop, which cycles over differently-initialised co-clustering runs. The run with the lowest value of the cost function is selected and returned as the final result. The function containing this outer loop is defined as asynchronous.

The current implementation involves two parallelisation/distribution layers:

- The first layer involves the co-clustering runs (i.e. the iterations of the outer ‘for’ loop), which are independent tasks that can run simultaneously without communication.
- The second layer involves the arrays’ data chunks, for which Dask implements blocked algorithms that minimise communication.

The tasks arising from both layers are submitted to the same scheduler, which distributes them across the workers. Performance-wise this is great, because the workers' idle time during execution is minimised. Memory-wise, it might not be the most efficient approach, because all co-clustering runs need memory at the same time and all results are stored before the one with the lowest cost is selected. An approach that might be more suitable for big data would rely only on Dask's internal parallelisation (i.e. the second layer above) and loop over the runs one by one, so that only the results of two runs (the current one and the best so far) need to be kept in memory at the same time.
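The run-by-run alternative can be sketched as follows. This is a minimal illustration, not the notebook's code: `single_run` is a hypothetical stand-in that draws a random co-cluster assignment and scores it with a sum-of-squares cost, whereas the real inner ‘while’ loop would iterate to convergence. The point is only the outer loop, which holds at most the current candidate and the best result so far.

```python
import numpy as np


def single_run(Z, n_row_clusters, n_col_clusters, rng):
    """Hypothetical stand-in for one co-clustering run (no optimisation)."""
    rows = rng.integers(n_row_clusters, size=Z.shape[0])
    cols = rng.integers(n_col_clusters, size=Z.shape[1])
    cost = 0.0
    for r in range(n_row_clusters):
        for c in range(n_col_clusters):
            # Elements of Z that fall in co-cluster (r, c)
            block = Z[np.ix_(rows == r, cols == c)]
            if block.size:
                cost += float(((block - block.mean()) ** 2).sum())
    return cost, rows, cols


def best_of_runs(Z, n_row_clusters, n_col_clusters, n_runs, seed=0):
    """Run sequentially, keeping only the lowest-cost result seen so far."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_runs):
        candidate = single_run(Z, n_row_clusters, n_col_clusters, rng)
        if best is None or candidate[0] < best[0]:
            best = candidate  # the previous best can now be garbage-collected
    return best


Z = np.random.default_rng(42).random((20, 12))
best_cost, best_rows, best_cols = best_of_runs(Z, 3, 2, n_runs=10)
```

With Dask arrays in place of NumPy arrays, each call to the single-run function would still be parallelised over chunks, so only the first layer of parallelism is given up.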

In more detail, the Dask implementation in the notebook involves data and computations in two states (see here for details): i) lazy (delayed) tasks and ii) tasks that are running, or have run, in distributed memory (future objects). Lazy tasks are stored in a graph, which grows each time an operation is performed on a Dask collection (e.g. a Dask array). When an array’s persist method is called, the graph is executed up to its top-most elements, which are converted into local futures that point to the actual data in distributed memory.
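A small sketch of the two states, assuming only `dask.array` and the default local scheduler (array sizes and variable names are illustrative). Without a distributed client, `persist` keeps the computed chunks in local memory rather than as futures, but the effect on the graph is the same: the chain of lazy tasks is replaced by one entry per output chunk.

```python
import dask.array as da

# Building the expression only grows the task graph; nothing is computed yet.
x = da.ones((1000, 1000), chunks=(250, 250))
y = (x + 1).sum(axis=0)
n_tasks_lazy = len(dict(y.__dask_graph__()))

# persist() executes the graph and returns an array backed by computed chunks.
y_persisted = y.persist()
n_tasks_persisted = len(dict(y_persisted.__dask_graph__()))

# compute() on the persisted array only has to gather the existing chunks.
result = y_persisted.compute()
```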

Issues:

Asynchronous co-clustering function: The issue we faced when first running the notebook is related to the asynchronous environment in the outer ‘for’ loop. The futures generated when ‘persisting’ Dask arrays are found to return coroutine objects instead of actual data. The compute method of the resulting Dask arrays tries to access the results of the futures without awaiting them, which triggers an exception. Unclear elements:
- Is this a bug in Dask? Can this be fixed maintaining the current Dask implementation (lazy tasks + futures + asynchronous mode)?
- How was this version of the notebook run in the past?
- Most importantly, what is the advantage of the asynchronous implementation? The co-clustering function implementing the outer ‘for’ loop becomes non-blocking in asynchronous mode, but the blocking step is only delayed to the analysis stage, adding a layer of complexity to the Dask implementation. If the asynchronous environment is dropped, the implementation is simplified and the notebook runs fine.
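The failure mode can be illustrated with plain `asyncio`, independently of Dask (the function names here are made up for the example): calling an asynchronous function without awaiting it yields a coroutine object, not the data.

```python
import asyncio
import inspect


async def fetch_result():
    """Hypothetical asynchronous step that eventually yields data."""
    await asyncio.sleep(0)
    return 42


async def main():
    pending = fetch_result()                 # not awaited: a coroutine object
    is_coroutine = inspect.iscoroutine(pending)
    value = await pending                    # awaiting it yields the data
    return is_coroutine, value


is_coroutine, value = asyncio.run(main())
```

Any code that treats `pending` as if it were the result, as the compute method appears to do here, will fail in a similar way.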
Dask by default assumes functions to be pure (same input + same function = same output, with no side effects). Thus, the function implementing the inner ‘while’ loop, within which the cluster coefficients are initialised, is actually run only once (futures pointing to the same output are returned for the other runs). The argument pure=False needs to be specified to ensure that the scheduler runs all tasks in this case.
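A sketch of this behaviour, assuming the runs are submitted with `Client.submit` from `dask.distributed` (the thread does not say which call the notebook uses; `init_clusters` is a hypothetical stand-in for the randomly initialised run):

```python
import random

from dask.distributed import Client


def init_clusters(n_items, n_clusters):
    """Impure: returns a different random assignment on each call."""
    return [random.randrange(n_clusters) for _ in range(n_items)]


client = Client(processes=False)  # in-process cluster, for illustration only

# Default (pure=True): identical function and arguments map to the same key,
# so the second submission reuses the first task instead of running again.
a = client.submit(init_clusters, 100, 5)
b = client.submit(init_clusters, 100, 5)
same_key_default = a.key == b.key
shared_result = a.result() == b.result()

# pure=False: every submission gets a unique key and is executed separately.
c = client.submit(init_clusters, 100, 5, pure=False)
d = client.submit(init_clusters, 100, 5, pure=False)
same_key_impure = c.key == d.key

client.close()
```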
The algorithms employed to initialise the row and column cluster occupations differ by a “+1” term. It looks like the row implementation is the correct one, since with the column implementation not all clusters are initialised when the number of clusters is close to the number of columns. Also, the cluster initialisation is deterministic: random indices are generated (presumably to shuffle the initial cluster assignment), but the reordering step is currently missing.
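For reference, one possible initialisation that occupies every cluster and includes the reordering step is sketched below. The round-robin assignment is an assumption chosen for illustration, not the notebook's formula, and `initial_assignment` is a hypothetical helper.

```python
import numpy as np


def initial_assignment(n_items, n_clusters, rng):
    """Round-robin labels (every cluster occupied if n_items >= n_clusters),
    followed by an in-place shuffle so that runs start from different states."""
    labels = np.arange(n_items) % n_clusters  # deterministic part
    rng.shuffle(labels)                       # reordering step
    return labels


rng = np.random.default_rng(0)
# Edge case from the discussion: number of clusters close to number of columns.
col_clusters = initial_assignment(n_items=6, n_clusters=5, rng=rng)
n_occupied = np.unique(col_clusters).size
```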

fnattino commented 4 years ago

The original version of the notebook and a simplified co-clustering implementation for the toy model are available here: https://github.com/phenology/hsr-phenological-modelling/tree/co_clustering/co-clustering/notebooks

fnattino commented 4 years ago

Communicated to Serkan & Raul

NLeSC / team-atlas

Identify Issues in co-clustering notebook #71
