I'm wondering how best to implement this. Maybe task groups should/could/must track the min/max priorities of their contained tasks? Or maybe track min/max priorities per Computation, so that state from different computations doesn't get mixed up.
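To make the min/max idea concrete, here is a rough sketch, not anything that exists in distributed today. It assumes each task carries a single comparable priority where a smaller value means "scheduled earlier"; the function names and the example priorities are made up for illustration.

```python
def track_priority_ranges(tasks):
    """Fold (group_name, priority) pairs into a (min, max) range per group.

    Assumption for this sketch: priority is a plain comparable value and
    smaller means the task is scheduled earlier.
    """
    ranges = {}
    for group, priority in tasks:
        lo, hi = ranges.get(group, (priority, priority))
        ranges[group] = (min(lo, priority), max(hi, priority))
    return ranges


def order_groups(ranges):
    """Sort group names by their (min, max) priority range, earliest first."""
    return sorted(ranges, key=lambda group: ranges[group])


# Hypothetical priorities, deliberately fed in scrambled order.
tasks = [
    ("sub", 5),
    ("make-timeseries", 0),
    ("repartition", 2),
    ("make-timeseries", 1),
    ("dataframe-sum", 7),
    ("repartition", 3),
    ("dataframe-count", 6),
    ("sub", 4),
]

ranges = track_priority_ranges(tasks)
bar_order = order_groups(ranges)
```

Since each range only ever widens as tasks are added, it can be updated in constant time per task. Keeping one such mapping per Computation would be one way to keep unrelated graphs from interleaving.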
I believe the progress bars on the dashboard are currently sorted by group size (largest first):
https://github.com/dask/distributed/blob/bfc5cfea80450954dba5b87a5858cb2e3bac1833/distributed/diagnostics/progress_stream.py#L94
This is a cheap metric that probably sometimes approximates topological order. Of course, it's wrong for any fan-out operations (repartition, shuffle, etc.).
But progress might be easier to watch and decipher if it were in actual topological order. Bars would then be most full at the top, and least full at the bottom.
It took me a long time of using dask to actually understand what the progress bars were showing. I think that's because it felt so random which ones were completing first.
It seems doable to maintain topological ordering, but doing it efficiently might require keeping some state between updates.
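As a rough illustration of what "topological order plus some state between updates" could look like, here is a minimal sketch using the standard library's `graphlib`. The group-level dependency mapping is a simplified guess at the example graph, and `GroupOrderCache` is a hypothetical helper, not part of distributed.

```python
from graphlib import TopologicalSorter

# Hypothetical group-level dependencies: group -> groups it depends on.
group_deps = {
    "make-timeseries": set(),
    "repartition": {"make-timeseries"},
    "sub": {"repartition"},
    "dataframe-count": {"sub"},
    "dataframe-sum": {"sub"},
}


class GroupOrderCache:
    """Keep the last computed order and reuse it while the graph is unchanged."""

    def __init__(self):
        self._key = None
        self._order = []

    def order(self, deps):
        # Snapshot of the dependency structure, used to detect changes.
        key = frozenset((group, frozenset(d)) for group, d in deps.items())
        if key != self._key:
            # static_order() yields each group after all of its dependencies.
            self._order = list(TopologicalSorter(deps).static_order())
            self._key = key
        return self._order


cache = GroupOrderCache()
bar_order = cache.order(group_deps)
```

The order of groups with no dependency between them (here `dataframe-count` and `dataframe-sum`) is not meaningful in this sketch, so a real implementation would probably want a stable tie-breaker, such as group name, to stop bars from jumping around between updates.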
Current
In the above example, `make-timeseries` comes first in topological order, then the `repartition`s, then `sub`, then `dataframe-count` and `dataframe-sum`.
Proposed