NCAS-CMS / cf-python

A CF-compliant Earth Science data analysis library
http://ncas-cms.github.io/cf-python
MIT License

Optional progress bar for intensive incremental operations #183

Open sadielbartholomew opened 3 years ago

sadielbartholomew commented 3 years ago

For select operations that may take a non-negligible time to complete (from the user's perspective), it could be useful to display small indicators of progress whilst the operation is running, either in interactive Python or on the command line (i.e. for cfa, since aggregation is one of the main operations I have in mind that can be time-consuming for specific inputs).

Notably, this could be a progress bar or a spinner, so that the user knows roughly how long to wait for completion or, in the latter case, can at least see that the operation hasn't stalled.

(As we have seen, Dask has a similar feature for diagnostics etc.: [image: dask_progress_bar])

Some users may not want this feature, so I suggest we have a global configuration option to disable it, with the feature enabled by default.

Implementation options

It wouldn't be difficult to code up a primitive progress bar and spinner, but for something available out-of-the-box and with more options and support, I think it would be better to use a dedicated progress bar library.
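To give a sense of what a primitive version might involve, here is a minimal standard-library sketch. The names `render_bar`, `with_bar` and `with_spinner` are hypothetical and are not part of cf-python; a dedicated library would handle terminal width, timing estimates and nested bars, which this does not.

```python
import itertools
import sys


def render_bar(done, total, width=20):
    """Return a text bar such as '[#####-----] 50%'."""
    fraction = done / total if total else 1.0
    filled = int(width * fraction)
    return "[" + "#" * filled + "-" * (width - filled) + f"] {fraction:.0%}"


def with_bar(items, stream=sys.stderr):
    """Yield each item, redrawing a bar once the caller has processed it.

    The total must be known, so the input is materialised as a list.
    """
    items = list(items)
    total = len(items)
    for done, item in enumerate(items, start=1):
        yield item
        # Execution resumes here when the caller asks for the next item,
        # i.e. after the caller has finished with the current one.
        stream.write("\r" + render_bar(done, total))
        stream.flush()
    stream.write("\n")


def with_spinner(iterable, stream=sys.stderr):
    """Yield each item, advancing a spinner; no total is needed."""
    frames = itertools.cycle("|/-\\")
    for item in iterable:
        stream.write("\r" + next(frames))
        stream.flush()
        yield item
    stream.write("\r \n")  # overwrite the last spinner frame
```

Usage would look like `for f in with_bar(fields): ...` when the number of steps is known, and `with_spinner(...)` when it is not.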

There are various such options for Python (there's a good summary here) but I think tqdm seems the best choice at the present time for features, low overhead, support available etc.

davidhassell commented 3 years ago

Hi @sadielbartholomew. Thank you for raising this issue.

I'm not really up on how the progress bars are calculated, but the dask progress bar appears to show the number of tasks (in the graph) completed by the scheduler. As far as I can gather, tqdm only keeps track of progress through a top-level iterable.

Perhaps a double approach could be good: dask's ProgressBar for computations (when we have migrated to dask!) and tqdm for cf-python iterables (as found in cf.aggregate, for example).

What do you think?
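For reference, a minimal sketch of the two mechanisms side by side, assuming both `tqdm` and `dask` are installed. The function names and the toy squaring workload are made up for illustration and have nothing to do with how cf.aggregate is actually implemented. The imports sit inside the functions only so that each half can be tried without the other library.

```python
def squares_with_tqdm(values):
    """tqdm: progress is one tick per item of a top-level iterable."""
    from tqdm import tqdm

    return [v * v for v in tqdm(values, desc="iterating")]


def squares_with_dask(values):
    """dask: progress is the number of graph tasks the scheduler has completed."""
    import dask
    from dask.diagnostics import ProgressBar

    # Build a lazy task graph with one task per value; nothing runs yet
    tasks = [dask.delayed(pow)(v, 2) for v in values]

    # ProgressBar reports on tasks as the local scheduler finishes them
    with ProgressBar():
        return list(dask.compute(*tasks))
```

Note that `dask.diagnostics.ProgressBar` works with dask's local schedulers; the distributed scheduler has its own separate progress function.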

sadielbartholomew commented 3 years ago

Thanks @davidhassell. You make very good points, and they reminded me that the parallelism introduced by Dask will mean that progress isn't as simple to deduce as it is for sequential iterations or similar. Overall I think this Issue should definitely be addressed only after #182, not just because this is 'bells and whistles' work rather than core work, but because the under-the-hood logic behind the operations will change, and the whole concept of progress with it.

> I'm not really up on how the progress bars are calculated, but the dask progress bar appears to show the number of tasks (in the graph) completed by the scheduler. As far as I can gather, tqdm only keeps track of progress through a top-level iterable.

Indeed, it may be that we need a progress bar that is aware of the Dask task graph, the means of computation, etc. I'll bear it in mind whilst working on #182, but I suspect you are right that the Dask progress bar itself might be the only possibility.

> Perhaps a double approach could be good: dask's ProgressBar for computations (when we have migrated to dask!) and tqdm for cf-python iterables (as found in cf.aggregate, for example). What do you think?

That sounds like a good tentative plan. Ideally, for simplicity and to avoid adding another dependency, we could stick to just one type, which would seemingly be the Dask one, but that might not be possible if there is a complicated mix of cf-only logic and logic deferred to Dask. Let's see what seems sensible once the LAMA to Dask migration is winding up.