chanzuckerberg / cellxgene-census

CZ CELLxGENE Discover Census
https://chanzuckerberg.github.io/cellxgene-census/
MIT License

Communicate download progress to Census API users #1033

Closed hthomas-czi closed 3 months ago

hthomas-czi commented 6 months ago

Visibility of system status is an important usability heuristic. Users want to know how long a data fetching call will take and how much progress is being made.

<Start Pablo's edit>

Applies to:

SOMA.DataFrame.read().concat()
SOMA.SparseNDArray.read().concat()
cellxgene_census.download_source_h5ad()

<End Pablo's edit>

User Quote

I’d like to see progress bars to understand how long to wait for download to finish

pablo-gar commented 5 months ago

@ebezzi, for the SOMA-related task, let's ask TileDB whether it is possible to implement anything for this.

ivirshup commented 3 months ago

This is very straightforward for cellxgene_census.download_source_h5ad. Basically just:

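import s3fs
from fsspec.callbacks import TqdmCallback

...

    # anon=True makes unauthenticated requests.
    fs = s3fs.S3FileSystem(
        anon=True,
        cache_regions=True,
    )
    # TqdmCallback renders a tqdm progress bar while the file is fetched.
    fs.get_file(
        locator["uri"],
        to_path,
        callback=TqdmCallback(),
    )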

It sounds like using this for other things is more of a SOMA issue.

prathapsridharan commented 3 months ago

@ivirshup @pablo-gar @ebezzi @hthomas-czi - Regarding progress for the other APIs: is supplying a way of getting progress for programmatic APIs a common expectation?

download_* seems straightforward enough but I can't recall expecting a progress indicator when programmatically invoking APIs on a terminal or script.

Would be good to have examples from other domains where programmatic usage of an API also has a mechanism to communicate progress of work.

ivirshup commented 3 months ago

> download_* seems straightforward enough but I can't recall expecting a progress indicator when programmatically invoking APIs on a terminal or script.

I think giving the option to report progress (sometimes by default) is pretty common; wget does this, for instance. dask also has a number of tools for reporting the progress of ongoing tasks. A lot of bioinformatics tools (especially older ones) also make this available.

It's nice when it's done via stderr or some alternative stream.
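
A minimal sketch of that last point, using only the standard library. The function name `process` and its `progress_stream` parameter are made up for illustration and are not part of any Census or SOMA API. Progress lines go to stderr, so anything a script prints to stdout can still be piped or redirected cleanly.

```python
import sys


def process(items, progress_stream=sys.stderr):
    """Double each item, reporting progress on a separate stream.

    Hypothetical example function; not part of cellxgene_census.
    """
    results = []
    total = len(items)
    for i, item in enumerate(items, start=1):
        results.append(item * 2)
        # Progress goes to progress_stream (stderr by default), never stdout.
        print(f"processed {i}/{total}", file=progress_stream)
    return results


if __name__ == "__main__":
    # Only the results reach stdout, so `python script.py > out.txt`
    # captures the data without the progress messages.
    for value in process([1, 2, 3]):
        print(value)
```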

pablo-gar commented 3 months ago

I think something simple that would satisfy the request for now, for SOMA.DataFrame.read().concat() and SOMA.SparseNDArray.read().concat(), is:

Adding progress reporting (either via a progress bar or stderr/stdout dumps) in each iteration of the _arrow_table_reader.

@ivirshup what do you think? We can create a mini proposal for that change and discuss with TileDB folks.
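
As a rough illustration of per-iteration reporting, here is a standard-library sketch. It does not use tiledbsoma or pyarrow: plain lists stand in for the Arrow tables, and `with_progress`, `report_to_stderr`, and `concat` are hypothetical names, not existing SOMA functions. It also assumes the total size is not known up front, so it reports cumulative counts rather than a percentage.

```python
import sys
from typing import Callable, Iterable, Iterator, List, Sequence


def report_to_stderr(chunks_read: int, rows_read: int) -> None:
    """Default reporter: one status line per chunk, written to stderr."""
    print(f"read {chunks_read} chunk(s), {rows_read} row(s)", file=sys.stderr)


def with_progress(
    chunks: Iterable[Sequence],
    report: Callable[[int, int], None] = report_to_stderr,
) -> Iterator[Sequence]:
    """Yield each chunk unchanged, calling `report` once per iteration."""
    rows_read = 0
    for chunks_read, chunk in enumerate(chunks, start=1):
        rows_read += len(chunk)
        report(chunks_read, rows_read)
        yield chunk


def concat(chunks: Iterable[Sequence]) -> List:
    """Stand-in for .concat(): flatten all chunks into a single list."""
    out: List = []
    for chunk in chunks:
        out.extend(chunk)
    return out


if __name__ == "__main__":
    # Three fake "tables" of different sizes.
    print(concat(with_progress([[1, 2], [3], [4, 5, 6]])))
```

Because the wrapper is a generator, the reporting happens lazily as the consumer pulls chunks, which is the same point in the loop where a reader iteration would complete.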

ivirshup commented 3 months ago

> Adding progress reporting (either via a progress bar or stderr/stdout dumps) in each iteration of the _arrow_table_reader.

I think that would work.

I would also really like to get some dask integration into tiledb-soma, which I think would also provide this ability.

ivirshup commented 3 months ago

@pablo-gar are you happy with the solution for download_source_h5ad in #1153?

To me, the main question left there is: do we want a way to turn this off?

pablo-gar commented 3 months ago

Yes we should give the user the ability to turn it off.

ivirshup commented 3 months ago

Cool.

I'll add a progress_bar: bool = True argument to the function?

pablo-gar commented 3 months ago

That works!
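
For readers following along, a sketch of what an opt-out flag of this shape can look like. This is not the code from the PR: it copies a local file with the standard library instead of downloading through fsspec, and `copy_file` and `chunk_size` are hypothetical names. Only the `progress_bar: bool = True` parameter mirrors the proposal above.

```python
import os
import sys


def copy_file(
    src: str,
    dst: str,
    *,
    progress_bar: bool = True,
    chunk_size: int = 1024 * 1024,
) -> int:
    """Copy src to dst in chunks and return the number of bytes copied.

    Hypothetical stand-in for a download function. Progress is shown on
    stderr by default and is suppressed with progress_bar=False.
    """
    total = os.path.getsize(src)
    copied = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)
            copied += len(chunk)
            if progress_bar:
                # "\r" rewrites the same terminal line on each update.
                print(f"\r{copied}/{total} bytes", end="", file=sys.stderr)
    if progress_bar:
        print(file=sys.stderr)  # terminate the progress line
    return copied
```

Defaulting the flag to `True` keeps interactive use informative, while scripts and tests can pass `progress_bar=False` to keep their logs quiet.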

ivirshup commented 3 months ago

The PR is now ready for review. I also added units to the progress bar.

ivirshup commented 3 months ago

Fixed by #1153