Blosc / community

General discussions on present and future of Blosc projects

CZI Essential OSS Cycle 3 #1

Open FrancescAlted opened 4 years ago

FrancescAlted commented 4 years ago

We are planning to apply to CZI Essential OSS Cycle 3 via NumFOCUS. This issue is meant as a discussion tool with the community.

We have a tight deadline (Aug 4th), but thanks to a good handful of people, the goals are pretty well defined already:

  1. Create a codec specific for n-dim data in Blosc2
  2. Create interfaces for the existing Python ecosystem
  3. Approach existing and new communities
  4. Support new biomedical applications

We still need to isolate the tasks and create a budget. After that, we will send the application to NumFOCUS for their review. With your collaboration and a bit of luck, I think we can manage to apply on time.

@Blosc/core-devs @scopatz

FrancescAlted commented 4 years ago

For goal 1, creating a codec specific for n-dim data, @oscargm98 already started exploratory work at https://github.com/oscargm98/c-blosc2/tree/Blosc-ndlz. Hopefully this could be consolidated next year.

FrancescAlted commented 4 years ago

For goal 2, we are hoping that our work on Blosc2/Caterva will benefit the Zarr community (@zarr-developers/core-devs), but we would also like to facilitate plugins for xarray (@shoyer), dask (@mrocklin) and napari (@jni). In particular, we hope that the new backend system for xarray will be finished soon so that we can leverage it. For its part, napari seems to have a nice plugin system already in place, and I think that providing the necessary interfaces to dask would be fine too.

FrancescAlted commented 4 years ago

Goal 3 will require quite a lot of work on docs and community interaction. @albertosm27 has already done a good job of setting up a nice initial site for Blosc-related docs, but documentation remains scattered across many different places. We also need more work on tutorials to help newcomers easily grasp the basics of Blosc2/Caterva. Finally, API/format safety issues are important here, and even though @nmoinvaz is doing a good job, we still need quite a bit more work in this area.

FrancescAlted commented 4 years ago

Regarding goal 4, biomedical applications are important for CZI, and I am happy that we have on board Brent Pedersen (@brentp) and Josh Moore (@joshmoore), who are strong in the fields of genomics and microscopy applications, to guide us on the requirements in these fields and make our software more useful for them.

FrancescAlted commented 4 years ago

Maybe a bit late, but @kif would be interested in this initiative too.

shoyer commented 4 years ago

For goal 2, we are hoping that our work on Blosc2/Caterva will benefit the Zarr community (@zarr-developers/core-devs),

This sounds very exciting! Ping @alimanfoo for zarr.

In particular, we hope that the new backend system for xarray will be finished soon so that we can leverage it.

To clarify: do you hope to implement something more like a new file format for storing xarray data on disk, or a new computation backend for working with xarray arrays in memory? We already have pretty good support for the latter via NumPy's __array_function__ interface. See xarray's roadmap for more elaboration on these ("flexible storage" vs "flexible arrays").

FrancescAlted commented 4 years ago

For goal 2, we are hoping that our work on Blosc2/Caterva will benefit the Zarr community (@zarr-developers/core-devs),

This sounds very exciting! Ping @alimanfoo for zarr.

To clarify: I expect Zarr to benefit mainly from the new features in Blosc2. Caterva is essentially a multidimensional container with its own format, so adopting it inside Zarr would mean breaking forward compatibility, and I am not sure this is a good thing. But it is up to the Zarr devs to decide whether they would like to adopt Caterva inside Zarr.

In particular, we hope that the new backend system for xarray will be finished soon so that we can leverage it.

To clarify: do you hope to implement something more like a new file format for storing xarray data on disk, or a new computation backend for working with xarray arrays in memory? We already have pretty good support for the latter via NumPy's __array_function__ interface. See xarray's roadmap for more elaboration on these ("flexible storage" vs "flexible arrays").

I was referring more to the former: adding a new file format for storing xarray data on disk. My understanding is that this process is a bit involved currently, and I am hoping that you are trying to make support for new storage backends easier.