EOEPCA / datacube-access

EOEPCA+ Datacube Access BB
Apache License 2.0

Data Cube Access - Requirements Analysis and Architectural Design - EOX Q3 - #3201 #16

Open Schpidi opened 2 months ago

jankovicgd commented 4 weeks ago

Following my research on Zarr, NetCDF, and Datacube access and Testbed-20, I have a few questions and thoughts to share. @jonas-eberle

It seems the community is leaning towards Zarr for cloud-optimized data, given the lack of standards around NetCDF in that area. Do you agree with this assessment?

To further explore Zarr's potential, I'd love to hear about any concrete use cases you and your users have encountered. This will help us determine whether we should prioritize potential Zarr implementations or continue focusing on COGs.

If Zarr is indeed a priority, could you suggest a specific requirement related to higher-dimension data, coverages, and the STAC API that we could tackle as a first step?

jonas-eberle commented 4 weeks ago

@jankovicgd Zarr will become very important once ESA distributes all of its products in the Zarr format, which is already on their roadmap. In addition, we have other datasets, such as MODIS and VIIRS, that are distributed in HDF or netCDF format.

Zarr is also an important format for data scientists who use distributed computing resources, where processing jobs need to write temporary or final data into the same dataset in parallel (possible with Zarr, but not with COG).

Still, many datasets are also provided in COG format (e.g., Landsat data or data publications from data scientists).
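The parallel-write point above can be illustrated with a minimal sketch. It does not use the zarr library; it only mimics the idea that a chunked store keeps each chunk under its own key (here, one file per chunk in a directory), so workers never write to the same object. The chunk size, chunk count, and function names are hypothetical, and a real Zarr store would also carry metadata and usually compression.

```python
import os
import tempfile
from array import array
from concurrent.futures import ThreadPoolExecutor

CHUNK = 4      # values per chunk (hypothetical)
N_CHUNKS = 8   # number of chunks in the 1-D "array" (hypothetical)


def write_chunk(store_dir, idx):
    """Write one chunk as its own file; no other worker touches this key."""
    values = array("d", (float(idx * CHUNK + i) for i in range(CHUNK)))
    with open(os.path.join(store_dir, str(idx)), "wb") as fh:
        fh.write(values.tobytes())
    return idx


def read_all(store_dir):
    """Reassemble the full array by reading the chunks in index order."""
    out = array("d")
    for idx in range(N_CHUNKS):
        with open(os.path.join(store_dir, str(idx)), "rb") as fh:
            out.frombytes(fh.read())
    return list(out)


store_dir = tempfile.mkdtemp()

# Each worker writes a different chunk, so no locking is needed.
with ThreadPoolExecutor(max_workers=4) as pool:
    written = sorted(pool.map(write_chunk, [store_dir] * N_CHUNKS, range(N_CHUNKS)))

data = read_all(store_dir)
print(len(written), "chunks written,", len(data), "values read back")
```

A COG, by contrast, is a single file with an internal tile layout, which is why concurrent writers cannot each append their own part independently.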

jankovicgd commented 3 weeks ago

Thank you for the information. To help us prioritize our efforts effectively, could you please provide a specific requirement related to Zarr and OGC API - Coverages? For example, we could focus on how to efficiently subset a Zarr array through a /coverage request with a specific bounding box, time range, and variable, returning the result as a zipped or tarred Zarr store.
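To make the subsetting example more concrete, here is a minimal sketch of the index arithmetic such a request would involve: mapping a bounding box and time range onto index slices, then working out which chunks would need to be read. The axes, chunk sizes, request values, and helper names are all hypothetical, and the coordinate axes are assumed to be sorted in ascending order (real latitude axes are often descending).

```python
from bisect import bisect_left, bisect_right
from datetime import date


def index_slice(coords, lo, hi):
    """Return the slice of ascending `coords` whose values fall in [lo, hi]."""
    return slice(bisect_left(coords, lo), bisect_right(coords, hi))


def touched_chunks(sl, chunk_size):
    """Return the chunk indices along one axis that overlap slice `sl`."""
    if sl.start >= sl.stop:
        return []
    return list(range(sl.start // chunk_size, (sl.stop - 1) // chunk_size + 1))


# Hypothetical coordinate axes of a (time, lat, lon) cube.
times = [date(2024, 1, d) for d in range(1, 31)]  # 30 daily steps
lats = [float(v) for v in range(40, 50)]          # 40.0 .. 49.0
lons = [float(v) for v in range(0, 20)]           # 0.0 .. 19.0

# Hypothetical request parameters.
bbox = (5.0, 42.0, 9.5, 45.0)                     # west, south, east, north
time_range = (date(2024, 1, 10), date(2024, 1, 12))

t_sl = index_slice(times, *time_range)
lat_sl = index_slice(lats, bbox[1], bbox[3])
lon_sl = index_slice(lons, bbox[0], bbox[2])

# Hypothetical chunk sizes per axis: (time=10, lat=5, lon=10).
chunks = {
    "time": touched_chunks(t_sl, 10),
    "lat": touched_chunks(lat_sl, 5),
    "lon": touched_chunks(lon_sl, 10),
}
print(t_sl, lat_sl, lon_sl, chunks)
```

Only the chunks listed per axis would have to be fetched from storage, which is the part that determines how efficient a server-side subset can be.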

Please also keep in mind the users who are going to use this. My research into cloud-optimized formats and current libraries suggests that users want to work with Zarr data directly rather than through an interface. Thinking a bit more broadly about the example above, requiring users to download the Zarr store and unpack it before they can work with the data may not be very user-friendly. The STAC API helps them find the Zarr data, but beyond that I fail to see the benefit of the Coverages interface.

If you also have a defined user journey, that could help shed some light.

Given our recent resource constraints, we want to focus on well-defined tasks to ensure efficient progress.