OSOceanAcoustics / echopype

Enabling interoperability and scalability in ocean sonar data analysis
https://echopype.readthedocs.io/
Apache License 2.0
95 stars 73 forks source link

Consolidate Zarr metadata after conversion #233

Closed ngkavin closed 1 year ago

ngkavin commented 3 years ago

Consolidating metadata will greatly increase the speed of reading array metadata, as described here: https://zarr.readthedocs.io/en/stable/tutorial.html#consolidating-metadata.

It looks like the metadata can be consolidated with zarr.consolidate_metadata(store) after conversion.

lsetiawan commented 3 years ago

If it's an xarray Dataset object, all you need is Dataset.to_zarr(consolidated=True)

ngkavin commented 3 years ago

Do you know if this something that should be done for every group or just the last one?

lsetiawan commented 3 years ago

Should be all the groups I think. @emiliom @leewujung , Do you guys have any thoughts on this?

emiliom commented 3 years ago

Only groups that are chunked. Is any group besides Beam chunked?

ngkavin commented 3 years ago

Groups with a time or range_bin dimension are chunked. For EK80 this also includes the platform group.

From testing, the '.zmetadata' KeyError is raised when setting consolidated=True in xarray to_zarr() if the .zmetadata does not exist and we are appending to an existing zarr file. .zmetadata needs to be created by calling zarr.consolidate_metadata after creating the zarr file in the top_level group because xarray is not being used to create zarr files, only appending to them.

ngkavin commented 3 years ago

So it might be best to simply call zarr.consolidate_metadata at the very end of conversion. Otherwise we would need to call it at the beginning as well as set consolidated=True each time we save with xarray.

leewujung commented 3 years ago

In the latest conversation we discussed:

We still need to:

leewujung commented 1 year ago

@lsetiawan @b-reyes : this doesn’t necessarily need to be in 0.6.3, but since we’re dealing with zarr data in this release so much, does it make sense for us to address this here?

b-reyes commented 1 year ago

@lsetiawan @b-reyes : this doesn’t necessarily need to be in 0.6.3, but since we’re dealing with zarr data in this release so much, does it make sense for us to address this here?

For xr.Dataset.to_zarr, I believe the default value for consolidated=None already consolidates the metadata (if it is possible). If this is the case, I think we do not need to worry about manually doing this. Maybe the functionality of to_zarr has changed since this issue was created? I can look more into this and see if my thinking it correct. Either way, I think we don't need to worry about this for 0.6.3 since I already correctly consolidate the metadata in combine_echodata.

lsetiawan commented 1 year ago

@b-reyes consolidated needs to be True for zar to consolidate. I think all you need is to make sure that every Dataset.to_zarr has the consolidated=True flag. Shouldn't be much to address for this upcoming release.

leewujung commented 1 year ago

From discussion: we will set it up that all .to_zarr/netcdf will default to cnosolidated=True, but users will be able to pass in kwargs to optionally overwrite the default operation.