OSOceanAcoustics / echopype

Enabling interoperability and scalability in ocean sonar data analysis
https://echopype.readthedocs.io/
Apache License 2.0

Combined EchoData `.to_zarr` chunking error #1194

Closed: ctuguinay closed this issue 10 months ago

ctuguinay commented 10 months ago

General description of problem

When I ran `to_zarr` on a combined `EchoData` object, it raised a chunking error. Setting `compress=False` in `to_zarr` allowed the save to succeed.

Computing environment

Minimum example

The following code reproduces the error I encountered:

import echopype as ep

# Open previously converted EchoData files (ed_filenames assumed defined)
ed_list = []
for ed_filename in ed_filenames:
    ed_list.append(ep.open_converted(ed_filename))
...
combined_ed = ep.combine_echodata(ed_list)
combined_ed.to_zarr("test.zarr", compress=True)  # raises ValueError
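
For reference, the workaround mentioned above, writing the same combined object successfully:

# Workaround from the description: disabling compression avoids
# the failing encoding path, at the cost of a likely larger store.
combined_ed.to_zarr("test_nocompress.zarr", compress=False)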

Error message printouts

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/home/ec2-user/hake-labels/test.ipynb Cell 14 line 1
----> 1 combined_ed.to_zarr("test.zarr", compress=True)

File ~/mambaforge/envs/hake_labels/lib/python3.9/site-packages/echopype/echodata/echodata.py:677, in EchoData.to_zarr(self, save_path, compress, overwrite, parallel, output_storage_options, consolidated, **kwargs)
    652 """Save content of EchoData to zarr.
    653 
    654 Parameters
   (...)
    673     xarray's documentation for a list of all possible arguments.
    674 """
    675 from ..convert.api import to_file
--> 677 return to_file(
    678     echodata=self,
    679     engine="zarr",
    680     save_path=save_path,
    681     compress=compress,
    682     overwrite=overwrite,
    683     parallel=parallel,
    684     output_storage_options=output_storage_options,
    685     consolidated=consolidated,
    686     **kwargs,
    687 )

File ~/mambaforge/envs/hake_labels/lib/python3.9/site-packages/echopype/convert/api.py:90, in to_file(echodata, engine, save_path, compress, overwrite, parallel, output_storage_options, **kwargs)
     88     else:
     89         logger.info(f"saving {output_file}")
---> 90     _save_groups_to_file(
     91         echodata,
     92         output_path=io.sanitize_file_path(
     93             file_path=output_file, storage_options=output_storage_options
     94         ),
     95         engine=engine,
     96         compress=compress,
     97         **kwargs,
     98     )
    100 # Link path to saved file with attribute as if from open_converted
    101 echodata.converted_raw_path = output_file

File ~/mambaforge/envs/hake_labels/lib/python3.9/site-packages/echopype/convert/api.py:120, in _save_groups_to_file(echodata, output_path, engine, compress, **kwargs)
    110 io.save_file(
    111     echodata["Top-level"],
    112     path=output_path,
   (...)
    116     **kwargs,
    117 )
    119 # Environment group
--> 120 io.save_file(
    121     echodata["Environment"],  # TODO: chunking necessary?
    122     path=output_path,
    123     mode="a",
    124     engine=engine,
    125     group="Environment",
    126     compression_settings=COMPRESSION_SETTINGS[engine] if compress else None,
    127     **kwargs,
    128 )
    130 # Platform group
    131 io.save_file(
    132     echodata["Platform"],  # TODO: chunking necessary? time1 and time2 (EK80) only
    133     path=output_path,
   (...)
    138     **kwargs,
    139 )

File ~/mambaforge/envs/hake_labels/lib/python3.9/site-packages/echopype/utils/io.py:65, in save_file(ds, path, mode, engine, group, compression_settings, **kwargs)
     63     for var, enc in encoding.items():
     64         ds[var] = ds[var].chunk(enc.get("chunks", {}))
---> 65     ds.to_zarr(store=path, mode=mode, group=group, encoding=encoding, **kwargs)
     66 else:
     67     raise ValueError(f"{engine} is not a supported save format")

File ~/mambaforge/envs/hake_labels/lib/python3.9/site-packages/xarray/core/dataset.py:2474, in Dataset.to_zarr(self, store, chunk_store, mode, synchronizer, group, encoding, compute, consolidated, append_dim, region, safe_chunks, storage_options, zarr_version, write_empty_chunks, chunkmanager_store_kwargs)
   2342 """Write dataset contents to a zarr group.
   2343 
   2344 Zarr chunks are determined in the following way:
   (...)
   2470     The I/O user guide, with more details and examples.
   2471 """
   2472 from xarray.backends.api import to_zarr
-> 2474 return to_zarr(  # type: ignore[call-overload,misc]
   2475     self,
   2476     store=store,
   2477     chunk_store=chunk_store,
   2478     storage_options=storage_options,
   2479     mode=mode,
   2480     synchronizer=synchronizer,
   2481     group=group,
   2482     encoding=encoding,
   2483     compute=compute,
   2484     consolidated=consolidated,
   2485     append_dim=append_dim,
   2486     region=region,
   2487     safe_chunks=safe_chunks,
   2488     zarr_version=zarr_version,
...
    273         )
    274 else:
    275     for k in list(encoding):

ValueError: unexpected encoding parameters for zarr backend:  ['preferred_chunks']
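
A note on the error itself: xarray's zarr backend records each variable's on-disk chunking as a `preferred_chunks` entry in `.encoding` when a store is opened, and `Dataset.to_zarr` rejects any encoding key it does not recognize, which is what fails here. Until this is fixed, a manual workaround (the "popping of encodings" mentioned in the last comment below) is to strip that key before saving. A minimal sketch, assuming `combined_ed[group]` returns each group's Dataset and that its variables' encoding dicts can be mutated in place:

# Hypothetical manual workaround: drop the backend-injected
# "preferred_chunks" key so Dataset.to_zarr accepts the encoding.
for group in ["Environment", "Platform"]:  # other groups may need the same
    ds = combined_ed[group]
    for var in ds.variables.values():
        var.encoding.pop("preferred_chunks", None)
combined_ed.to_zarr("test.zarr", compress=True, overwrite=True)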

Provide an example file

The EchoData files that were combined were produced from the following raw files, found in https://ncei-wcsd-archive.s3-us-west-2.amazonaws.com/data/raw/Bell_M._Shimada/SH1305/EK60/:

SaKe_2013-D20130625-T034532.raw
SaKe_2013-D20130625-T035006.raw
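
For completeness, a sketch of how converted EchoData files can be produced from those raw files (the local filenames and the EK60 sonar_model are inferred from the archive path; treat this as an assumption, not the exact commands used):

import echopype as ep

raw_files = [
    "SaKe_2013-D20130625-T034532.raw",
    "SaKe_2013-D20130625-T035006.raw",
]
ed_filenames = []
for raw_file in raw_files:
    ed = ep.open_raw(raw_file, sonar_model="EK60")  # EK60 per the archive path
    zarr_path = raw_file.replace(".raw", ".zarr")
    ed.to_zarr(zarr_path, overwrite=True)
    ed_filenames.append(zarr_path)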

emiliom commented 10 months ago

Thanks for creating this issue.

I don't have a solution, but I did a little digging and wanted to document that in the echopype code base "preferred_chunks" occurs only once, in a variable assignment in utils/coding.py (it also appears in a test, test_echodata_combine.py, where it's skipped):

PREFERRED_CHUNKS = "preferred_chunks"

But the variable PREFERRED_CHUNKS is never used anywhere after that assignment, which is odd.
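
For anyone digging further, the key itself originates in xarray rather than echopype: the zarr backend attaches a `preferred_chunks` entry to each variable's `.encoding` when a store is opened, and it can be inspected directly. A minimal sketch against one of the converted files (the filename is an assumption):

import xarray as xr

# "preferred_chunks" is added to encoding by xarray's zarr backend on
# open; passing it back into Dataset.to_zarr triggers the ValueError.
ds = xr.open_zarr("SaKe_2013-D20130625-T034532.zarr", group="Environment")
for name, var in ds.variables.items():
    print(name, var.encoding.get("preferred_chunks"))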

lsetiawan commented 10 months ago

This should be solved by #1128, which has been merged into dev. @ctuguinay, could you please confirm and report back here? Thanks!
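
(For anyone who wants to verify before a release, the dev branch can be installed in the usual way, e.g. `pip install git+https://github.com/OSOceanAcoustics/echopype.git@dev`.)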

ctuguinay commented 10 months ago

@lsetiawan Yup, this has been solved! The compute-regrid script ran on the 2011-2015 data without any external popping of encodings or setting compress to False.