OSOceanAcoustics / echopype

Enabling interoperability and scalability in ocean sonar data analysis
https://echopype.readthedocs.io/
Apache License 2.0
94 stars 73 forks source link

Replace the current `parsed2zarr` mechanism #1179

Closed leewujung closed 9 months ago

leewujung commented 11 months ago

This issue is to pick up from #1070 to replace the current parsed2zarr mechanism to ensure code is bug-free and easy to maintain in the future.

I am referencing my https://github.com/OSOceanAcoustics/echopype/pull/1070#issuecomment-1730079171 here:

The reason why i think the current complicated approach is unnecessary is because there are 2 stages of padding that is happening:

  1. the "padding" for each channel to become rectangular (so to pad the shorter pings so that all pings within a channel have the same length) is in pad_shorter_ping. That is the first NaN-padding expansion.
  2. The second NaN-padding expansion is when some channels lack a certain type of data, so NaN is padded for the entire "slice" of that channel to match the size of the other channels.

Expansion 1 happens for all data without xarray/zarr/dask (which can potentially be sped up or make more efficient?). After that, if we take note of the shape of variables in self.ping_data_dict["angle"][ch_id] for each ch_id , we would know exactly whether or not expansion 2 is needed. There, we can decide what is the strategy to use to write to the zarr store -- can we append sequentially along segments of ping_time since we know all the pings are same length? and for the channels that lack any data at all, creating a NaN slice for each sections (along ping_time segments) would likely let us hover under the memory limit.

@lsetiawan and I have done some investigation and determined a clean path forward.

leewujung commented 11 months ago

Notes while going through code:

Here I use the file test_data/ek60/L0003-D20040909-T161906-EK60.raw. For more info about this file, see my previous investigation here and here.

I "short-circuited" _append_channel_ping_data to force saving all parsed data to under parser.ping_data_dict (as recommended below), just so that I can see how things look easily, and create a mock data generator. This quick patch type of change is here (I think we should completely remove zarr_vars and associated stuff): https://github.com/leewujung/echopype/commit/e2a002996f3dafebf6d668e9b16df853330b5345

lsetiawan commented 9 months ago

This Issue is related to #966