OSOceanAcoustics / echopype

Enabling interoperability and scalability in ocean sonar data analysis
https://echopype.readthedocs.io/
Apache License 2.0

Enhance `ep.calibrate.compute_Sv` scalability by enabling EchoData chunking and removing eager computation #1331

Closed · ctuguinay closed this 3 months ago

ctuguinay commented 3 months ago

Addresses #1329

codecov-commenter commented 3 months ago

Codecov Report

Attention: Patch coverage is 96.15385% with 1 line in your changes missing coverage. Please review.

Project coverage is 75.60%. Comparing base (9f56124) to head (1828b68). Report is 117 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| echopype/echodata/echodata.py | 92.30% | 1 Missing :warning: |

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main    #1331      +/-   ##
==========================================
- Coverage   83.52%   75.60%   -7.92%
==========================================
  Files          64       22      -42
  Lines        5686     1746    -3940
==========================================
- Hits         4749     1320    -3429
+ Misses        937      426     -511
```

| Flag | Coverage Δ | |
|---|---|---|
| [unittests](https://app.codecov.io/gh/OSOceanAcoustics/echopype/pull/1331/flags?src=pr&el=flag) | `75.60% <96.15%> (-7.92%)` | :arrow_down: |

Flags with carried forward coverage won't be shown.


ctuguinay commented 3 months ago

Example `compute_Sv` Gist with OOI data much larger than what was tested in #1212: https://gist.github.com/ctuguinay/6195d318037f80eb60d9e9cb5dd7aa77

ctuguinay commented 3 months ago

Also, some remove-background-noise tests were failing, but that was out of scope for this PR, so I left it as an issue in #1332.

ctuguinay commented 3 months ago

TODO: The `use_swap` implementation in `open_raw` does something different with the Zarr store, but I didn't fully understand it, so I stuck with what worked in my notebook. I need to investigate two things: what happens when I save an Xarray Dataset object automatically to the Zarr store from `temp_zarr_store`, and what the original `use_swap` in `open_raw` is doing when it explicitly saves to the Zarr root of the store from `temp_zarr_store`.

EDIT:

ctuguinay commented 3 months ago

Tested new functionality with a 2 million ping array (1 month of 2017 OOI EK60 data): https://gist.github.com/ctuguinay/8dbf6b89d9a58adbdefc90a63c112cb6

ctuguinay commented 3 months ago

@leewujung This should be ready to review now

ctuguinay commented 3 months ago

Oh, one more slight TODO: in the new test, check the `use_swap` and `chunk_dict` `ds["Sv"]` values against Echoview values.

EDIT: Added this to both tests that check against Echoview and Matlab.

ctuguinay commented 3 months ago

Investigate zarr.sync.ThreadSynchronizer(). Maybe also add some kwarg for storage options.

Edit: Added this

ctuguinay commented 3 months ago

Wouldn't this be equivalent to below?

Oh, are you thinking I should move the `to_zarr` call to the API wrapper functions (`compute_Sv` and `compute_TS`)? I'm not quite sure I understand the question.

And also, does the code below imply the creation of this temporary Zarr store?

```python
cal_ds = _compute_cal(...)
cal_ds.to_zarr(final_zarr_store)
```

ctuguinay commented 3 months ago

My question was about why, even without chunking, using `use_swap=True` would automatically make the number of dask graph layers equal to 2.

The order of events would go like this (in the case where `cal_object` is not chunked but `use_swap=True`):

  1. cal_object is computed and is in memory, and is not a dask array and thus has no dask graph layers
  2. cal_object is sent to Zarr Store
  3. cal_object is lazy-loaded back into memory with no actual computation layers attached (other than the initial 2 Zarr-store reading layers), since it had no real computation layers to begin with before being sent to the Zarr store
ctuguinay commented 3 months ago

TODOs (from conversation with @leewujung today):

leewujung commented 3 months ago

@ctuguinay :

ctuguinay commented 3 months ago

@leewujung I agree with the above 👍

ctuguinay commented 3 months ago

@leewujung Thanks for the review! This should be ready for review again. The main thing I did was implement the backscatter nbytes calculation based on sonar model, encode mode, waveform mode, and EchoData beam group.
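The shape of such a size estimate can be sketched with a hypothetical helper (this is an illustration, not the echopype implementation; the function name and parameters are assumptions):

```python
import numpy as np

# Hypothetical helper: estimate the in-memory size of the backscatter
# array from the expected dimension lengths, doubling the itemsize when
# samples are complex (e.g. broadband encode mode).
def estimate_backscatter_nbytes(
    n_channel, n_ping, n_range_sample, complex_samples=False
):
    dtype = np.complex128 if complex_samples else np.float64
    itemsize = np.dtype(dtype).itemsize
    return n_channel * n_ping * n_range_sample * itemsize
```

A result above some threshold would then be the signal to enable the swap-to-Zarr path rather than computing eagerly in memory.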

ctuguinay commented 3 months ago

@leewujung Applied your suggestion to extract the encode and waveform modes from the calibration object.