OSOceanAcoustics / echopype

Enabling interoperability and scalability in ocean sonar data analysis
https://echopype.readthedocs.io/
Apache License 2.0

Enhance `ep.calibrate.compute_Sv` scalability by enabling EchoData chunking and removing eager computation #1331

Closed · ctuguinay closed this 3 months ago

ctuguinay commented 3 months ago

Addresses #1329

codecov-commenter commented 3 months ago

Codecov Report

Attention: Patch coverage is 96.15385% with 1 line in your changes missing coverage. Please review.

Project coverage is 75.60%. Comparing base (9f56124) to head (1828b68). Report is 117 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| echopype/echodata/echodata.py | 92.30% | 1 Missing :warning: |

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main    #1331      +/-   ##
==========================================
- Coverage   83.52%   75.60%   -7.92%
==========================================
  Files          64       22      -42
  Lines        5686     1746    -3940
==========================================
- Hits         4749     1320    -3429
+ Misses        937      426     -511
```

| Flag | Coverage Δ | |
|---|---|---|
| [unittests](https://app.codecov.io/gh/OSOceanAcoustics/echopype/pull/1331/flags?src=pr&el=flag) | `75.60% <96.15%> (-7.92%)` | :arrow_down: |

Flags with carried forward coverage won't be shown.


ctuguinay commented 3 months ago

Example `compute_Sv` Gist with OOI data much larger than what was tested in #1212: https://gist.github.com/ctuguinay/6195d318037f80eb60d9e9cb5dd7aa77

ctuguinay commented 3 months ago

Also, some remove-background-noise tests were failing, but that was out of scope for this PR, so I left it as an issue in #1332.

ctuguinay commented 3 months ago

TODO: The `use_swap` implementation in `open_raw` does something different with the Zarr store, but I didn't fully understand it, so I stuck with what worked in my notebook. I need to investigate two things: what happens when I save an Xarray Dataset object automatically to the Zarr store from `temp_zarr_store`, and what the original `use_swap` in `open_raw` is doing when it explicitly saves to the Zarr root of the store from `temp_zarr_store`.

EDIT:

ctuguinay commented 3 months ago

Tested new functionality with a 2 million ping array (1 month of 2017 OOI EK60 data): https://gist.github.com/ctuguinay/8dbf6b89d9a58adbdefc90a63c112cb6

ctuguinay commented 3 months ago

@leewujung This should be ready to review now

ctuguinay commented 3 months ago

Oh, one more slight TODO: in the new test, check the `use_swap` and `chunk_dict` `ds["Sv"]` values against Echoview values.

EDIT: Added this to both tests that check against Echoview and Matlab.

ctuguinay commented 3 months ago

Investigate zarr.sync.ThreadSynchronizer(). Maybe also add some kwarg for storage options.

Edit: Added this

ctuguinay commented 3 months ago

Wouldn't this be equivalent to below?

Oh, are you thinking I should move the `to_zarr` call to the API wrapper functions (`compute_Sv` and `compute_TS`)? I'm not quite sure I understand the question.

And also, does the code below imply the creation of this temporary Zarr store?

```python
cal_ds = _compute_cal(...)
cal_ds.to_zarr(final_zarr_store)
```

ctuguinay commented 3 months ago

My question was about why, even without chunking, using `use_swap=True` would automatically make the number of dask graph layers equal to 2.

The order of events would go like this (in the case where `cal_object` is not chunked but `use_swap=True`):

  1. cal_object is computed and is in memory, and is not a dask array and thus has no dask graph layers
  2. cal_object is sent to Zarr Store
  3. cal_object is lazy-loaded back into memory with no actual computation layers attached (other than the initial 2 Zarr-store reading layers), since it had no real computation layers to begin with before being sent to the Zarr store
ctuguinay commented 3 months ago

TODOs (from conversation with @leewujung today):

leewujung commented 3 months ago

@ctuguinay :

ctuguinay commented 3 months ago

@leewujung I agree with the above 👍

ctuguinay commented 3 months ago

@leewujung Thanks for the review! This should be ready for review again. The main thing I did was implement the backscatter nbytes calculation based on sonar model, encode mode, waveform mode, and EchoData beam group.
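The shape of such a size estimate can be sketched with a hypothetical helper (this is an illustration, not the echopype implementation; the function name and parameters are assumptions):

```python
import numpy as np

# Hypothetical helper: estimate the in-memory size of the backscatter
# array from the expected dimension lengths, doubling the itemsize when
# samples are complex (e.g. broadband encode mode).
def estimate_backscatter_nbytes(
    n_channel, n_ping, n_range_sample, complex_samples=False
):
    dtype = np.complex128 if complex_samples else np.float64
    itemsize = np.dtype(dtype).itemsize
    return n_channel * n_ping * n_range_sample * itemsize
```

A result above some threshold would then be the signal to enable the swap-to-Zarr path rather than computing eagerly in memory.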

ctuguinay commented 3 months ago

@leewujung Applied your suggestion to extract the encode and waveform modes from the calibration object.