OSOceanAcoustics / echopype

Enabling interoperability and scalability in ocean sonar data analysis
https://echopype.readthedocs.io/
Apache License 2.0

Investigate Non-Parallelized `ep.combine_echodata` #1333

Closed ctuguinay closed 2 weeks ago

ctuguinay commented 3 weeks ago

From the #1331 Gist that combined a month's worth of OOI data to test scaled `compute_Sv`, I noticed the following when running `ep.combine_echodata`:

All of this suggests that the computation may not have been parallelized in this case. I didn't spend much time looking into it, since it was outside the scope of #1331, so I could be completely wrong, but it is probably worth another look.
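
As a side note, one way to check whether the scheduler is actually running tasks concurrently is to wrap the computation in a Dask performance report and inspect the per-worker task streams. The sketch below is generic and assumes a `dask.distributed` cluster; it is not the exact diagnostic used in the Gist, and in practice the `ep.combine_echodata` call would go inside the context manager.

```python
# Minimal sketch: capture scheduler activity for any Dask-backed computation.
# A report whose task stream shows one long serial band on a single worker
# suggests the work was not parallelized.
import dask.array as da
from dask.distributed import Client, performance_report

client = Client()  # local cluster, for illustration only

with performance_report(filename="combine-report.html"):
    # Stand-in computation; replace with the combine step being investigated.
    result = da.random.random((20_000, 20_000), chunks=(2_000, 2_000)).mean().compute()
```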

ctuguinay commented 3 weeks ago

Code used (can also be found in the Gist):

# Combine Echodata Objects
from pathlib import Path

import echopype as ep
from dask.distributed import Client
from tqdm import tqdm

client = Client()  # dask.distributed client; local cluster here, the Gist sets up its own

# Submit one open_converted task per converted Zarr file so the (lazy) opens
# run on the cluster, then gather the EchoData objects and combine them.
ed_future_list = []
for converted_file in tqdm(sorted(Path('convertallfiles').glob("*.zarr"))):
    ed_future = client.submit(
        ep.open_converted,
        retries=2,
        converted_raw_path=converted_file,
        chunks={},
    )
    ed_future_list.append(ed_future)
ed_list = client.gather(ed_future_list)
ed_combined = ep.combine_echodata(ed_list)

ctuguinay commented 2 weeks ago

TODO: Close this issue after summarizing the conversation with @leewujung about the OOI `ep.combine_echodata` example.

ctuguinay commented 2 weeks ago

This was resolved in a conversation with Wu-Jung. The initial reason I thought this was not 'parallelized properly' was that the Dask Dashboard was reloading sets of tasks that were incredibly small, and this was done for each Echodata object:

(Dask Dashboard screenshot)

It turns out this was a set of tasks to lazily append a Zarr store to an existing Zarr store. This could not be done with all Echodata objects at the same time, since coordinating where to place the to-be-appended Zarr store arrays while continually appending to the Zarr store array of interest would be a complicated and most likely inaccurate procedure.
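
For reference, here is a minimal sketch of why such appends are inherently sequential. It uses plain xarray/Zarr with made-up datasets, not echopype's internal combine code: each dataset's position in the output store depends on the total length of everything appended before it, so the appends cannot all be issued independently.

```python
# Minimal sketch (not echopype's combine implementation): appending several
# datasets to one Zarr store along a shared dimension. The write offset of
# dataset k equals the summed lengths of datasets 0..k-1, so the appends
# have to happen one after another rather than all at once.
import numpy as np
import xarray as xr

# Hypothetical stand-ins for the per-file data held by each Echodata object
datasets = [
    xr.Dataset(
        {"Sv": (("ping_time",), np.random.rand(n))},
        coords={"ping_time": np.arange(n)},
    )
    for n in (5, 3, 7)
]

store = "combined_example.zarr"

# The first dataset initializes the store; each later one is appended along
# ping_time, which requires knowing the store's current length at that point.
datasets[0].to_zarr(store, mode="w")
for ds in datasets[1:]:
    ds.to_zarr(store, append_dim="ping_time")
```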

leewujung commented 2 weeks ago

Just to add to the above: