OSOceanAcoustics / echopype

Enabling interoperability and scalability in ocean sonar data analysis
https://echopype.readthedocs.io/
Apache License 2.0

Investigate Non-Parallelized `ep.combine_echodata` #1333

Closed ctuguinay closed 2 weeks ago

ctuguinay commented 3 weeks ago

From the #1331 Gist that combined a month's worth of OOI data to test scaled `compute_Sv`, I noticed the following when running `ep.combine_echodata`:

All of this suggests that the computation may not have been parallelized in this case. I didn't spend much time looking into it, since it was outside the scope of #1331, so I could be completely wrong, but it is probably worth another look.
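
As a side note, one way to check whether the scheduler is actually running tasks concurrently is to wrap the computation in a Dask performance report and inspect the per-worker task streams. The sketch below is generic and assumes a `dask.distributed` cluster; it is not the exact diagnostic used in the Gist, and in practice the `ep.combine_echodata` call would go inside the context manager.

```python
# Minimal sketch: capture scheduler activity for any Dask-backed computation.
# A report whose task stream shows one long serial band on a single worker
# suggests the work was not parallelized.
import dask.array as da
from dask.distributed import Client, performance_report

client = Client()  # local cluster, for illustration only

with performance_report(filename="combine-report.html"):
    # Stand-in computation; replace with the combine step being investigated.
    result = da.random.random((20_000, 20_000), chunks=(2_000, 2_000)).mean().compute()
```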

ctuguinay commented 3 weeks ago

Code used (can also be found in the Gist):

# Combine Echodata Objects
from pathlib import Path

import echopype as ep
from dask.distributed import Client
from tqdm import tqdm

client = Client()  # dask.distributed client; local cluster here, the Gist sets up its own

# Submit one open_converted task per converted Zarr file so the (lazy) opens
# run on the cluster, then gather the EchoData objects and combine them.
ed_future_list = []
for converted_file in tqdm(sorted(Path('convertallfiles').glob("*.zarr"))):
    ed_future = client.submit(
        ep.open_converted,
        retries=2,
        converted_raw_path=converted_file,
        chunks={},
    )
    ed_future_list.append(ed_future)
ed_list = client.gather(ed_future_list)
ed_combined = ep.combine_echodata(ed_list)

ctuguinay commented 2 weeks ago

TODO: Close this issue after summarizing the conversation with @leewujung about the OOI `ep.combine_echodata` example.

ctuguinay commented 2 weeks ago

This was resolved in a conversation with Wu-Jung. The initial reason I thought this was not 'parallelized properly' was that the Dask Dashboard was reloading sets of tasks that were incredibly small, and this was done for each Echodata object:

(Dask Dashboard screenshot)

It turns out this was a set of tasks to lazily append a Zarr store to an existing Zarr store. This could not be done with all Echodata objects at the same time, since coordinating where to place the to-be-appended Zarr store arrays while continually appending to the Zarr store array of interest would be a complicated and most likely inaccurate procedure.
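
For reference, here is a minimal sketch of why such appends are inherently sequential. It uses plain xarray/Zarr with made-up datasets, not echopype's internal combine code: each dataset's position in the output store depends on the total length of everything appended before it, so the appends cannot all be issued independently.

```python
# Minimal sketch (not echopype's combine implementation): appending several
# datasets to one Zarr store along a shared dimension. The write offset of
# dataset k equals the summed lengths of datasets 0..k-1, so the appends
# have to happen one after another rather than all at once.
import numpy as np
import xarray as xr

# Hypothetical stand-ins for the per-file data held by each Echodata object
datasets = [
    xr.Dataset(
        {"Sv": (("ping_time",), np.random.rand(n))},
        coords={"ping_time": np.arange(n)},
    )
    for n in (5, 3, 7)
]

store = "combined_example.zarr"

# The first dataset initializes the store; each later one is appended along
# ping_time, which requires knowing the store's current length at that point.
datasets[0].to_zarr(store, mode="w")
for ds in datasets[1:]:
    ds.to_zarr(store, append_dim="ping_time")
```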

leewujung commented 2 weeks ago

Just to add to the above: