OSOceanAcoustics / echopype

Enabling interoperability and scalability in ocean sonar data analysis
https://echopype.readthedocs.io/
Apache License 2.0

Adding an option to make EchoData from `open_raw` delay-friendly #408

Closed: leewujung closed this issue 2 years ago

leewujung commented 3 years ago

Currently, if we delay ed = open_raw(...), the returned EchoData object contains multiple delayed versions of the set_groups methods under the hood, because those operations organize the parsed data into xarray Datasets and assign them to the corresponding attributes of the EchoData object (e.g., ed.beam, ed.platform, etc.). This is not very useful when we want to handle lots of raw binary files together (for example, the large data file case in #407).

A potentially more useful delayed EchoData object would be one in which each group/attribute is a lazy-loaded xarray Dataset directly linked to a file/object on the local filesystem or in the cloud. We already have this available through ed = open_converted(...), when an EchoData object is created from already-converted netCDF or zarr files.

Can we add an option in open_raw that, under the hood, 1) parses the binary data, 2) saves the parsed data from memory to file, and 3) reads it back in using open_converted, so that we get an EchoData object that can be nicely delayed?
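
For reference, here is a minimal sketch of what such an option would do under the hood, expressed with the existing public API (the option name and internals are open for discussion; the file paths are illustrative):

```python
import echopype as ep

# 1) parse the raw binary data into memory
ed_in_memory = ep.open_raw("example.raw", sonar_model="EK60")

# 2) serialize the parsed data to a converted file (zarr here)
ed_in_memory.to_zarr("example.zarr")

# 3) read it back lazily, so each group is a lazy-loaded Dataset
#    backed by the zarr store
ed = ep.open_converted("example.zarr")
```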

This will allow us to easily parallelize the method that combines EchoData objects (#383).

Tagging @lsetiawan @imranmaj @emiliom for inputs.

emiliom commented 3 years ago

Yes. Since we moved to the current open_raw that only loads into memory, something like this has been missing. It looks like you've identified a nice way forward that leverages the new infrastructure of both open_raw and open_converted. +1

emiliom commented 3 years ago

One wrinkle to this functionality is that the nc/zarr files created directly through open_raw should ideally include any additional standardized metadata. Currently we have the platform metadata, but the list of such metadata may grow in the future.

We could add an optional metadata dictionary that is passed to open_raw and applied regardless of whether open_raw serializes the data or not.
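
Purely as an illustration of the idea, such an option might look roughly like the following; the `extra_platform_metadata` argument does not exist in echopype and is hypothetical, as are the example values:

```python
# Hypothetical: a metadata dict passed to open_raw and applied whether or
# not the parsed data are serialized to nc/zarr under the hood.
extra_platform_metadata = {
    "platform_name": "Ocean Starr",      # illustrative values only
    "platform_type": "Research vessel",
}

ed = ep.open_raw(
    "example.raw",
    sonar_model="EK60",
    extra_platform_metadata=extra_platform_metadata,  # hypothetical argument
)
```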

b-reyes commented 2 years ago

I will go ahead and assign this issue to myself. I am currently working on a framework that will make open_raw delay-friendly.

My approach as of right now is as follows (a rough sketch of steps 2-4 appears after the list):

  1. Check whether the expanded form of the data is too large for memory.
  2. If the expanded form is too big, write the large data variables to a temporary zarr store in chunks.
  3. Lazy-load the large data variables from the temporary zarr store into dask arrays.
  4. Create xarray DataArrays from the dask arrays and assign them to the appropriate groups within set_groups.
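
A minimal sketch of steps 2-4 under simplified assumptions (plain zarr/dask/xarray rather than echopype's actual internals; the variable name, shapes, and chunk sizes are made up):

```python
import tempfile

import dask.array as da
import numpy as np
import xarray as xr
import zarr

# temporary zarr store for the large variable
store_path = tempfile.mkdtemp() + "/backscatter.zarr"

# step 2: write the large variable to the zarr store chunk by chunk,
# so the full expanded array never sits in memory at once
z = zarr.open(store_path, mode="w", shape=(100_000, 1_000),
              chunks=(10_000, 1_000), dtype="float32")
for start in range(0, 100_000, 10_000):
    z[start:start + 10_000, :] = np.random.rand(10_000, 1_000).astype("float32")

# step 3: lazy-load the zarr store as a dask array (no data is read yet)
lazy_backscatter = da.from_zarr(store_path)

# step 4: wrap the dask array in an xarray DataArray, ready to be assigned
# to the appropriate group within set_groups
backscatter = xr.DataArray(lazy_backscatter,
                           dims=("ping_time", "range_sample"),
                           name="backscatter_r")
```
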
leewujung commented 2 years ago

@b-reyes: I am linking this to #489 because this set of approaches addresses the central memory-expansion problem that makes echopype impractical to use on large files, and solving it naturally requires the internals of open_raw to be delay-friendly, as we discussed in the last two meetings.

#489 also links to the two problematic files for your testing. Note that the expansion for the 89 MB file is particularly extreme.

b-reyes commented 2 years ago

In PR #774 we implemented a method that allows users to directly write large variables to a temporary zarr store. This method is currently only available for echosounders classified as EK60 or EK80. Since the PR only writes a subset of variables to zarr and only applies to EK60/EK80, I believe we have only partially addressed this issue. Should we modify this issue and close it, or keep it as is and move it to a different milestone?

leewujung commented 2 years ago

@b-reyes and I just talked about this: we think it should just work. @b-reyes will test it, and if it does, we'll close this issue.

b-reyes commented 2 years ago

I have looked into delaying open_raw and to_zarr for a collection of files. In particular, I used the 19 files from our "Watching a solar eclipse" example notebook.

When setting open_raw(offload_to_zarr=False) (i.e., NOT using the functionality introduced in PR #774), some runs saw a worker's memory spike to an unreasonable level and then raise an error; when that happened, one file would not be written to zarr.

When setting open_raw(offload_to_zarr=True) (i.e., using the functionality introduced in PR #774), memory consumption stayed at a steady level and I never encountered an error.

For future reference, here is a code snippet that implements the core of these runs. In this code, desired_raw_file_paths contains paths to my LOCAL copies of the 19 OOI files. I use Dask Futures instead of dask.delayed because I noticed a small reduction in runtime; memory consumption is similar with dask.delayed.

import os

import echopype as ep
from dask.distributed import Client

client = Client()  # local Dask cluster (see the setup described below)

def ed_to_zarr(path_to_file, s_model, offload, storage_path, compute):

    ed = ep.open_raw(raw_file=path_to_file, sonar_model=s_model,
                     offload_to_zarr=offload)

    # base the zarr file name off of the raw file name
    zarr_name = os.path.splitext(os.path.basename(path_to_file))[0] + '.zarr'

    ed.to_zarr(storage_path + zarr_name, compute=compute)

    # remove the temporary directory and the EchoData object
    # to ensure the memory is released immediately
    if offload:
        del ed.parsed2zarr_obj.temp_zarr_dir

    del ed

write_path = './OOI_zarrs_ep_ex/temp/'
direct_to_zarr = True
sonar_model = "ek60"
num = len(desired_raw_file_paths)

futures = client.map(ed_to_zarr, desired_raw_file_paths, [sonar_model]*num,
                     [direct_to_zarr]*num, [write_path]*num, [True]*num)

# block on the futures to make sure all of them complete
[f.result() for f in futures]

Additionally, I used Dask's MemorySampler to record the memory consumption of the above code. For these results, I was using echopype v0.6.2, a MacBook Pro with an Apple M1 Pro chip, and a local cluster with 5 workers, 10 cores, and 16 GiB of memory.

[Figure: mem_prof_dask_futures_19_ooi_files — memory profile of the Dask Futures run over the 19 OOI files]
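
For context, MemorySampler usage looks roughly like the following (a sketch; the exact sampling and plotting code used for the figure above is not shown in this thread):

```python
from distributed.diagnostics import MemorySampler

ms = MemorySampler()

# record cluster-wide memory while the futures run
with ms.sample("futures run over 19 OOI files"):
    futures = client.map(ed_to_zarr, desired_raw_file_paths, [sonar_model]*num,
                         [direct_to_zarr]*num, [write_path]*num, [True]*num)
    [f.result() for f in futures]

ms.plot()  # memory-over-time plot like the one referenced above
```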

leewujung commented 2 years ago

Awesome, it's great to see that the new functionality in v0.6.2 just works for delaying/parallelizing the conversion! I think we can close this issue, build on these results, and refine things on the workflow side (i.e., in the recipe idea I've mentioned a few times, we can make parallel conversion a config option for folks who need it and have the resources to do so).

One thing we may want to revisit later is the to_netcdf counterpart, but let's prioritize the combine_echodata issues, since the NCZarr implementation is already in place.