Closed — leewujung closed this issue 2 years ago.
Yes. Since we moved to the current open_raw that only loads into memory, something like this has been missing. It looks like you've identified a nice way forward that leverages the new infrastructure of both open_raw and open_converted. +1
One wrinkle to this functionality is that the nc/zarr files created directly through `open_raw` should ideally include any additional standardized metadata. Currently we have the platform metadata, but the list of such metadata may grow in the future. We could add an optional metadata dictionary that is passed to `open_raw` and used regardless of whether `open_raw` serializes the data or not.
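As a rough sketch of how such an optional dictionary might be folded in (the function and parameter names here are hypothetical, not the actual echopype API):

```python
def merge_standard_metadata(parsed_attrs, extra_metadata=None):
    """Hypothetical helper: merge a user-supplied metadata dict into the
    standardized attributes, whether or not the data get serialized."""
    merged = dict(parsed_attrs)        # don't mutate the parser's output
    if extra_metadata:
        merged.update(extra_metadata)  # user-supplied values take precedence
    return merged
```

A keyword such as `open_raw(..., extra_metadata={...})` could then route through a merge step like this on both the in-memory and the serialized paths.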
I will go ahead and assign this issue to myself. I am currently working on a framework that will make `open_raw` delay-friendly.
My approach as of right now is as follows: `set_groups`
@b-reyes: I am linking this to #489 because this set of approaches actually addresses the central memory-expansion problem that makes echopype impractical to use, and solving it naturally requires `open_raw` to be delay-friendly under the hood, as we all discussed in the last two meetings.
In PR #774 we implemented a method that allows users to directly write large variables to a temporary zarr store. This method is currently only available for echosounders in the EK60 or EK80 families. Since the PR only writes a subset of variables to zarr and only applies to EK60/EK80, I believe we have only partially addressed this issue. Should we modify this issue and close it, or keep it as is and move it to a different milestone?
@b-reyes and I just talked about this: we think that this should just work. @b-reyes said he will test it, and if it does just work we'll close this.
I have looked into delaying the process of `open_raw` and `to_zarr` for a collection of files. In particular, I considered the 19 files obtained from our Watching a solar eclipse example notebook.
When setting `open_raw(offload_to_zarr=False)` (i.e. we DO NOT use the functionality introduced in PR #774), I experienced some runs where a worker's memory would spike to an unreasonable level and an error would be produced. When an error was produced, one file would not be written to zarr.
When setting `open_raw(offload_to_zarr=True)` (i.e. we DO use the functionality introduced in PR #774), memory consumption held at a steady level and I never encountered a run that produced an error.
For future reference, here is a code snippet that implements the core of the runs. In this code, `desired_raw_file_paths` contains paths to my LOCAL copies of the 19 OOI files. Additionally, I use Dask Futures instead of `dask.delayed` because I noticed a small reduction in runtime; memory consumption is similar with `dask.delayed`.
```python
import echopype as ep
# assumes a dask.distributed Client is already running, e.g.
# from dask.distributed import Client; client = Client()

def ed_to_zarr(path_to_file, s_model, offload, storage_path, compute):
    ed = ep.open_raw(raw_file=path_to_file, sonar_model=s_model,
                     offload_to_zarr=offload)
    # base the zarr file name off of the raw file name
    # (removesuffix, not rstrip: rstrip('.raw') strips a character set,
    # not a suffix, and would mangle names ending in r/a/w)
    zarr_name = path_to_file.split('/')[-1].removesuffix('.raw') + '.zarr'
    ed.to_zarr(storage_path + zarr_name, compute=compute)
    # remove the temporary directory and the EchoData object
    # to ensure the memory is released immediately
    if offload:
        del ed.parsed2zarr_obj.temp_zarr_dir
    del ed

write_path = './OOI_zarrs_ep_ex/temp/'
direct_to_zarr = True
sonar_model = "ek60"
num = len(desired_raw_file_paths)
futures = client.map(ed_to_zarr, desired_raw_file_paths, [sonar_model]*num,
                     [direct_to_zarr]*num, [write_path]*num, [True]*num)
# block on the futures to make sure all of them complete
[f.result() for f in futures]
```
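For comparison, the same fan-out can be expressed with `dask.delayed` instead of Futures. The sketch below uses a trivial stand-in for `ed_to_zarr` so it is self-contained; only the task-graph shape is the point, and unlike the Futures version it runs on Dask's default local scheduler without a distributed client:

```python
import dask

def convert_one(path):
    # stand-in for ed_to_zarr: real code would call ep.open_raw / to_zarr
    return path.rsplit('.', 1)[0] + '.zarr'

paths = [f"file{i}.raw" for i in range(3)]
tasks = [dask.delayed(convert_one)(p) for p in paths]
results = dask.compute(*tasks)  # blocks until all tasks finish
```

`dask.compute` returns a tuple of results in input order, so the blocking `[f.result() for f in futures]` step has no direct analogue here.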
Additionally, I used Dask's `MemorySampler` to record the memory consumption of the above code. For these results, I was using echopype v0.6.2 on a MacBook Pro with an Apple M1 Pro chip, with a local cluster of 5 workers, 10 cores, and 16 GiB of memory.
Awesome, it's great to see that the new functionality in v0.6.2 just works for delaying/parallelizing the conversion! I think we can close this issue, use the results here, and refine on the workflow side (i.e., in the recipe idea I mentioned a few times, we can make parallel conversion a config option for folks who need it and have the resources to do so).
One thing we may want to revisit at some point later is the `to_netcdf` counterpart, but let's prioritize the `combine_echodata` issues, since there's the NCZarr implementation already.
Currently if we delay `ed = open_raw(...)`, the returned EchoData object contains multiple delayed versions of the `set_groups` methods under the hood, because those operations organize data into nice xarray Datasets and return them to the corresponding attributes of the EchoData object (e.g., `ed.beam`, `ed.platform`, etc.). This is not super useful when we want to handle lots of raw binary files together (for example, the large data file case in #407). A potentially more useful delayed EchoData object is one where each group/attribute is a lazy-loaded xarray Dataset directly linked to the file/object on the local filesystem or in the cloud. We already have this available through `ed = open_converted(...)`, when an EchoData object is created from already-converted netCDF or zarr files.

Can we add an option in `open_raw` that, under the hood, 1) parses the binary data, 2) saves the parsed data from memory to file, and 3) reads it back in using `open_converted`, so that we get an EchoData object that can be nicely delayed? This would allow us to easily parallelize the combine-EchoData-object method (#383).
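The parse → serialize → lazy-read-back flow proposed above can be illustrated generically. This is not echopype code (a JSON file stands in for the zarr store, and `open_raw_like` and `LazyGroup` are invented names standing in for the proposed `open_raw` option and `open_converted`'s lazy groups); it only shows why step 3 makes the returned object cheap to delay:

```python
import json
import os
import tempfile

class LazyGroup:
    """Holds only a path; the data are read back on demand."""
    def __init__(self, path):
        self.path = path

    def load(self):
        with open(self.path) as f:
            return json.load(f)

def open_raw_like(raw_records, store_dir):
    parsed = list(raw_records)                # 1) parse into memory
    path = os.path.join(store_dir, "beam.json")
    with open(path, "w") as f:
        json.dump(parsed, f)                  # 2) serialize to file
    return {"beam": LazyGroup(path)}          # 3) lazy read-back handle

with tempfile.TemporaryDirectory() as d:
    ed = open_raw_like([1, 2, 3], d)
    print(ed["beam"].load())  # data are loaded here, not at "open" time
```

Because the returned object holds only file references, passing it around a Dask graph moves paths rather than parsed arrays, which is what makes the combine step easy to parallelize.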
Tagging @lsetiawan @imranmaj @emiliom for inputs.