OSOceanAcoustics / echopype

Enabling interoperability and scalability in ocean sonar data analysis
https://echopype.readthedocs.io/
Apache License 2.0

Convert large AD2CP file #1221

Open leewujung opened 10 months ago

leewujung commented 10 months ago

This is originally from #407, but the focus of that issue shifted to EK echosounder data instead. Issues related to EK files were addressed in #1185.

This new issue captures the same need for AD2CP files. The approach used in #1185 would likely work here too, with the caveat that part of it may need to happen at the parser stage if the file is very large. For example, if an AD2CP file is several GB and system memory is small, parser.parse_raw may fail due to insufficient memory.
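
For reference, a conversion call along these lines (the file name is just a placeholder) is what typically runs out of memory on a multi-GB file, since the whole binary file is parsed into memory before anything is written out:

```python
import echopype as ep

# Placeholder file name; any multi-GB AD2CP file shows the same behavior.
# open_raw parses the entire binary file in memory (via parser.parse_raw)
# before the resulting EchoData object can be serialized, which is where a
# small-RAM machine runs out of memory.
ed = ep.open_raw("large_deployment.ad2cp", sonar_model="AD2CP")
ed.to_zarr("large_deployment.zarr")
```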

From #407:

During OceanHackWeek '21 a participant attempted to convert ~1 GB files from Nortek AD2CP instruments (Signature 100 and Signature 250) and failed both on the JupyterHub and on their personal machine. This is probably related to the in-memory xarray.merge memory blow-up that we have seen.

@imranmaj @lsetiawan and I discussed this yesterday, and an interim solution is to have an under-the-hood procedure that does the following (a rough sketch follows the list):

  1. parse binary data
  2. if the parsed data reaches a certain size, save what has been parsed so far into a small file
  3. repeat 1-2 until the end of the binary file
  4. merge all small files under the hood; this merge can be delayed and distributed to dask workers.
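
A minimal sketch of what that batched procedure could look like, assuming hypothetical helpers `iter_packet_batches` (yields parsed packets until a size threshold is reached) and `packets_to_dataset` (builds an xarray Dataset from one batch of packets); neither exists in echopype today:

```python
import xarray as xr

BATCH_SIZE_BYTES = 500_000_000  # flush parsed packets to disk at ~500 MB

def convert_in_batches(raw_file, out_prefix):
    """Sketch of steps 1-4: parse, flush to small zarr stores, then merge lazily."""
    batch_paths = []
    # Steps 1-3: parse the binary file in bounded-size batches and write each
    # batch out before parsing continues, so memory use stays roughly constant.
    for i, packets in enumerate(iter_packet_batches(raw_file, BATCH_SIZE_BYTES)):
        ds = packets_to_dataset(packets)          # hypothetical helper
        path = f"{out_prefix}_batch{i:04d}.zarr"
        ds.to_zarr(path, mode="w")
        batch_paths.append(path)

    # Step 4: open the small stores lazily (dask-backed) and combine; the actual
    # merge computation can then be delayed and distributed to dask workers.
    batches = [xr.open_zarr(p) for p in batch_paths]
    return xr.combine_by_coords(batches, combine_attrs="override")
```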

Looking back:

A caveat here is that, without parsing all AD2CP data packets (analogous to datagrams in EK raw files), the "final" shape of the entire zarr store may change across the batches of sequentially parsed data packets. Some work is needed to figure out a strategy for handling this.
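
One possible strategy, sketched under the assumption that batches only grow along a time dimension (the dimension and coordinate names below, e.g. ping_time and beam, are illustrative rather than echopype's actual AD2CP layout): fix the non-time dimensions up front, pad each batch to that shape, and append along time so the final shape never has to be known in advance.

```python
def append_batch(batch_ds, store_path, all_beams):
    """Sketch: pad a parsed batch to a fixed set of non-time dimensions,
    then append it to an existing zarr store along the time dimension."""
    # Appending along ping_time is always safe, but the other dimensions
    # (beams, range bins, ...) must match what is already in the store,
    # so missing entries are filled with NaN via reindex.
    # (The very first batch would instead be written with mode="w"
    # to create the store.)
    batch_ds = batch_ds.reindex(beam=all_beams)
    batch_ds.to_zarr(store_path, append_dim="ping_time")
```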

jessecusack commented 4 months ago

@leewujung has there been any progress on this issue? I have a ~4 GB AD2CP file from a Signature 100 instrument that I cannot convert. After a few minutes I get some warnings ("UserWarning: Converting non-nanosecond precision datetime..."), followed by a notice that the process was killed and another warning ("UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown"). My laptop has 16 GB of RAM and should be able to handle the conversion without a problem. I was monitoring the process memory usage and it was not excessive.

Is there a workaround in the meantime?

leewujung commented 4 months ago

Hey @jessecusack : We haven't been able to work on this further because we're over-committed with other priorities.

Would you be interested in working on it? If I remember correctly from prior investigations, the main source of the large memory expansion was xr.merge, which we can sidestep by changing how data from the different measurement modes are stored, and probably also by writing parsed data to disk directly. File size and memory usage are not a one-to-one match; it depends on the details of the computation involved.
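
As a rough illustration of the "store modes separately" idea (the per_mode_datasets dict and store names below are hypothetical): write each mode's dataset to its own store instead of merging them in memory, and only open them lazily afterwards.

```python
import xarray as xr

# per_mode_datasets: hypothetical dict mapping a mode name (e.g. "average",
# "burst") to the xarray Dataset parsed for that mode.
for mode_name, ds in per_mode_datasets.items():
    # Writing each mode to its own store avoids the in-memory xr.merge of
    # datasets that share few coordinates, which is what blows up memory.
    ds.to_zarr(f"converted_{mode_name}.zarr", mode="w")

# Read back only the mode that is needed; open_zarr returns dask-backed
# (lazy) arrays, so nothing large is loaded until it is actually used.
ds_average = xr.open_zarr("converted_average.zarr")
```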

For "UserWarning: Converting non-nanosecond precision datetime..." -- this is something we know how to fix, as we've fixed that for other echosounder models.

For "UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown", could you copy-paste the entire error message, or better yet, upload a notebook gist so that there's a reproducible example?