OSOceanAcoustics / echopype

Enabling interoperability and scalability in ocean sonar data analysis
https://echopype.readthedocs.io/
Apache License 2.0
98 stars 73 forks source link

Backward/forward compatibility wrt v0.6.0 group structure/variable name change #606

Closed leewujung closed 2 years ago

leewujung commented 2 years ago

Context

We are coming up against multiple planned breaking changes in v0.6.0. The changes include netCDF group structures as well as coordinates and variable names and attributes.

We know that many users, including both OOI and NCEI have converted some non-trivial volume of data into zarr or nc using echopype v0.5.x, so it is imperative that all processing can accommodate these already converted data.

In light of the group structure changes, we also want to ease the transition for ourselves such that we do not disturb the entire codebase significantly all at once (so that we can focus on convention compliance for v0.6.0).

Problems and Approaches

There are two types of changes we need to deal with:

  1. group structure changes: more involved
  2. variable name changes: more straightforward

Group structure changes

After discussions (#567) we concluded that a cleaner approach is for us to use DataTree functionality to allow accessing the data like a dictionary or like in the netcdf4 library, such as EchoData["Sonar/Beam_groupX"], because our current attribute setup would require EchoData.sonar.beam_group1 which can be quite involved to implement. The downside to this are:

The last one requires some more discussion, so I wonder if an intermediate step we can take in v0.6.0 is to avoid this (for now) by making our code forward compatible to the v0.6.0 data: Basically, we would hook up:

Variable name changes

For these changes, I propose that instead of taking the forward compatibility route as above, here we actually change the variable names in the code once the correct groups are read into xarray datasets.

For example, we would:

Decision?

I want to list out all the changes we need to make in v0.6.0 here, but that requires that we decide what approach to take for the group structure changes problem. So, pinging @emiliom @b-reyes @lsetiawan for inputs!

Once we arrive at a decision I'll add specific tasks to this issue.

b-reyes commented 2 years ago

Thank you @leewujung for putting this together! I have been working on this conversion problem, so I am happy to work on several of these items. I think that once we establish the sensor-specific files in #596, then we can proceed with making the code forward compatible. The following is the first step I would take (input is welcome) to address this issue. Note that this is primarily focused on changes to downstream items related to open_converted.

From my understanding, if we end up implementing datatree, we would only need to modify the above items slightly.

emiliom commented 2 years ago

Wow, thank you @leewujung !! This is really helpful.

Group structure changes

I think what you describe as forward compatibility is what we had agreed on. That is, to temporarily retain (at least in 0.6.0) the current EchoData group referencing (.beam, .sonar, etc) but map them to the 0.6.0 group paths (ie, correct SONAR-netCDF4 v1 paths).

When using open_converted to read a 0.5.x file (which would use open_datatree behind the scenes), the group paths and names would be "wrong" (ie, 0.5.x paths) vis a vis 0.6.x. As @b-reyes discussed above (thanks for the additional details, @b-reyes!), the group mappings we're discussing in #596 (and #598) are intended to address this challenge.

I think there are still challenges we need to work through, and let's discuss them at our meeting tomorrow. For example, all these group mappings and variable renamings (and other changes -- see below) have a consequence I hadn't thought about: open_converted applied on a 0.5.x file might not be doable with lazy loading alone.

There are two types of changes we need to deal with:

Actually, there's a third type that's related to "variable name changes" but a lot more consequential: changes to a dimension/coordinate that go beyond just a name change. Specifically:

leewujung commented 2 years ago

all these group mappings and variable renamings (and other changes -- see below) have a consequence I hadn't thought about: open_converted applied on a 0.5.x file might not be doable with lazy loading alone.

I am not sure "what it takes" to rename the coordinate variables (i.e. does the renaming operation requires actually loading data into memory?), but this part to hook up the v0.6.0 groups and subgroups to our current flattened EchoData.GROUP as part of open_converted should be fine with lazy loading.

For the third type of changes -- Thanks! I agree this is a different type that should be listed to the main issue description! I'll add it later today.

This is true, but I think the processing of the other sonar models can be achieve without much disturbance of the code by using squeeze to remove that extra dimension beam with dim=1.

Fundamental change to the meaning of the frequency dimension (https://github.com/OSOceanAcoustics/echopype/issues/566). It's not just a simple renaming to channel. The frequency information will now be stored somewhere else, in the new frequency_nominal(channel) variable. Code that will need to be changed includes query patterns that relied on the frequency coordinate, and open_raw operations that populate the frequency information.

Agreed. I think this will be a tedious PR thought I don't see immediately major obstacles.

emiliom commented 2 years ago

but this part to hook up the v0.6.0 groups and subgroups to our current flattened EchoData.GROUP as part of open_converted should be fine with lazy loading.

100% correct. I'm glad you pointed it out, since it's easy (for me at least) to get lost in the details.

This is true, but I think the processing of the other sonar models can be achieve without much disturbance of the code by using squeeze to remove that extra dimension beam with dim=1.

Oh, that's brilliant!!! I hadn't thought of that. Hopefully it will pan out that way.

b-reyes commented 2 years ago

Actually, there's a third type that's related to "variable name changes" but a lot more consequential: changes to a dimension/coordinate that go beyond just a name change. Specifically:

Thank you for pointing this out @emiliom, I will be sure to keep this in mind. Perhaps I can lay the groundwork for this forward compatibility by just implementing the restructuring of the groups and changing range_bin to range_sample, then we can work together to see the best way to go about the change to channel.

leewujung commented 2 years ago

just implementing the restructuring of the groups and changing range_bin to range_sample

That will be great! This will also let us know if renaming a coordinate requires loading the coordinate data into memory. But I think even if it is, the penalty may not be a problem for the size of data files we are talking about at the moment -- until we fix the problem with combine_echodata to produce a gigantic zarr file (since right now people run out of memory to combine too many files 😅, so that difficult scenario to have to rename the coordinate from a gigantic dataset actually does not happen -- one of the times where we have the blessing of missing functionalities...?).

emiliom commented 2 years ago

Perhaps I can lay the groundwork for this forward compatibility by just implementing the restructuring of the groups and changing range_bin to range_sample then we can work together to see the best way to go about the change to channel

That sounds great, @b-reyes ! Specially the range_bin > range_sample conversion (in open_converted, right?). For the restructuring of the groups, once we settle down on the form of the mapping dict (https://github.com/OSOceanAcoustics/echopype/issues/596#issuecomment-1082471372), you can start working on it. But bear in mind that @lsetiawan said he'll work on the EchoData.grouplabel <> EchoData.getter["group-path"] mappings on Friday. Your work on open_converted to map groups from 0.5.x to 0.6.x will probably have to interact with Don's work; though I don't have a good feel for how much of a handshake there needs to be.

leewujung commented 2 years ago

But bear in mind that @lsetiawan https://github.com/OSOceanAcoustics/echopype/issues/567#issuecomment-1082324935 he'll work on the EchoData.grouplabel <> EchoData.getter["group-path"] mappings on Friday. Your work on open_converted to map groups from 0.5.x to 0.6.x will probably have to interact with Don's work; though I don't have a good feel for how much of a handshake there needs to be.

I suggest that we talk this through and agree on who will do what in the call tomorrow morning. There are 2 steps to consider:

  1. open_tree can provide the mechanism for linking both the v0.5.x and v0.6.x files to the current flattened EchoData.GROUP structure without us changing much in the downstream computations (other than coordinate and variable name fixes).

  2. To actually enable the EchoData.grouplabel <> EchoData.getter["group-path"] mapping. Directly using open_tree wouldn't work, since EchoData["Sonar/Beam_groupX"] only works for v0.6.x and EchoData["Beam"] only works for v0.5.x, so something has to happen there.

b-reyes commented 2 years ago

@emiliom yes, once we settle #596 I will start working on this, I plan to follow the outline I created above. I will be sure to discuss this with @lsetiawan in the meeting today.

leewujung commented 2 years ago

Done! Remaining things to be picked up include: