Closed leewujung closed 2 years ago
Thank you @leewujung for putting this together! I have been working on this conversion problem, so I am happy to work on several of these items. I think that once we establish the sensor-specific files in #596, then we can proceed with making the code forward compatible. The following is the first step I would take (input is welcome) to address this issue. Note that this is primarily focused on changes to downstream items related to open_converted
.
__load_file
.
self.group_map.items()
with a sensor-specific dict obtained from #596. From my understanding, if we end up implementing datatree, we would only need to modify the above items slightly.
Wow, thank you @leewujung !! This is really helpful.
Group structure changes
I think what you describe as forward compatibility is what we had agreed on. That is, to temporarily retain (at least in 0.6.0) the current EchoData
group referencing (.beam
, .sonar
, etc) but map them to the 0.6.0 group paths (ie, correct SONAR-netCDF4 v1 paths).
When using open_converted
to read a 0.5.x file (which would use open_datatree behind the scenes), the group paths and names would be "wrong" (ie, 0.5.x paths) vis a vis 0.6.x. As @b-reyes discussed above (thanks for the additional details, @b-reyes!), the group mappings we're discussing in #596 (and #598) are intended to address this challenge.
I think there are still challenges we need to work through, and let's discuss them at our meeting tomorrow. For example, all these group mappings and variable renamings (and other changes -- see below) have a consequence I hadn't thought about: open_converted
applied on a 0.5.x file might not be doable with lazy loading alone.
There are two types of changes we need to deal with:
Actually, there's a third type that's related to "variable name changes" but a lot more consequential: changes to a dimension/coordinate that go beyond just a name change. Specifically:
beam
dimension (#520). EK80 is the only exception, where the change really is just a simple name change to the existing quadrant
dimension.frequency
dimension (#566). It's not just a simple renaming to channel
. The frequency information will now be stored somewhere else, in the new frequency_nominal(channel)
variable. Code that will need to be changed includes query patterns that relied on the frequency
coordinate, and open_raw
operations that populate the frequency information.all these group mappings and variable renamings (and other changes -- see below) have a consequence I hadn't thought about: open_converted applied on a 0.5.x file might not be doable with lazy loading alone.
I am not sure "what it takes" to rename the coordinate variables (i.e. does the renaming operation requires actually loading data into memory?), but this part to hook up the v0.6.0 groups and subgroups to our current flattened EchoData.GROUP
as part of open_converted
should be fine with lazy loading.
For the third type of changes -- Thanks! I agree this is a different type that should be listed to the main issue description! I'll add it later today.
- Addition of the new beam dimension (https://github.com/OSOceanAcoustics/echopype/issues/520). EK80 is the only exception, where the change really is just a simple name change to the existing quadrant dimension.
This is true, but I think the processing of the other sonar models can be achieve without much disturbance of the code by using squeeze
to remove that extra dimension beam
with dim=1.
Fundamental change to the meaning of the frequency dimension (https://github.com/OSOceanAcoustics/echopype/issues/566). It's not just a simple renaming to channel. The frequency information will now be stored somewhere else, in the new frequency_nominal(channel) variable. Code that will need to be changed includes query patterns that relied on the frequency coordinate, and open_raw operations that populate the frequency information.
Agreed. I think this will be a tedious PR thought I don't see immediately major obstacles.
but this part to hook up the v0.6.0 groups and subgroups to our current flattened EchoData.GROUP as part of open_converted should be fine with lazy loading.
100% correct. I'm glad you pointed it out, since it's easy (for me at least) to get lost in the details.
This is true, but I think the processing of the other sonar models can be achieve without much disturbance of the code by using squeeze to remove that extra dimension beam with dim=1.
Oh, that's brilliant!!! I hadn't thought of that. Hopefully it will pan out that way.
Actually, there's a third type that's related to "variable name changes" but a lot more consequential: changes to a dimension/coordinate that go beyond just a name change. Specifically:
Thank you for pointing this out @emiliom, I will be sure to keep this in mind. Perhaps I can lay the groundwork for this forward compatibility by just implementing the restructuring of the groups and changing range_bin
to range_sample
, then we can work together to see the best way to go about the change to channel
.
just implementing the restructuring of the groups and changing range_bin to range_sample
That will be great! This will also let us know if renaming a coordinate requires loading the coordinate data into memory. But I think even if it is, the penalty may not be a problem for the size of data files we are talking about at the moment -- until we fix the problem with combine_echodata
to produce a gigantic zarr file (since right now people run out of memory to combine too many files 😅, so that difficult scenario to have to rename the coordinate from a gigantic dataset actually does not happen -- one of the times where we have the blessing of missing functionalities...?).
Perhaps I can lay the groundwork for this forward compatibility by just implementing the restructuring of the groups and changing
range_bin
torange_sample
then we can work together to see the best way to go about the change tochannel
That sounds great, @b-reyes ! Specially the range_bin > range_sample conversion (in open_converted
, right?). For the restructuring of the groups, once we settle down on the form of the mapping dict (https://github.com/OSOceanAcoustics/echopype/issues/596#issuecomment-1082471372), you can start working on it. But bear in mind that @lsetiawan said he'll work on the EchoData.grouplabel <> EchoData.getter["group-path"] mappings on Friday. Your work on open_converted
to map groups from 0.5.x to 0.6.x will probably have to interact with Don's work; though I don't have a good feel for how much of a handshake there needs to be.
But bear in mind that @lsetiawan https://github.com/OSOceanAcoustics/echopype/issues/567#issuecomment-1082324935 he'll work on the EchoData.grouplabel <> EchoData.getter["group-path"] mappings on Friday. Your work on open_converted to map groups from 0.5.x to 0.6.x will probably have to interact with Don's work; though I don't have a good feel for how much of a handshake there needs to be.
I suggest that we talk this through and agree on who will do what in the call tomorrow morning. There are 2 steps to consider:
open_tree
can provide the mechanism for linking both the v0.5.x and v0.6.x files to the current flattened EchoData.GROUP
structure without us changing much in the downstream computations (other than coordinate and variable name fixes).
To actually enable the EchoData.grouplabel <> EchoData.getter["group-path"] mapping. Directly using open_tree
wouldn't work, since EchoData["Sonar/Beam_groupX"]
only works for v0.6.x and EchoData["Beam"]
only works for v0.5.x, so something has to happen there.
@emiliom yes, once we settle #596 I will start working on this, I plan to follow the outline I created above. I will be sure to discuss this with @lsetiawan in the meeting today.
Done! Remaining things to be picked up include:
Context
We are coming up against multiple planned breaking changes in v0.6.0. The changes include netCDF group structures as well as coordinates and variable names and attributes.
We know that many users, including both OOI and NCEI have converted some non-trivial volume of data into zarr or nc using echopype v0.5.x, so it is imperative that all processing can accommodate these already converted data.
In light of the group structure changes, we also want to ease the transition for ourselves such that we do not disturb the entire codebase significantly all at once (so that we can focus on convention compliance for v0.6.0).
Problems and Approaches
There are two types of changes we need to deal with:
Group structure changes
After discussions (#567) we concluded that a cleaner approach is for us to use DataTree functionality to allow accessing the data like a dictionary or like in the netcdf4 library, such as
EchoData["Sonar/Beam_groupX"]
, because our current attribute setup would requireEchoData.sonar.beam_group1
which can be quite involved to implement. The downside to this are:open_tree
on v0.5.x data, the groups will not be in the right placeThe last one requires some more discussion, so I wonder if an intermediate step we can take in v0.6.0 is to avoid this (for now) by making our code forward compatible to the v0.6.0 data: Basically, we would hook up:
EchoData.beam
toroot/Sonar/Beam_group1
EchoData.beam_power
(if exists) toroot/Sonar/Beam_group2
EchoData.sonar
still toroot/Sonar
(no action)EchoData.nmea
still toroot/Platform/NMEA
(no action) This way all the downstream functions do not have to change in v0.6.0, and can be dealt with later once we figure out how to use DataTree to handle v0.5.x data.Variable name changes
For these changes, I propose that instead of taking the forward compatibility route as above, here we actually change the variable names in the code once the correct groups are read into xarray datasets.
For example, we would:
range_bin
--> immediately change that torange_sample
--> change all the downstream processing that usesrange_bin
also torange_sample
frequency
--> immediately change that tochannel
--> change all the downstream processing that usesrange_bin
also torange_sample
Decision?
I want to list out all the changes we need to make in v0.6.0 here, but that requires that we decide what approach to take for the group structure changes problem. So, pinging @emiliom @b-reyes @lsetiawan for inputs!
Once we arrive at a decision I'll add specific tasks to this issue.