Closed b-reyes closed 2 years ago
The coordinates currently specified in this yml file is not exhaustive, but for v0.6.0 I suggest that we take a minimalist approach to only take care of breaking changes that users will see or use immediately, such as via open_converted
.
Anything that is missing but non-breaking, we can take care of in the following v0.6.X.
If we want to take the minimalist approach, then I suggest we just add coords
to each group (the first suggestion above). This will allow us to address the change from range_bin
--> range_sample
and the reorganization of the Beam groups.
add coords to each group
This would work.
From what I can remember atm, here are what we will need to change/add into the yml:
Beam
: Beam_group1
Beam_power
when exists : Beam_group2
range_bin
: range_sample
frequency
: channel_id
channel_id
change, we add a new data variable frequency_nominal
to retain the frequency informationrange
: echo_range
(this shouldn't be needed, since this is only after calibration)@b-reyes : How about if you do the first 3, and @emiliom and I take care of the rest once your first batch of changes are in? since we still need to take care of #566.
@leewujung I believe the first two items are already done in 1.0.yml
. See the name
portion below
Now that you mentioned items 4, 5, and 6, do you think we should just go with a full blown description of the groups and associated data variables? That way we are not modifying 1.0.yml
in a significant way in the future? I know this may overlap with pydantic, but it seems like we could just replace 1.0.yml
with pydantic or have it be part of pydantic. For example, we could do the following:
groups:
environment:
name: Environment
description: contains information relevant to acoustic propagation through water.
ep_group: Environment
variables:
absorption_indicative:
coords: ['frequency', 'ping_time']
sound_speed_indicative:
coords: ['frequency', 'ping_time']
Then when we are checking the file we know exactly what to change when doing the conversions.
I believe the first two items are already done in 1.0.yml.
Oh great! I hadn't looked into the file so that's great!
Now that you mentioned items 4, 5, and 6, do you think we should just go with a full blown description of the groups and associated data variables
I think for v0.6.0 we can aim for having just the coordinate part of every group, to minimize delay to roll out the release, and add description for all data variables in later v0.6.X -- on this second part, @emiliom should definitely chime in here because I think he has something that captures the v.1 convention entirely that could be used/useful in including everything into this yml.
Just to clarify @leewujung is your vote for the addition of coords
to each group too or do you think this should be done differently (for v0.6.0)?
Yes, my vote is to add coords
to each group (but not the data variables unless those are part of what we have to check).
I just realized that the coordinates of a particular group can change depending on the echosounder that you are using. Is this correct? If so, I think we need to be a little more specific when putting the coordinates. For example, we can do the following:
groups:
environment:
name: Environment
description: contains information relevant to acoustic propagation through water.
ep_group: Environment
coords_ek60: ['frequency' , 'ping_time']
coords_ek80: ['ping_time']
coords_azfp: ['ping_time']
coords_ad2cp: ['ping_time', 'ping_time_burst', 'ping_time_average', 'ping_time_echosounder']
I think this would be fine since for only the coordinates this should be manageable.
Coordinates really should not be something we change very often. I think once we are in compliance with v1 the transition to v2 would be smoother. The proposed changes in v2 I know of are more in the addition type, aiming to accommodate new data types (such as ADCP) or clarification/specification (such as transducer orientation).
Note the AD2CP format has not been reviewed carefully (and that is low on the priority list, we should build up the processing capability first for the other echosounders), so as long as thing "work" for that format, we can save the elaboration to later.
I've gone over #598. I see the convenience of leveraging 1.0.yml
to also support the backwards compatibility and mapping need. But, 1.0.yml
was intended to be the source of SONAR-netCDF4 v1 truth defined in a single place, independently of sensor idiosyncracies. Adding the sensor-specific coords would change this. Plus, for clarity we'd want to specify what those coords refer to: the echopype 0.5.x coords, the 0.6.x coords, or both (ie, the mapping)? The mapping would be needed if the intent is to enable reading of 0.5.x files and conversion to 0.6.x, which will involve renaming the coords, among other things. For example, for the 0.5.x Beam
group in ek80, quadrant
will map to beam
and frequency
will map to channel
.
Note also that the beam group mappings from 0.5.x to 0.6.x is also iffy. Here's what the current version (in dev
) of 1.0.yml
has for the 2 beam groups included (I've removed the description
attribute):
beam:
name: Sonar/Beam_group1
ep_group: Sonar/Beam_group1
beam_power:
name: Sonar/Beam_group2
ep_group: Sonar/Beam_group2
It sounds like the intent is to use beam
and beam_power
as the 0.5.x group names that get mapped to the corresponding entry in ep_group
(Sonar/Beam_group1
and Sonar/Beam_group2
). But that would fail as is because the names of the 0.5.x groups in the nc/zarr files actually start with an uppercase letter: Beam
and Beam_power
. That could easily be accommodated in code by capitalizing the first letter, but that's layering one thing on top of another. Also, I think our retention of beam
and beam_power
in 1.0.yml was intended as transitional.
Ultimately, 1.0.yml should have only one Beam group, b/c that's how the convention is defined: the template for all beam groups is defined once, and it should be reused in as many beam groups as needed. But since we've had a hard-wired reliance on those two possible groups (beam
and beam_power
) for some critical functionality, the current arrangement helps ease the transition.
Finally, in #574 I started a shift to define sensor-specific information in the set_group_SENSOR.py
modules. Currently it's just the beam group name(s) and descriptions. eg, see set_groups_ek80.py#L17.
I'll end this long comment here and start a new one with suggestions.
I think we may be trying to do too much within this single yml file. A couple of alternatives come to mind:
set_group_SENSOR.py
modules.I think I lean towards 3. These sensor-specific dicts (whether in yml or Python) would be used in concert with 1.0.yml. The beam group information (names and descriptions) would be defined there rather than in SetGroupsSENSOR.__init__
as I did in #574.
I'll sketch what that dict would look like by tomorrow.
1.0.yml was intended to be the source of SONAR-netCDF4 v1 truth defined in a single place, independently of sensor idiosyncracies.
Ultimately, 1.0.yml should have only one Beam group, b/c that's how the convention is defined: the template for all beam groups is defined once, and it should be reused in as many beam groups as needed.
I read through everything above and thought about this some more. I agree with @emiliom that the 1.0.yml
should stay as the "truth" for SONAR-netCDF4 v1 and be reused as much as possible. This means that the final netCDF data model coming out from the set_X
functions will have the exact coordinates and attributes as defined in this yml file.
I also like the option 3 proposed above: to make the 0.5.x to 0.6.x mapping in a dict, and in a sensor-specific form. This will be a nice parallel to our core.py
, so nothing complicated to reconcile per my https://github.com/OSOceanAcoustics/echopype/pull/598#issuecomment-1080840144.
Also, maintaining such mapping in dict allows us to be nimble when such changes need to happen in the future, in a way that is completely independent of the 1.0.yml
aka "truth from the convention."
Not having to force things in a form like in the yml that is uniform across all groups allows the type of Beam
--> Sonar/Beam_group1
and Beam_power
--> Sonar/Beam_group2
mapping to happen easily. Once this mapping happened under the hood, content of 1.0.yml
will still be used to set the coordinates and attributes in a principled way conformed to the convention.
@emiliom : Thanks a lot for thinking this through and providing thoughtful suggestions. My comment above was stating the obvious or re-iterating what you have said, but I found it useful to write it out again for myself -- assuming that I got everything correctly!
I also like the option 3 proposed above: to make the 0.5.x to 0.6.x mapping in a dict, and in a sensor-specific form. This will be a nice parallel to our core.py, so nothing complicated to reconcile per my https://github.com/OSOceanAcoustics/echopype/pull/598#issuecomment-1080840144.
I like the analogy to core.py
. We could probably reuse the same TYPE_CHECKING scheme.
@emiliom thank you for your helpful comments! You bring up some good points. I think it would be more beneficial to separate what we are trying to achieve here from 1.0.yml
, as you described. I like the idea of having a separate file/dict for each sensor too. I do lean towards making these yml
files rather than dicts
in the Python code. I am starting to like yaml files!
Thanks @b-reyes! We can sketch the content independently of the decision to go with a yml
file or a Python dict
. I'll go ahead and finish my sketch, and we can take it from there.
Here's my suggestion for the new, sensor-specific dictionary that supports reading from files created by echopype v0.5.x and provides the mappings to v0.6.x. I took the v0.5.x coords
from #598 and made small edits as needed, then added v0.6.x mappings based on simple dimension name changes. This example is for EK60. Some features:
1.0.yml
. Therefore, the two sources of information can be easily combined.coords
None
or [None]
to represent empty or non-existent entries in v0.5.x. We could use other approacheslocation_time
is a dimension name that may change in v0.6.x. See #518top
dict entry could be omitted, since it doesn't map anything. Or kept for completeness"beam_power": None,
for completeness. Alternatively, it could be omittedbeam
and beam_power
entries, the v0.6.x dict item includes the item "group_descr". This will be read in SetGroupsSENSOR.__init__
dict
(or yml
) rather than in a combined dict
as done in core.py. IMHO a dict
that includes all sensors would be too unwieldy.This is just a proposal. Plenty of details could be changed.
{
"top": {
"v0.5.x": {
"ep_group": None,
"coords": [None],
},
"v0.6.x": {
"ep_group": None,
"coords": [None],
},
"environment": {
"v0.5.x": {
"ep_group": "Environment",
"coords": ['frequency', 'ping_time'],
},
"v0.6.x": {
"ep_group": "Environment",
"coords": ['channel', 'ping_time'],
},
},
"platform": {
"v0.5.x": {
"ep_group": "Platform",
"coords": ['location_time', 'ping_time', 'frequency'],
},
"v0.6.x": {
"ep_group": "Platform",
"coords": ['location_time', 'ping_time', 'channel'],
},
},
"provenance": {
"v0.5.x": {
"ep_group": "Provenance",
"coords": [None],
},
"v0.6.x": {
"ep_group": "Provenance",
"coords": [None],
},
},
"sonar": {
"v0.5.x": {
"ep_group": "Sonar",
"coords": [None],
},
"v0.6.x": {
"ep_group": "Sonar",
"coords": ['beam_group'],
},
},
"vendor": {
"v0.5.x": {
"ep_group": "Vendor",
"coords": ['frequency', 'pulse_length_bin'],
},
"v0.6.x": {
"ep_group": "Vendor",
"coords": ['channel', 'pulse_length_bin'],
},
},
"beam": {
"v0.5.x": {
"ep_group": "Beam",
"coords": ['frequency', 'ping_time', 'range_bin', None],
},
"v0.6.x": {
"ep_group": "Sonar/Beam_group1",
"group_descr": (
"contains backscatter power (uncalibrated) and other beam or"
" channel-specific data, including split-beam angle data when they exist."
),
"coords": ['channel', 'ping_time', 'range_sample', 'beam'],
},
},
"beam_power": None,
"nmea": {
"v0.5.x": {
"ep_group": "Platform/NMEA",
"coords": ['location_time'],
},
"v0.6.x": {
"ep_group": "Platform/NMEA",
"coords": ['location_time'],
},
},
}
@emiliom thank you for the draft of the sensor-specific file. I like the idea of keeping the group top
around for completeness. It allows us to change the structure of top
in the future (maybe adding metadata). It seems like you have done a good job "encoding" the conversion e.g. in "v0.5.x"
you have 'frequency'
and in the same index for "v0.6.x"
you have 'channel'
. I like that you do a similar thing for the groups. We should maybe comment this in the dict
or yaml
file that we create.
Now that I think about it a little more, would putting this in a yaml file cause issues when we go to the cloud? If there are no issues here, what would be the reason to use a dict
, rather than a yaml
? Maybe reading a yaml
is slow?
Other than the yaml
vs. dict
discussion I don't really have any concerns. Tomorrow I can try to actually use this form for conversion and then decide if any improvements can be made.
Regarding dict
vs yaml
, here are some considerations that come to mind:
@emiliom thank you for expanding on the dict
vs yaml
discussion. I see pros for both approaches. I think that having it as a yaml
can be justified by your second sub-bullet above. However, for now, I will make it a dict because I do like the idea that it would be pre-compiled (and might be faster). Maybe we can discuss this further once I do a PR.
I do have thoughts on changing the structure of the dict you outline above though. What do you think about the slight modification below?
{
"v0.5.x": {
"top": {
"ep_group": None,
"coords": [],
},
.
.
.
},
"v0.6.x": {
"top": {
"ep_group": "None,
"coords": [],
},
.
.
.
}
}
The first change I made is to have the versions as the first set of keys, as it might be useful for future use cases. For example, if we name this dict
, say VERSION_STRUCTURE
, I could just do VERSION_STRUCTURE['v0.5.x']
and have the full structure of the groups. This might be useful later for @lsetiawan, if he decides to use this dict
to create the datatree. This may come in handy later on as we start to remove groups, such as beam_power
too.
The second change I made was making coords = []
, rather than coords = [None]
. In my mind, this is more pythonic because it leaves open the possibility of checking if there are no coords
with a simple if statement. For example,
if VERSION_STRUCTURE['v0.5.x']['top']['coords']:
do something with coords
Currently, this would amount to something like the following (which could easily go wrong if the first element is None, but the the rest are filled):
if VERSION_STRUCTURE['v0.5.x']['top']['coords'][0] is not None:
do something with coords
The first change I made is to have the versions as the first set of keys, as it might be useful for future use cases.
That looks good to me. I can see how it could be cleaner to get the full structure for a version in one fell swoop as VERSION_STRUCTURE['v0.5.x']
. One minor drawback, to my eye, is that it'll place more physical/visual distance between equivalent elements (eg, sonar
coords
) across versions, but that's just me.
The second change I made was making coords = [], rather than coords = [None]
Definitely agree with your reasoning. FYI, the main reason I suggested None
in such cases was for consistency with cases where a coord that is missing in v0.5.x is added in v0.6.x, such as in beam
where we have ['frequency', 'ping_time', 'range_bin', None]
vs ['channel', 'ping_time', 'range_sample', 'beam']
. In such cases it seems critical to have None
as a place holder so that the individual mappings are clear; in the example of the beam
group, the 4th element, None
, maps to beam
.
Thanks!
One minor drawback, to my eye, is that it'll place more physical/visual distance between equivalent elements
That is a fair point. I do see that it would be very helpful if/when we add future versions. I guess my only concern is that it seems like the groups will/can change between versions. For example, it looks like after datatree has been implemented, we will no longer have beam
as an actual group. If this is correct, then it looks like we are basing the structure off of 0.5.x. With the structure I proposed, this would not be an issue.
Something that came to mind, should we remove ep_group
and just have the key be the EchoData group? This might be a minor inconvenience to convert from 0.5.x to 0.6.x, but it would be easier after that and would not tie us to the 0.5.x structure.
In such cases it seems critical to have
None
as a place holder so that the individual mappings are clear
I absolutely agree with this and I think that this is a good idea! I think if we end up adding a coordinate to say top
(probably unlikely), then in that situation we should add None
to all coords
lists.
Regarding your suggested VERSION_STRUCTURE['v0.5.x']
structure: I'm good with it. I was just commenting on where I was coming from, that's all.
Something that came to mind, should we remove
ep_group
and just have the key be the EchoData group? This might be a minor inconvenience to convert from 0.5.x to 0.6.x, but it would be easier after that and would not tie us to the 0.5.x structure.
The ep_group
is the string-based group path, eg, "Sonar/Beam_group1", while the key (eg, "beam") is the legacy label that also ties things together to the group labels in 1.0.yml and serves as a bridge between 0.5.x and 0.6.x. So, unless I misunderstand what you mean (I'm not sure I follow what you mean by "the EchoData group"), I don't recommend this change.
In such cases it seems critical to have None as a place holder so that the individual mappings are clear
I absolutely agree with this and I think that this is a good idea! I think if we end up adding a coordinate to say top (probably unlikely), then in that situation we should add None to all coords lists.
You're still going with []
rather than [None]
, right? Like I said, []
looks like a good approach.
Regarding your suggested VERSION_STRUCTURE['v0.5.x'] structure: I'm good with it. I was just commenting on where I was coming from, that's all.
Good, I will keep this in the structure.
The ep_group is the string-based group path, eg, "Sonar/Beam_group1", while the key (eg, "beam") is the legacy label that also ties things together to the group labels in 1.0.yml and serves as a bridge between 0.5.x and 0.6.x. So, unless I misunderstand what you mean (I'm not sure I follow what you mean by "the EchoData group"), I don't recommend this change.
What you call ep_group
I was calling the EchoData group. I guess my question is, should we tie this structure to the "legacy label"? I have thought about this some more and I see that your structure is really the only way we can "encode" the conversion in the file. Sorry about this little detour, I think you can safely disregard it!
You're still going with
[]
rather than[None]
, right? Like I said,[]
looks like a good approach.
Yes, I am still going with coords: []
. All I am saying is that if in the future we have only one variable in coords
that is not contained in prior versions, then in that case, should we modify all prior version coords
to coords: [None]
or leave it as coords: []
?
Yes, I am still going with coords: []. All I am saying is that if in the future we have only one variable in coords that is not contained in prior versions, then in that case, should we modify all prior version coords to coords: [None] or leave it as coords: []?
Using coords: []
seems fine. Maybe once you start using it in the code you'll have a better sense of whether there are any advantages at all to using coords: [None]
.
@leewujung and @emiliom can we close this since we have decided not to modify this convention file?
Yes I think it is safe to close this now.
This issue is related to #570 and #519, which deal with a restructuring of the
EchoData
object.In order to account for this restructuring, we need to allow for the conversion of past
EchoData
objects into this new form. This is particularly necessary inopen_converted
.To make this conversion easier, it is necessary to add some slight modifications to 1.0.yml. We need to either add
coords
for each group e.g.or add the coordinates into
variable_and_varattributes
for each group in1.0.yml
.If anyone has a different suggestion, please let me know!
@leewujung @emiliom @lsetiawan