Modify SONAR-netCDF4 convention file

b-reyes commented 2 years ago

This issue is related to #570 and #519, which deal with a restructuring of the EchoData object.

In order to account for this restructuring, we need to allow for the conversion of past EchoData objects into this new form. This is particularly necessary in open_converted.

To make this conversion easier, it is necessary to add some slight modifications to 1.0.yml. We need to either add coords for each group e.g.

groups:
  environment:
    name: Environment
    description: contains information relevant to acoustic propagation through water.
    coords: ['frequency' , 'ping_time']
    ep_group: Environment

or add the coordinates into variable_and_varattributes for each group in 1.0.yml.

If anyone has a different suggestion, please let me know!

@leewujung @emiliom @lsetiawan

leewujung commented 2 years ago

The coordinates currently specified in this yml file is not exhaustive, but for v0.6.0 I suggest that we take a minimalist approach to only take care of breaking changes that users will see or use immediately, such as via open_converted.

Anything that is missing but non-breaking, we can take care of in the following v0.6.X.

b-reyes commented 2 years ago

If we want to take the minimalist approach, then I suggest we just add coords to each group (the first suggestion above). This will allow us to address the change from range_bin --> range_sample and the reorganization of the Beam groups.

leewujung commented 2 years ago

add coords to each group

This would work.

From what I can remember atm, here are what we will need to change/add into the yml:

Beam: Beam_group1
Beam_power when exists : Beam_group2
range_bin: range_sample
frequency: channel_id
Along with the channel_id change, we add a new data variable frequency_nominal to retain the frequency information
range: echo_range (this shouldn't be needed, since this is only after calibration)

@b-reyes : How about if you do the first 3, and @emiliom and I take care of the rest once your first batch of changes are in? since we still need to take care of #566.

b-reyes commented 2 years ago

@leewujung I believe the first two items are already done in 1.0.yml. See the name portion below

https://github.com/OSOceanAcoustics/echopype/blob/1afc16aeb9b7ed67d39006ac9192f13b9d76e4b6/echopype/echodata/convention/1.0.yml#L35-L47

Now that you mentioned items 4, 5, and 6, do you think we should just go with a full blown description of the groups and associated data variables? That way we are not modifying 1.0.yml in a significant way in the future? I know this may overlap with pydantic, but it seems like we could just replace 1.0.yml with pydantic or have it be part of pydantic. For example, we could do the following:

groups:
  environment:
    name: Environment
    description: contains information relevant to acoustic propagation through water.
    ep_group: Environment
    variables:
      absorption_indicative:
        coords: ['frequency', 'ping_time']
      sound_speed_indicative:
        coords: ['frequency', 'ping_time']

Then when we are checking the file we know exactly what to change when doing the conversions.

leewujung commented 2 years ago

I believe the first two items are already done in 1.0.yml.

Oh great! I hadn't looked into the file so that's great!

Now that you mentioned items 4, 5, and 6, do you think we should just go with a full blown description of the groups and associated data variables

I think for v0.6.0 we can aim for having just the coordinate part of every group, to minimize delay to roll out the release, and add description for all data variables in later v0.6.X -- on this second part, @emiliom should definitely chime in here because I think he has something that captures the v.1 convention entirely that could be used/useful in including everything into this yml.

b-reyes commented 2 years ago

Just to clarify @leewujung is your vote for the addition of coords to each group too or do you think this should be done differently (for v0.6.0)?

leewujung commented 2 years ago

Yes, my vote is to add coords to each group (but not the data variables unless those are part of what we have to check).

b-reyes commented 2 years ago

I just realized that the coordinates of a particular group can change depending on the echosounder that you are using. Is this correct? If so, I think we need to be a little more specific when putting the coordinates. For example, we can do the following:

groups:
  environment:
    name: Environment
    description: contains information relevant to acoustic propagation through water.
    ep_group: Environment
    coords_ek60: ['frequency' , 'ping_time']
    coords_ek80: ['ping_time']
    coords_azfp: ['ping_time']
    coords_ad2cp: ['ping_time', 'ping_time_burst', 'ping_time_average', 'ping_time_echosounder']

leewujung commented 2 years ago

I think this would be fine since for only the coordinates this should be manageable.

Coordinates really should not be something we change very often. I think once we are in compliance with v1 the transition to v2 would be smoother. The proposed changes in v2 I know of are more in the addition type, aiming to accommodate new data types (such as ADCP) or clarification/specification (such as transducer orientation).

Note the AD2CP format has not been reviewed carefully (and that is low on the priority list, we should build up the processing capability first for the other echosounders), so as long as thing "work" for that format, we can save the elaboration to later.

emiliom commented 2 years ago

I've gone over #598. I see the convenience of leveraging 1.0.yml to also support the backwards compatibility and mapping need. But, 1.0.yml was intended to be the source of SONAR-netCDF4 v1 truth defined in a single place, independently of sensor idiosyncracies. Adding the sensor-specific coords would change this. Plus, for clarity we'd want to specify what those coords refer to: the echopype 0.5.x coords, the 0.6.x coords, or both (ie, the mapping)? The mapping would be needed if the intent is to enable reading of 0.5.x files and conversion to 0.6.x, which will involve renaming the coords, among other things. For example, for the 0.5.x Beam group in ek80, quadrant will map to beam and frequency will map to channel.

Note also that the beam group mappings from 0.5.x to 0.6.x is also iffy. Here's what the current version (in dev) of 1.0.yml has for the 2 beam groups included (I've removed the description attribute):

  beam:
    name: Sonar/Beam_group1
    ep_group: Sonar/Beam_group1
  beam_power:
    name: Sonar/Beam_group2
    ep_group: Sonar/Beam_group2

It sounds like the intent is to use beam and beam_power as the 0.5.x group names that get mapped to the corresponding entry in ep_group (Sonar/Beam_group1 and Sonar/Beam_group2). But that would fail as is because the names of the 0.5.x groups in the nc/zarr files actually start with an uppercase letter: Beam and Beam_power. That could easily be accommodated in code by capitalizing the first letter, but that's layering one thing on top of another. Also, I think our retention of beam and beam_power in 1.0.yml was intended as transitional.

Ultimately, 1.0.yml should have only one Beam group, b/c that's how the convention is defined: the template for all beam groups is defined once, and it should be reused in as many beam groups as needed. But since we've had a hard-wired reliance on those two possible groups (beam and beam_power) for some critical functionality, the current arrangement helps ease the transition.

Finally, in #574 I started a shift to define sensor-specific information in the set_group_SENSOR.py modules. Currently it's just the beam group name(s) and descriptions. eg, see set_groups_ek80.py#L17.

I'll end this long comment here and start a new one with suggestions.

emiliom commented 2 years ago

I think we may be trying to do too much within this single yml file. A couple of alternatives come to mind:

Have all sensor-specific information (coords and beam groups used) in a single, separate yml file. The necessary 0.5.x > 0.6.x mappings would be here, too.
Have individual yml files, one per sensor. The necessary 0.5.x > 0.6.x mappings would be here, too.
Define these sensor-specific information and mappings as dicts in the set_group_SENSOR.py modules.

I think I lean towards 3. These sensor-specific dicts (whether in yml or Python) would be used in concert with 1.0.yml. The beam group information (names and descriptions) would be defined there rather than in SetGroupsSENSOR.__init__ as I did in #574.

I'll sketch what that dict would look like by tomorrow.

leewujung commented 2 years ago

1.0.yml was intended to be the source of SONAR-netCDF4 v1 truth defined in a single place, independently of sensor idiosyncracies.

Ultimately, 1.0.yml should have only one Beam group, b/c that's how the convention is defined: the template for all beam groups is defined once, and it should be reused in as many beam groups as needed.

I read through everything above and thought about this some more. I agree with @emiliom that the 1.0.yml should stay as the "truth" for SONAR-netCDF4 v1 and be reused as much as possible. This means that the final netCDF data model coming out from the set_X functions will have the exact coordinates and attributes as defined in this yml file.

I also like the option 3 proposed above: to make the 0.5.x to 0.6.x mapping in a dict, and in a sensor-specific form. This will be a nice parallel to our core.py, so nothing complicated to reconcile per my https://github.com/OSOceanAcoustics/echopype/pull/598#issuecomment-1080840144.

Also, maintaining such mapping in dict allows us to be nimble when such changes need to happen in the future, in a way that is completely independent of the 1.0.yml aka "truth from the convention."

Not having to force things in a form like in the yml that is uniform across all groups allows the type of Beam --> Sonar/Beam_group1 and Beam_power --> Sonar/Beam_group2 mapping to happen easily. Once this mapping happened under the hood, content of 1.0.yml will still be used to set the coordinates and attributes in a principled way conformed to the convention.

@emiliom : Thanks a lot for thinking this through and providing thoughtful suggestions. My comment above was stating the obvious or re-iterating what you have said, but I found it useful to write it out again for myself -- assuming that I got everything correctly!

emiliom commented 2 years ago

I also like the option 3 proposed above: to make the 0.5.x to 0.6.x mapping in a dict, and in a sensor-specific form. This will be a nice parallel to our core.py, so nothing complicated to reconcile per my https://github.com/OSOceanAcoustics/echopype/pull/598#issuecomment-1080840144.

I like the analogy to core.py. We could probably reuse the same TYPE_CHECKING scheme.

b-reyes commented 2 years ago

@emiliom thank you for your helpful comments! You bring up some good points. I think it would be more beneficial to separate what we are trying to achieve here from 1.0.yml, as you described. I like the idea of having a separate file/dict for each sensor too. I do lean towards making these yml files rather than dicts in the Python code. I am starting to like yaml files!

emiliom commented 2 years ago

Thanks @b-reyes! We can sketch the content independently of the decision to go with a yml file or a Python dict. I'll go ahead and finish my sketch, and we can take it from there.

emiliom commented 2 years ago

Here's my suggestion for the new, sensor-specific dictionary that supports reading from files created by echopype v0.5.x and provides the mappings to v0.6.x. I took the v0.5.x coords from #598 and made small edits as needed, then added v0.6.x mappings based on simple dimension name changes. This example is for EK60. Some features:

The dict keys correspond to the group labels used in 1.0.yml. Therefore, the two sources of information can be easily combined.
The majority of the dict content involves coords
I've used None or [None] to represent empty or non-existent entries in v0.5.x. We could use other approaches
location_time is a dimension name that may change in v0.6.x. See #518
The top dict entry could be omitted, since it doesn't map anything. Or kept for completeness
I included "beam_power": None, for completeness. Alternatively, it could be omitted
For beam and beam_power entries, the v0.6.x dict item includes the item "group_descr". This will be read in SetGroupsSENSOR.__init__
Given how long this dict is, I think it's best to keep it in a stand-alone, sensor-specific dict (or yml) rather than in a combined dict as done in core.py. IMHO a dict that includes all sensors would be too unwieldy.

This is just a proposal. Plenty of details could be changed.

{
  "top": {
      "v0.5.x": {
          "ep_group": None,
          "coords": [None],
      },
      "v0.6.x": {
          "ep_group": None,
          "coords": [None],
      },
  "environment": {
      "v0.5.x": {
          "ep_group": "Environment",
          "coords": ['frequency', 'ping_time'],
      },
      "v0.6.x": {
          "ep_group": "Environment",
          "coords": ['channel', 'ping_time'],
      },
  },
  "platform": {
      "v0.5.x": {
          "ep_group": "Platform",
          "coords": ['location_time', 'ping_time', 'frequency'],
      },
      "v0.6.x": {
          "ep_group": "Platform",
          "coords": ['location_time', 'ping_time', 'channel'],
      },
  },
  "provenance": {
      "v0.5.x": {
          "ep_group": "Provenance",
          "coords": [None],
      },
      "v0.6.x": {
          "ep_group": "Provenance",
          "coords": [None],
      },
  },
  "sonar": {
      "v0.5.x": {
          "ep_group": "Sonar",
          "coords": [None],
      },
      "v0.6.x": {
          "ep_group": "Sonar",
          "coords": ['beam_group'],
      },
  },
  "vendor": {
      "v0.5.x": {
          "ep_group": "Vendor",
          "coords": ['frequency', 'pulse_length_bin'],
      },
      "v0.6.x": {
          "ep_group": "Vendor",
          "coords": ['channel', 'pulse_length_bin'],
      },
  },
  "beam": {
      "v0.5.x": {
          "ep_group": "Beam",
          "coords": ['frequency', 'ping_time', 'range_bin', None],
      },
      "v0.6.x": {
          "ep_group": "Sonar/Beam_group1",
          "group_descr": (
                    "contains backscatter power (uncalibrated) and other beam or"
                    " channel-specific data, including split-beam angle data when they exist."
                ),
          "coords": ['channel', 'ping_time', 'range_sample', 'beam'],
      },
  },
  "beam_power": None,
  "nmea": {
      "v0.5.x": {
          "ep_group": "Platform/NMEA",
          "coords": ['location_time'],
      },
      "v0.6.x": {
          "ep_group": "Platform/NMEA",
          "coords": ['location_time'],
      },
  },
}

b-reyes commented 2 years ago

@emiliom thank you for the draft of the sensor-specific file. I like the idea of keeping the group top around for completeness. It allows us to change the structure of top in the future (maybe adding metadata). It seems like you have done a good job "encoding" the conversion e.g. in "v0.5.x" you have 'frequency' and in the same index for "v0.6.x" you have 'channel'. I like that you do a similar thing for the groups. We should maybe comment this in the dict or yaml file that we create.

Now that I think about it a little more, would putting this in a yaml file cause issues when we go to the cloud? If there are no issues here, what would be the reason to use a dict, rather than a yaml? Maybe reading a yaml is slow?

Other than the yaml vs. dict discussion I don't really have any concerns. Tomorrow I can try to actually use this form for conversion and then decide if any improvements can be made.

emiliom commented 2 years ago

Regarding dict vs yaml, here are some considerations that come to mind:

As Python code, a dict would be pre-compiled at run time by default. No additional effort (code wise) is needed.
In terms of speed, I'd guess that a dict would be a bit faster than opening and reading an external yaml file. But I don't know for sure.
I think the decision to have things in a yaml file rather than code should have clear motivations. For example:
- User-modified parameters, values and flags that drive execution (as in EchoPro). What non-coding users will interact with
- Stand-alone reference information, especially if it might be used outside the Python package or occasionally edited (spelling mistakes, etc). That's the case of 1.0.yml

b-reyes commented 2 years ago

@emiliom thank you for expanding on the dict vs yaml discussion. I see pros for both approaches. I think that having it as a yaml can be justified by your second sub-bullet above. However, for now, I will make it a dict because I do like the idea that it would be pre-compiled (and might be faster). Maybe we can discuss this further once I do a PR.

I do have thoughts on changing the structure of the dict you outline above though. What do you think about the slight modification below?

{
  "v0.5.x": {
      "top": {
          "ep_group": None,
          "coords": [],
      },
      .
      .
      .
      },
    "v0.6.x": {
    "top": {
        "ep_group": "None,
        "coords": [],
    },
    .
    .
    .
    }
}

The first change I made is to have the versions as the first set of keys, as it might be useful for future use cases. For example, if we name this dict, say VERSION_STRUCTURE, I could just do VERSION_STRUCTURE['v0.5.x'] and have the full structure of the groups. This might be useful later for @lsetiawan, if he decides to use this dict to create the datatree. This may come in handy later on as we start to remove groups, such as beam_power too.

The second change I made was making coords = [], rather than coords = [None]. In my mind, this is more pythonic because it leaves open the possibility of checking if there are no coords with a simple if statement. For example,

if VERSION_STRUCTURE['v0.5.x']['top']['coords']:
    do something with coords

Currently, this would amount to something like the following (which could easily go wrong if the first element is None, but the the rest are filled):

if VERSION_STRUCTURE['v0.5.x']['top']['coords'][0] is not None:
    do something with coords

emiliom commented 2 years ago

The first change I made is to have the versions as the first set of keys, as it might be useful for future use cases.

That looks good to me. I can see how it could be cleaner to get the full structure for a version in one fell swoop as VERSION_STRUCTURE['v0.5.x']. One minor drawback, to my eye, is that it'll place more physical/visual distance between equivalent elements (eg, sonar coords) across versions, but that's just me.

The second change I made was making coords = [], rather than coords = [None]

Definitely agree with your reasoning. FYI, the main reason I suggested None in such cases was for consistency with cases where a coord that is missing in v0.5.x is added in v0.6.x, such as in beam where we have ['frequency', 'ping_time', 'range_bin', None] vs ['channel', 'ping_time', 'range_sample', 'beam']. In such cases it seems critical to have None as a place holder so that the individual mappings are clear; in the example of the beam group, the 4th element, None, maps to beam.

Thanks!

b-reyes commented 2 years ago

One minor drawback, to my eye, is that it'll place more physical/visual distance between equivalent elements

That is a fair point. I do see that it would be very helpful if/when we add future versions. I guess my only concern is that it seems like the groups will/can change between versions. For example, it looks like after datatree has been implemented, we will no longer have beam as an actual group. If this is correct, then it looks like we are basing the structure off of 0.5.x. With the structure I proposed, this would not be an issue.

Something that came to mind, should we remove ep_group and just have the key be the EchoData group? This might be a minor inconvenience to convert from 0.5.x to 0.6.x, but it would be easier after that and would not tie us to the 0.5.x structure.

In such cases it seems critical to have None as a place holder so that the individual mappings are clear

I absolutely agree with this and I think that this is a good idea! I think if we end up adding a coordinate to say top (probably unlikely), then in that situation we should add None to all coords lists.

emiliom commented 2 years ago

Regarding your suggested VERSION_STRUCTURE['v0.5.x'] structure: I'm good with it. I was just commenting on where I was coming from, that's all.

Something that came to mind, should we remove ep_group and just have the key be the EchoData group? This might be a minor inconvenience to convert from 0.5.x to 0.6.x, but it would be easier after that and would not tie us to the 0.5.x structure.

The ep_group is the string-based group path, eg, "Sonar/Beam_group1", while the key (eg, "beam") is the legacy label that also ties things together to the group labels in 1.0.yml and serves as a bridge between 0.5.x and 0.6.x. So, unless I misunderstand what you mean (I'm not sure I follow what you mean by "the EchoData group"), I don't recommend this change.

In such cases it seems critical to have None as a place holder so that the individual mappings are clear

I absolutely agree with this and I think that this is a good idea! I think if we end up adding a coordinate to say top (probably unlikely), then in that situation we should add None to all coords lists.

You're still going with [] rather than [None], right? Like I said, [] looks like a good approach.

b-reyes commented 2 years ago

Regarding your suggested VERSION_STRUCTURE['v0.5.x'] structure: I'm good with it. I was just commenting on where I was coming from, that's all.

Good, I will keep this in the structure.

The ep_group is the string-based group path, eg, "Sonar/Beam_group1", while the key (eg, "beam") is the legacy label that also ties things together to the group labels in 1.0.yml and serves as a bridge between 0.5.x and 0.6.x. So, unless I misunderstand what you mean (I'm not sure I follow what you mean by "the EchoData group"), I don't recommend this change.

What you call ep_group I was calling the EchoData group. I guess my question is, should we tie this structure to the "legacy label"? I have thought about this some more and I see that your structure is really the only way we can "encode" the conversion in the file. Sorry about this little detour, I think you can safely disregard it!

You're still going with [] rather than [None], right? Like I said, [] looks like a good approach.

Yes, I am still going with coords: []. All I am saying is that if in the future we have only one variable in coords that is not contained in prior versions, then in that case, should we modify all prior version coords to coords: [None] or leave it as coords: []?

emiliom commented 2 years ago

Yes, I am still going with coords: []. All I am saying is that if in the future we have only one variable in coords that is not contained in prior versions, then in that case, should we modify all prior version coords to coords: [None] or leave it as coords: []?

Using coords: [] seems fine. Maybe once you start using it in the code you'll have a better sense of whether there are any advantages at all to using coords: [None].

b-reyes commented 2 years ago

@leewujung and @emiliom can we close this since we have decided not to modify this convention file?

leewujung commented 2 years ago

Yes I think it is safe to close this now.

OSOceanAcoustics / echopype

Modify SONAR-netCDF4 convention file #596