Tracking issue for: rename dimensions in groups, including `frequency` to `channel` and `*_time` to `time*`, AZFP variable movements

leewujung commented 2 years ago

Currently we use frequency as a dimension in the echopype-generated dataset (in place of having data from different frequency channels saved into different Beam_groupXs defined in the SONAR-netCDF4 convention ver. 1.0).

However, frequency can be a misnomer and is probably better names as channel, because for broadband echosounders like the EK80, there are a sweep of frequencies in one channel. Using frequency with numerical datatype can also be problematic, such as exposed in #490 where there are more than one transducers at the same frequency but share the same multiplexed transceiver. The good thing is that we can easily slice/select data based on the numerical number that is the nominal frequency of that channel.

So, there are 2 questions, which we have to decide now since this would cause breaking changes and would best be in v0.6.0:

What should we call this dimension?
What datatype should we use for the values in that dimension?

Seeking comments from @emiliom @lsetiawan @b-reyes @gavinmacaulay 🙏

Updates: tracking issues/PRs

[x] #657
[x] #656
[x] #642
[x] #649
[x] #660
[x] #663

leewujung commented 2 years ago

Thoughts on top of my head:

For 1 -- channel may work well as the name of the dimension/coordinate.

For 2 -- We can add a coordinate nominal_frequency along the same dimension to help with the data selection/slicing need.

gavinmacaulay commented 2 years ago

For 1: channel is the term most often used for this. For 2: the most generic label for this dimension would be a unique channel identification string or channel id number. I imagine you could retain a nominal frequency 'coordinate', which then wouldn't need to be unique across channels.

emiliom commented 2 years ago

For 2 -- We can add a coordinate nominal_frequency along the same dimension to help with the data selection/slicing need.

Would that be a coordinate, or a regular "data" variable with channel dimension, but not a coordinate?

leewujung commented 2 years ago

For 2 -- We can add a coordinate nominal_frequency along the same dimension to help with the data selection/slicing need.

Would that be a coordinate, or a regular "data" variable with channel dimension, but not a coordinate?

Oh that's true. If it's a coordinate we'll run into the same non-monotonic/duplicate problem again. So, making it a data variable seems would work?

There are also multiple frequency data variables so along with this let's revisit them based on the conversation with @gavinmacaulay.

leewujung commented 2 years ago

For 2: the most generic label for this dimension would be a unique channel identification string or channel id number.

Right now we have a channel_id data variable in converted EK80 files to select the right set of filter coefficients. I don't remember on top of my head whether this data variable also exist in converted EK60 files, but probably not; and definitely not in AZFP or AD2CP files.

It seems a good idea for us to consider adding this as part of the standard set of data variables for all sonar models.

emiliom commented 2 years ago

Here's the consensus I see emerging about this, plus my own thoughts:

New dimension name to replace frequency: channel
Corresponding coordinate: channel:
- Type: string, for maximum flexibility
- attributes: long_name: "Vendor channel ID"
Associated data variable: nominal_frequency
- Alternative names, for greater consistency with v1 convention:
- frequency_nominal: nominal is used only as a suffix in the v1 convention
- frequency: already used as a dimension in the Environment group
- Type: float
- attributes: same as currently used by echopype in the Beam.frequency coordinate and as described in the Environment.frequency coordinate in the convention

This does mean that slicing on frequency would not be as convenient as it is now in echopype. But it sounds like this is unavoidable, given that the same frequency can occur in >1 channel.

emiliom commented 2 years ago

@leewujung have you had a chance to think about this?

leewujung commented 2 years ago

@emiliom : thanks for pinging, this one somehow slipped through my inbox.

I'll do point-by-point response so that we can arrive at a conclusion here.

New dimension name to replace frequency: channel

Agree

Corresponding coordinate: channel:

Type: string, for maximum flexibility

attributes: long_name: "Vendor channel ID"

Agree

Associated data variable: nominal_frequency

Alternative names, for greater consistency with v1 convention:

frequency_nominal: nominal is used only as a suffix in the v1 convention

frequency: already used as a dimension in the Environment group

I vote for frequency_nominal for consistency with v1 convention.

Also, your mention of Environment makes me realize that this change will also impact other groups that share the same dimension. In that way actually channel is better, but for completeness it would be good to add frequency_nominal as well -- I am thinking about the case for absorption, people may be confused by there are different values but the presence of different frequencies makes it self-explanatory.

Type: float

attributes: same as currently used by echopype in the Beam.frequency coordinate and as described in the Environment.frequency coordinate in the convention

Agree

leewujung commented 2 years ago

In terms of what should be in channel:

For EK60 and EK80, I think we can use the channel_id variable, which contains strings like "GPT 18 kHz 009072034d45 1-1 ES18-11", "WBT 545603-15 120-7C".
For AZFP, this probably needs to be a constructed string, combining the serial number (something like "55075") and the channel frequency. I haven't seen any files with duplicated frequencies, but in theory that is also possible. Since there's no unique identifier I can find in the file (I could check with manufacturer), we could probably do "55075-125-1", "55075-125-2", ..., where 125 is the frequency in kHz (that's the unit they use in the associated XML file), and 1, 2, etc is the sequence number in "Frequencies" in the XML file (it is sequential, so nice that way).

emiliom commented 2 years ago

Let's clarify what name we're proposing for the channel dimension and coordinate. I had suggested channel, and @leewujung I think you had agreed. In your subsequent comment, you brought up the channel_id variable as the source of the values for the channel coordinate for EK60 and EK80.

But we're still in agreement that the name of the dimension and coordinate will be channel, right? I'm double checking because of what you stated here vis a vis channel_id.

leewujung commented 2 years ago

Yeah, I think i was confused in that comment since channel_id is specific for EK80. Let's use channel for this to be more general.

emiliom commented 2 years ago

Great

emiliom commented 2 years ago

Tasks (updated)

In the Beam_groupX groups:

New channel dimension and coordinate variable
- Rename the frequency dimension and coordinate variable to channel
- Attributes: long_name: "Vendor channel ID"
- Use a string data type
- Set the value to (see comments above and below):
- EK60 and EK80: use the content of the current channel_id variable, which contains unique strings like "GPT 18 kHz 009072034d45 1-1 ES18-11", "WBT 545603-15 120-7C".
- AZFP: use serial number + frequency + sequence number in the XML file. "55075-125-1", "55075-125-2", ..., where 55075 is the serial number, 125 is the frequency in kHz (that's the unit they use in the associated XML file), and 1, 2, etc is the sequence number in "Frequencies" in the XML file (it is sequential, so nice that way).
- AD2CP: Does not have frequency as a dimension
Create a new float64 data-type variable, frequency_nominal(channel), to contain the frequency value. It can be populated the same way the current frequency coordinate variable is populated. Its attributes should be the same as in the current frequency coordinate variable.
Modify all relevant query/selection patterns currently based on frequency.

@leewujung recommended a phased approach, where I think the first step is just renaming frequency to frequency_nominal. @leewujung can you flesh this out?

A side note: frequency is also used in other groups. I assume we don't need to change any of those? I haven't looked closely.

b-reyes commented 2 years ago

@emiliom thank you for putting this together. I am starting to work on this.

A side note: frequency is also used in other groups. I assume we don't need to change any of those? I haven't looked closely.

Here is where frequency occurs in each sensor (not including the beam group)

EK60
- Environment --> absorption_indicative, sound_speed_indicative
- Platform --> pitch, roll, vertical_offset, water_level, transducer_offset_x/y/z, MRU_offset_x/y/z, MRU_rotation_x/y/z, position_offset_x/y/z
- Vendor --> sa_correction, gain_correction, pulse_length
EK80
- Vendor --> sa_correction, gain_correction, pulse_length
- Platform --> water_level, transducer_offset_x/y/z, MRU_offset_x/y/z, MRU_rotation_x/y/z, position_offset_x/y/z
AZFP
- Vendor --> XML_transmit_duration_nominal, XML_gain_correction, XML_digitization_rate, XML_lockout_index, digitization_rate, lockout_index, number_of_bins_per_channel, number_of_samples_per_average_bin, board_number, data_type
- Based on #642 the following variables will be moved and have frequency as a dimension: DS, EL, TVR, VTX, Sv_offset, number_of_samples_digitized_per_pings, number_of_digitized_samples_averaged_per_pings
- AD2CP
  - Does not have frequency as a dimension in any group

leewujung commented 2 years ago

@emiliom for your comment above could you just link to my https://github.com/OSOceanAcoustics/echopype/issues/566#issuecomment-1067564872 to avoid confusion? (because I think what you listed was not accurate) -- thanks.

for EK60 AND EK80: channel_id is present and unique, so let's use that for values in this new coordinate variable channel
for AZFP, let's use serial number + frequency + sequence number in the XML file
- "55075-125-1", "55075-125-2", ..., where 55075 is the serial number, 125 is the frequency in kHz (that's the unit they use in the associated XML file), and 1, 2, etc is the sequence number in "Frequencies" in the XML file (it is sequential, so nice that way).

leewujung commented 2 years ago

@leewujung recommended a phased approach, where I think the first step is just renaming frequency to frequency_nominal. @leewujung can you flesh this out?

Below is what I was thinking, see what you guys think:

add the channel coordinate as we intend as the final form and rename the current frequency to frequency_nominal (many tests will break)
assign frequency_nominal as a coordinate variable (in 1. it is a data variable)
do swap_dims({"channel": "frequency_nominal"}) so that all variables are aligned to frequency_nominal --> this will allow the .sel and .isel slicing operations
replace all the frequency by frequency_nominal in all occurrences of the old slicing .sel(frequency=X) or .isel(frequency=X) (tests in theory should pass -- this is to assure us that there has been no mechanistic change of the code, just a change of variable name)
remove the swap_dims in 4. and replace the slicing to be using channel (tests should continue to pass)
remove the assign_coord in 2 so that the datasets are in their final form (tests should continue to pass)

5-6. involve some reversion, so this may not be faster. I personally feel this may be more error proof though, but @b-reyes see what you think because you'll be the person implementing these changes. @emiliom please do chime in on what you think!

emiliom commented 2 years ago

@emiliom for your comment above could you just link to my https://github.com/OSOceanAcoustics/echopype/issues/566#issuecomment-1067564872 to avoid confusion? (because I think what you listed was not accurate) -- thanks.

I was trying to have all the instructions in one place. Thanks for catching my mistake. I got confused by your comment "Yeah, I think i was confused in that comment since channel_id is specific for EK80.". Anyway, I'll update my summary and also add a link to the source comment.

emiliom commented 2 years ago

We don't need to change anything in the Vendor group, since that falls outside the scope of the convention.

b-reyes commented 2 years ago

To get a better sense of things, I am trying to understand the slicing operation. I believe there is some mention of it above, but I wanted to have it said explicitly here. Currently, we do .sel(frequency=X) or .isel(frequency=X). Once we change the name frequency --> channel and create the data variable frequency_nominal(channel) from frequency, what will this operation look like? Is it just .sel(channel=X) and .isel(channel=X)?

leewujung commented 2 years ago

@b-reyes : I think you can look into the code to get that sense, and propose other solutions if needed; there may also be places where this slicing can be eliminated. :)

b-reyes commented 2 years ago

While renaming the current frequency dimension in the Beam groups to channel, it was found that this creates conflicts within the calibration methods (see _get_gain_for_complex). To rectify these conflicts, I suggest that we also rename frequency in the Vendor group to channel. If this change is made, I also suggest that we add frequency_nominal to the Vendor group so that one can reference the frequencies.

@leewujung @emiliom

leewujung commented 2 years ago

I also suggest that we add frequency_nominal to the Vendor group so that one can reference the frequencies.

I think this is a good idea.

b-reyes commented 2 years ago

Similar to Vendor, the Environment group variable ~sound_indicative~ sound_speed_indicative is connected to the beam variables because it is used to calculate range_meter. For this reason, I believe we need to also change it's dimension frequency to channel and add the variable frequency_nominal.

Unfortunately, absorption_indicative (also in the Environment group) is in the convention and it is required that it have a dimension of frequency. How do we handle this clashing of Echopype and the convention?

edited: changed sound_indicative to sound_speed_indicative

leewujung commented 2 years ago

@b-reyes : I am not sure what is sound_indicative? If you are talking about sound speed, it is not frequency dependent in our context, but depending on where the data comes from it can be with different frequency channels in EK60 -- please check that. I keep on stressing where data comes from because that tells you why things are a certain way for EK60 vs EK80 difference, so please check those.

I don't believe we would have absorption_indicative anywhere that does not have a dimension already align with frequency. Absorption is frequency dependent and either comes from the data that way or will be computed that way, so there should not be any clash.

leewujung commented 2 years ago

Also, just as an overarching thing, it probably helps to be explicit here that I think we should change everything that had frequency as a dimension before to be aligned with channel. Again most of those came from the instrument-generated files, so check the set_groups to make sure why different instruments may have difference. All these things are connected either from how the instrument stores the file or physics, so it does not make sense to me if we only change things in the Beam_groupX group and not others.

b-reyes commented 2 years ago

I am not sure what is sound_indicative ?

@leewujung yes, I meant sound_speed_indicative, sorry about that, please see my edit above.

I think it is fine to change the dimension frequency to channel for sound_speed_indicative as it does not have any dimensions in the convention. For EK60 sound_speed_indicative currently has the dimensions (frequency, ping_time) and is (ping_time) for EK80.

I agree that it would make sense to change frequency to channel in all groups. However, I have specifically mentioned the Environment group and absorption_indicative because in the convention this variable has dimension (frequency). Based on our past discussions, we had mentioned that we should not take away dimensions from the convention, but we can add them. Using this logic, if we change frequency to channel, then we would have to let absorption_indicative have the dimensions (channel, frequency, ping_time) for EK60. I know this probably doesn't make sense, but I wanted to bring it up. If we are going to intentionally disregard the convention, then I think the reasoning for that should be openly discussed. For the record, I don't like absorption_indicative having the dimensions (channel, frequency).

gavinmacaulay commented 2 years ago

Note that in the convention the frequency dimension specified for absorption_indicative has no restrictions on it and is not required to have just the frequencies of the acoustic data channels in the file. This was to allow the storing of an absorption/frequency curve that could be interpolated against to derive absorptions for particular echosounder frequencies (or frequency bands for broadband data).

A related note is that the convention was not designed to store, without loss, data files from existing sonar systems. It was rather intended to specify the minimum required data and metadata to allow derivation of quantitative Sv and TS. A compromise with this is that in some cases this prevents/hinders the creation of a lossless replica of a sonar file.

leewujung commented 2 years ago

Conclusions from meeting just now:

frequency_nominal should be included in all groups
frequency_nominal can be different for different Beam_groupX

emiliom commented 2 years ago

@b-reyes could you confirm that the changes that will happen within Beam groups are now fully clear? My impression is that there are no open questions there.

emiliom commented 2 years ago

Decisions for groups other than Beam groups.

AD2CP does not currently use a frequency dimension in any group other than Beam_group1; no changes will be applied to groups other than Beam_group1. For all other instruments:

`Vendor` group

change frequency to channel wherever it occurs, and add frequency_nominal
Based on #642 the following Beam_group1 variables will be moved in AZFP: DS, EL, TVR, VTX, Sv_offset, number_of_samples_digitized_per_pings, number_of_digitized_samples_averaged_per_pings.

`Environment` group

Changes apply only to EK60 and EK80. AZFP is unchanged.
Rename existing, single time dimension to time1. See #656.
change frequency to channel wherever it occurs, and add frequency_nominal
dimensionality of convention-defined variables: absorption_indicative(channel, time1) and sound_speed_indicative(time1) . @leewujung please confirm the latter; I'm not sure if we agreed to retain a (channel, time1) dimensionality instead, as currently done in EK60 under different dimension names

`Platform` group

Rename existing time dimensions as specified in #656, if needed. I don't think anything needs to change at this time, except in EK80, the comment attribute for time3.
change frequency to channel wherever it occurs, and add frequency_nominal
Some other changes to existing variables, including some addition of missing variables, will be required. See #649

NOTE: Some of these changes should be done as individual PR's. For example, the move of AZFP variables to the Vendor group

leewujung commented 2 years ago

dimensionality of convention-defined variables: absorption_indicative(channel, time1) and sound_speed_indicative(time1) . @leewujung please confirm the latter; I'm not sure if we agreed to retain a (channel, time1) dimensionality instead, as currently done in EK60 under different dimension names

I think we can safely "squeeze" the channel dimension out for EK60. Perhaps we can put in a check to make sure all the values are the same when doing that, and spit out an error if the values across the frequency channels are not the same.

b-reyes commented 2 years ago

@b-reyes could you confirm that the changes that will happen within Beam groups are now fully clear? My impression is that there are no open questions there.

Yes, all changes for the beam groups are clear (please see the below clarification).

AD2CP does not currently use a frequency dimension in any group other than Beam_group1; no changes will be applied to groups other than Beam_group1. For all other instruments:

As previously mentioned, frequency should not appear anywhere within AD2CP. Not even Beam_group1 has the frequency dimension.

dimensionality of convention-defined variables: absorption_indicative(channel, time1) and sound_speed_indicative(time1) . @leewujung please confirm the latter; I'm not sure if we agreed to retain a (channel, time1) dimensionality instead, as currently done in EK60 under different dimension names

I think we can safely "squeeze" the channel dimension out for EK60. Perhaps we can put in a check to make sure all the values are the same when doing that, and spit out an error if the values across the frequency channels are not the same.

@emiliom as you noted, several of these items should be done in different PRs/issues within this issue, I suggest we not change the the number of dimensions of sound_speed_indicative for EK60 in this issue. I think we should just change frequency --> channel for sound_speed_indicative for EK60 giving sound_speed_indicative(channel, ping_time). Then deal with other changes once this issue has been dealt with. I think we are already off topic as the title of this issue is "Rename the current frequency dimension in the Beam groups".

emiliom commented 2 years ago

I suggest we not change the the number of dimensions of sound_speed_indicative for EK60 in this issue. I think we should just change frequency --> channel for sound_speed_indicative for EK60 giving sound_speed_indicative(channel, ping_time). Then deal with other changes once this issue has been dealt with. I think we are already off topic as the title of this issue is "Rename the current frequency dimension in the Beam groups".

This issue does cover a lot of ground, beyond the original title! But I thought you wanted to see a clear list of the changes that should occur -- our targets. That's what I was aiming for. I believe we're already including changes to dimensions in order to bring greater alignment across instruments, whenever possible. For example, as you noted earlier, EK60 and EK80 currently differ: "For EK60 sound_speed_indicative currently has the dimensions (frequency, ping_time) and is (ping_time) for EK80."

What sequence of changes/PR's is used to accomplish these targets is a different ball of wax. eg, https://github.com/OSOceanAcoustics/echopype/issues/566#issuecomment-1109034399

b-reyes commented 2 years ago

This issue does cover a lot of ground, beyond the original title! But I thought you wanted to see a clear list of the changes that should occur -- our targets. That's what I was aiming for.

@emiliom I found your list very helpful. Do you think it would be better to have this as its own issue and have each of those items as tasks? I was thinking we could do something similar to #520.

emiliom commented 2 years ago

@emiliom I found your list very helpful.

Great

Do you think it would be better to have this as its own issue and have each of those items as tasks? I was thinking we could do something similar to https://github.com/OSOceanAcoustics/echopype/issues/520.

If by "this" you mean that list, I think I lean towards keeping it in this same issue. That is, feel free to edit my list in place to turn it into tasks. Rework it a bit if you think there's a way to organize it that lends itself better to tasks and PRs.

emiliom commented 2 years ago

@leewujung I don't know why #663 was being tracked in this issue. #663 will be closed soon anyways, but I think we can close this issue.

leewujung commented 2 years ago

I think it is because the frequency to channel will impact what's being selected to plot for the DVM. And yes we can close this now!

OSOceanAcoustics / echopype