Consider allowing `combine_echodata` to accept different sized `channel` dimensions

b-reyes commented 1 year ago

In PR #808 it was found that some files from noaa-wcsd-pds/data/raw/Bell_M._Shimada/SH1701/EK60 can have different sized channel dimensions. In PR #808 we will not allow these types of files to be combined, but we do see that there is value in allowing for these types of files to be combined. Thus, we need to discuss the possibility of allowing missing channels amongst the datasets being combined.

From discussion, there appears to be two ways to resolve this:

Allow a user to combine these files, if the files with less channels do not introduce any new channel names. For example, if we have a dataset with 4 channels and another with 3, we can still combine these files as long as the dataset with 3 channels doesn't contain different channels.
- This particular case would introduce a potentially high number of NaNs as the files with less than the maximum amount of channels would be padded.
Allow combination, but only write those channels contained in the intersection of all channels amongst the Datasets being combined. For example, if we have a Dataset with channels ['a', 'b', 'c'] and another with ['a', 'c'], then the combined Dataset would have the channels ['a', 'c'].
- This approach has the advantage of not increasing the data size (no NaN padding), however, it introduces the loss of data (since we removed a channel dimension).

If we can't decide on which option is best, we could have a user input to combine that allows the user to choose which of the options above is the most appropriate for their situation.

leewujung commented 1 year ago

If we can't decide on which option is best, we could have a user input to combine that allows the user to choose which of the options above is the most appropriate for their situation.

I am thinking that, in this release (v0.6.3), we can default to raise an error if the EchoData or files to be combined have different number of channels, or if the number of channels is the same but the channel_id do not match exactly.

In a later release, I agree with your suggestion on letting people select which subset of channels they would like to combine or allow to expand. Using the example you have above, if the full set of channels from all files are a, b, c, d, and some files only have a, b, c and some files have a, b, c, d.

subset: e.g., user selects to combine a, b, c, so the resulting data set will have only channels a, b, c
expand: e.g., user decides to NaN-pad and make the entire data set having channels a, b, c, d

Allowing these would be pretty nice, especially the subsetting part. We should caution against using the expansion approach since that would risk creating large dataset.

leewujung commented 1 year ago

In Nov 18 meeting:

We decided that we will implement the "subset" only first, since allow for the subsetting is important for our current needs and it should be straightforward to implement.
The "expansion" will take a bit more work and easily risks killing the user's machine, so let's delay that for now.

b-reyes commented 1 year ago

Based on a discussion between @leewujung and myself, we will first complete the following changes that partially address this issue:

Add the keyword argument channel_selection to combine_echodata that has a datatype of None or a List
By default we will let channel_selection = None and have a RuntimeError if all of the channels are not the same amongst the files being combined.
- If a RuntimeError is encountered, it may be nice to print out the subset of channels that are in all of the files
If channel_selection is a list, it will specify what channels the user wants in the final combined EchoData object
- If these selected channels are not in all of the files this scenario would require an expansion (i.e. NaN padding) and we will not implement this scenario right now. We will raise a NotImplementedError in this case.

leewujung commented 1 year ago

The below are only for the current implementation that ONLY allows for user choosing a subset of the channels. Expansion is not allowed.

EK60 / AZFP:
- there should always be only a single beam group under Sonar that is called Sonar/Beam_group1
- user only provide a list, or their dictionary should have a key called "Sonar/Beam_group1"
- the list will be used for all groups that have a channel dimension
EK80:
- input will be of the form:
```
{
"Sonar/Beam_group1": ["a", "b", "c"],
"Sonar/Beam_group2": ["c", "d", "e"],
}
```
  or only 1 list, which will be interpreted as both keys having the same values
- Environment group: channel dimension does not exist
- comes from Environment datagram so does not have channel
- Provenance group: channel dimension does not exist
- Platform group: union of the 2 lists if there are 2
- Sonar group: union of the 2 lists if there are 2
- Vendor_specific:
- channel: union of the 2 lists if there are 2
- cal_channel_id: maintain current behavior: this is treated as constant and the dataset with the largest length will be used, so nothing will be done

leewujung commented 1 year ago

@b-reyes and I decided that we'll close this issue and create a new one for the "expansion" type of combine operation.

OSOceanAcoustics / echopype

Consider allowing `combine_echodata` to accept different sized `channel` dimensions #819