OSOceanAcoustics / echopype

Enabling interoperability and scalability in ocean sonar data analysis
https://echopype.readthedocs.io/
Apache License 2.0
95 stars 73 forks source link

Consider allowing `combine_echodata` to accept different sized `channel` dimensions #819

Closed b-reyes closed 1 year ago

b-reyes commented 1 year ago

In PR #808 it was found that some files from noaa-wcsd-pds/data/raw/Bell_M._Shimada/SH1701/EK60 can have different sized channel dimensions. In PR #808 we will not allow these types of files to be combined, but we do see that there is value in allowing for these types of files to be combined. Thus, we need to discuss the possibility of allowing missing channels amongst the datasets being combined.

From discussion, there appears to be two ways to resolve this:

  1. Allow a user to combine these files, if the files with less channels do not introduce any new channel names. For example, if we have a dataset with 4 channels and another with 3, we can still combine these files as long as the dataset with 3 channels doesn't contain different channels.
    • This particular case would introduce a potentially high number of NaNs as the files with less than the maximum amount of channels would be padded.
  2. Allow combination, but only write those channels contained in the intersection of all channels amongst the Datasets being combined. For example, if we have a Dataset with channels ['a', 'b', 'c'] and another with ['a', 'c'], then the combined Dataset would have the channels ['a', 'c'].
    • This approach has the advantage of not increasing the data size (no NaN padding), however, it introduces the loss of data (since we removed a channel dimension).

If we can't decide on which option is best, we could have a user input to combine that allows the user to choose which of the options above is the most appropriate for their situation.

leewujung commented 1 year ago

If we can't decide on which option is best, we could have a user input to combine that allows the user to choose which of the options above is the most appropriate for their situation.

I am thinking that, in this release (v0.6.3), we can default to raise an error if the EchoData or files to be combined have different number of channels, or if the number of channels is the same but the channel_id do not match exactly.

In a later release, I agree with your suggestion on letting people select which subset of channels they would like to combine or allow to expand. Using the example you have above, if the full set of channels from all files are a, b, c, d, and some files only have a, b, c and some files have a, b, c, d.

Allowing these would be pretty nice, especially the subsetting part. We should caution against using the expansion approach since that would risk creating large dataset.

leewujung commented 1 year ago

In Nov 18 meeting:

b-reyes commented 1 year ago

Based on a discussion between @leewujung and myself, we will first complete the following changes that partially address this issue:

leewujung commented 1 year ago

The below are only for the current implementation that ONLY allows for user choosing a subset of the channels. Expansion is not allowed.

leewujung commented 1 year ago

@b-reyes and I decided that we'll close this issue and create a new one for the "expansion" type of combine operation.