Parsing error when redundant frequencies are found in EK60 data

oftfrfbf commented 2 years ago

Scanning through some miscellaneous EK60 files I have been finding issues when redundant frequencies are packaged in the raw data.

Example code:

import echopype as ep

raw_file_s3path = "s3://noaa-wcsd-pds/data/raw/Oscar_Dyson/DY1002/EK60/DY1002_EK60-D20100318-T023008.raw"
ed = ep.open_raw(raw_file_s3path, sonar_model='EK60', storage_options={'anon': True})

Returns:

ValueError: cannot reindex or align along dimension 'frequency' because the index has duplicate values

Looking at the same data with pyEcholab:

from echolab2.instruments import EK60

ek60 = EK60.EK60()
ek60.read_raw("DY1002_EK60-D20100318-T023008.raw")
print(ek60)

I can see that there are 6 channels reported including two 70 kHz channels. My understanding is that physically the same hardware was present throughout the cruise — the likely cause is that the transceiver number in the ER60 software changed for whatever reason ("3-1" to "3-2").

<class 'echolab2.instruments.EK60.EK60'> at 0x7fe5e3322850
    EK60 object contains data from 6 channels:
        1:GPT  18 kHz 009072034d45 1-1 ES18-11
        2:GPT  38 kHz 009072033fa2 2-1 ES38B
        3:GPT  70 kHz 009072058c6c 3-1 ES70-7C
        4:GPT  70 kHz 009072058c6c 3-2 ES70-7C
        5:GPT 120 kHz 00907205794e 4-1 ES120-7C
        6:GPT 200 kHz 0090720346a8 5-1 ES200-7C
    data start time: 2010-03-18T02:30:08.779
      data end time: 2010-03-18T02:47:01.549
    number of pings: 609
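The duplicate in the listing above can also be found programmatically. The following is a minimal sketch, not part of echopype or pyEcholab: it assumes the channel ID strings have the whitespace-separated layout printed above, and `frequency_khz` is a hypothetical helper introduced only for illustration.

```python
from collections import defaultdict

# Channel IDs copied from the pyEcholab listing above.
channel_ids = [
    "GPT  18 kHz 009072034d45 1-1 ES18-11",
    "GPT  38 kHz 009072033fa2 2-1 ES38B",
    "GPT  70 kHz 009072058c6c 3-1 ES70-7C",
    "GPT  70 kHz 009072058c6c 3-2 ES70-7C",
    "GPT 120 kHz 00907205794e 4-1 ES120-7C",
    "GPT 200 kHz 0090720346a8 5-1 ES200-7C",
]


def frequency_khz(channel_id):
    """Hypothetical helper: read the nominal frequency from a channel ID.

    Assumes the second whitespace-separated field is the frequency in kHz,
    as in the listing above.
    """
    return int(channel_id.split()[1])


# Group channel IDs by nominal frequency.
channels_by_frequency = defaultdict(list)
for cid in channel_ids:
    channels_by_frequency[frequency_khz(cid)].append(cid)

# Frequencies shared by more than one channel cannot serve as a unique index.
duplicated = {f: ids for f, ids in channels_by_frequency.items() if len(ids) > 1}

for freq, ids in duplicated.items():
    print(f"{freq} kHz is used by {len(ids)} channels:")
    for cid in ids:
        print(f"  {cid}")
```

The channel IDs themselves remain unique even though the frequency is repeated.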

The error occurs in set_groups_ek60.py when "ds" and "ds_backscatter" are merged.

https://github.com/OSOceanAcoustics/echopype/blob/1548f46cf7f163188b2858e9a6838290fd25b7ec/echopype/convert/set_groups_ek60.py#L651

Merging the two 70 kHz channels as shown below works fine because they contain mutually exclusive data, but consolidating the result back into the main dataset is giving me some trouble. I'll be working on this to see if I can figure out a solution.

xr.merge([ds_backscatter[0], ds_backscatter[1], xr.merge([ds_backscatter[2], ds_backscatter[3]]), ds_backscatter[4], ds_backscatter[5]])

As an aside, I am working on ensuring that as many EK60 files can be processed as possible for the public Water-Column Sonar Data Archive.

I have scanned through the bucket and found a total of 399,105 raw EK60 files. This comprises 288 unique cruises. Processing the first file of each cruise with the latest version of echopype I was able to successfully open files from 210 cruises while 78 encountered some type of exception (such as the issue above).

As it was just the first file of each cruise this isn't a comprehensive metric. I will follow up with other issues I am noticing (and I can share the script if desired).
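For readers who want to run a similar scan, here is a minimal sketch of how the outcome per file could be tallied. It is an illustration only, not the script mentioned above: `scan_files` and `summarize` are hypothetical names, and the opener is passed in as a callable so the harness itself does not depend on echopype.

```python
from collections import Counter


def scan_files(paths, open_fn):
    """Try to open each path with open_fn and record the outcome.

    Returns a dict mapping path -> None on success, or a short
    "ExceptionType: message" string on failure.
    """
    results = {}
    for path in paths:
        try:
            open_fn(path)
        except Exception as exc:  # keep scanning whatever goes wrong
            results[path] = f"{type(exc).__name__}: {exc}"
        else:
            results[path] = None
    return results


def summarize(results):
    """Return (number of successes, Counter of failure messages)."""
    n_ok = sum(1 for msg in results.values() if msg is None)
    failures = Counter(msg for msg in results.values() if msg is not None)
    return n_ok, failures


# With echopype, open_fn could wrap the call from the example at the top:
#   lambda p: ep.open_raw(p, sonar_model='EK60', storage_options={'anon': True})
```

Grouping by exception message makes it easier to see which failure modes are most common across cruises.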

gavinmacaulay commented 2 years ago

The Simrad echosounders can operate multiple channels at the same frequency, so two 70 kHz channels in one file can happen, though it is not that common. In the extreme, a file with many identical frequency channels is possible (20-ish is the data collection software limit, I think). But the file format allows for practically unlimited channels - I've seen files with several hundred channels at the same frequency.

The file above looks to use the multiplexing feature in Simrad systems, where a single transceiver can be switched between two different transducers on alternate pings. The '3-1' and '3-2' in the channel id is the way that the EK60 distinguishes between data channels that use the same transceiver but different transducers.

Merging these two 70 kHz channels is probably not the thing to do in most use cases - users will more likely want to treat them as separate datasets in further analysis.

With regards to not being able to open all first files from the Water-Column Sonar Data Archive, #409 may resolve some of those if you're on a branch without that included.

oftfrfbf commented 2 years ago

Thanks @gavinmacaulay for the explanation. I think I understand and appreciate now the requirement to process the data differently downstream instead of just merging it as I had suggested above.

Is the path forward then to package the data not by frequency but by channel ID, or is there some other solution?

Also, I scanned the files again with the dev branch installed (ep.version of '0.5.5.dev49+g8180da1') and unfortunately did not see any difference regarding the number of files that could be processed.

Results: files_with_exceptions.txt

gavinmacaulay commented 2 years ago

Yes, with Simrad files, channel id is the unique thing - from memory, nothing else in the data files is guaranteed to be unique about the channel. Certainly, using channel id will support better what can be found in Simrad files, but a way to allow for a frequency dimension in multi-channel datasets would be rather useful.
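One way to get both properties with xarray is to index by channel ID and carry frequency as a non-index coordinate on the same dimension. The sketch below is an illustration under assumptions, not echopype's actual schema: the names `channel`, `frequency_nominal`, `backscatter_r`, `ping` and `range_sample` are chosen for the example, and the data are zeros.

```python
import numpy as np
import xarray as xr

# Channel IDs are unique even when the nominal frequency repeats.
channel_ids = [
    "GPT  70 kHz 009072058c6c 3-1 ES70-7C",
    "GPT  70 kHz 009072058c6c 3-2 ES70-7C",
    "GPT 120 kHz 00907205794e 4-1 ES120-7C",
]
frequency_hz = [70e3, 70e3, 120e3]

n_ping, n_range = 4, 5
ds = xr.Dataset(
    data_vars={
        "backscatter_r": (
            ("channel", "ping", "range_sample"),
            np.zeros((len(channel_ids), n_ping, n_range)),
        ),
    },
    coords={
        # Unique index along the channel dimension.
        "channel": channel_ids,
        # Frequency rides along as a non-index coordinate, so duplicates are fine.
        "frequency_nominal": ("channel", frequency_hz),
    },
)

# Select exactly one channel by its unique ID.
one_channel = ds.sel(channel="GPT  70 kHz 009072058c6c 3-2 ES70-7C")

# Or select every channel at a given frequency, keeping them separate.
all_70khz = ds.where(ds["frequency_nominal"] == 70e3, drop=True)
```

Because the index is unique, alignment and merging along `channel` do not hit the duplicate-value error, while selection by frequency is still possible.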

leewujung commented 2 years ago

Yes, I think that using the channel ID would be a more robust way to go. We use that in the Vendor group for the EK80 file, because it is a way better index to select the right filter coefficients for different channels. This will also help with making the matching between transceiver and transducer explicit (the identifier in the config XML can take various forms, e.g. #145, #146), and make the dimensions more consistent across different groups.

Also tagging @emiliom here, because we've been discussing what a more appropriate name is to replace the current frequency dimension -- for narrowband echosounders like EK60 and AZFP this is not a problem, but for broadband echosounders like EK80 using frequency to indicate different channels is misleading.

There's certainly a larger question related to the convention behind this (Beam_group vs frequency/channel and the different forms of multidimensional array representation), but that's probably better as a separate conversation.

leewujung commented 2 years ago

I have scanned through the bucket and found a total of 399,105 raw EK60 files. This comprises 288 unique cruises. Processing the first file of each cruise with the latest version of echopype I was able to successfully open files from 210 cruises while 78 encountered some type of exception (such as the issue above).

As it was just the first file of each cruise this isn't a comprehensive metric. I will follow up with other issues I am noticing (and I can share the script if desired).

@oftfrfbf : It is awesome that you're doing this and thanks for reporting all the issues! It'll be fantastic to resolve them one by one to make the code more robust!

A small suggestion: in your scan it may be useful to also process one of the files from the middle of each cruise. There are often "tweaks" at the beginning of a cruise, so later files might be a better representation of what the majority of the files from the cruise look like.
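A minimal sketch of that sampling, under two assumptions: the paths follow the `.../<ship>/<cruise>/EK60/<file>.raw` layout of the example path at the top of this issue, and file names within a cruise sort chronologically because of their `D<date>-T<time>` stamp. Both function names are hypothetical.

```python
from collections import defaultdict


def cruise_from_path(path):
    """Assumes .../<ship>/<cruise>/EK60/<file>.raw, so cruise is third from the end."""
    return path.split("/")[-3]


def sample_files_per_cruise(paths, cruise_of=cruise_from_path):
    """Pick the first and the middle file of each cruise.

    Returns a dict mapping cruise -> sorted list of one or two paths
    (one when the cruise has a single file).
    """
    by_cruise = defaultdict(list)
    for path in paths:
        by_cruise[cruise_of(path)].append(path)

    samples = {}
    for cruise, files in by_cruise.items():
        files.sort()  # assumed chronological because of the timestamp in the name
        picks = {files[0], files[len(files) // 2]}
        samples[cruise] = sorted(picks)
    return samples
```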

gavinmacaulay commented 2 years ago

@oftfrfbf: some comments on the list of files with exceptions:

'short read' is most likely due to the recording software crashing while writing and the file ending abruptly. #409 fixed one of those, but not all, it appears...
Looks like skipping some NME0 datagrams mucks up the datagram reading sometimes
The NoneType object is not subscriptable suggests that the first ping has no data in it. #409 fixed one of those also...
Finding a TAG datagram seems to cause the 'transceivers' warning?
'list' object has no attribute 'shape' is likely a similar problem to the NoneType one above

oftfrfbf commented 2 years ago

Finding a TAG datagram seems to cause the 'transceivers' warning?

I noticed several other instances where tagged datagrams were processed fine. I can follow up but I think that was an instance where tagging was unrelated to the exception.

leewujung commented 2 years ago

I changed the title of this issue to cover just the parsing part, to reflect the content. @b-reyes found an issue with compute_Sv with this particular file and @leewujung will be working on it in another PR.

I'll close this now since #657 is merged and already resolves the parsing problem.

OSOceanAcoustics / echopype

Parsing error when redundant frequencies are found in EK60 data #490