OSOceanAcoustics / echopype

Enabling interoperability and scalability in ocean sonar data analysis
https://echopype.readthedocs.io/
Apache License 2.0
95 stars 73 forks source link

Unordered `channel` coordinate #815

Closed b-reyes closed 1 year ago

b-reyes commented 1 year ago

Recently I attempted to combine narrowband EK80 files from noaa-wcsd-pds. While doing this I found that the files had the same channels, but the ordering of these channels were different. For example, the channel coordinate had the following values for two different raw files:

['WBT 400142-15 ES70-7C_ES' 'WBT 400143-15 ES38B_ES'
 'WBT 400141-15 ES18_ES' 'WBT 400140-15 ES120-7C_ES'
 'WBT 400145-15 ES200-7C_ES']

['WBT 400143-15 ES38B_ES' 'WBT 400142-15 ES70-7C_ES'
 'WBT 400140-15 ES120-7C_ES' 'WBT 400145-15 ES200-7C_ES'
 'WBT 400141-15 ES18_ES']

From my understanding, this is purely caused by how we handle the parsed data (@emiliom or @leewujung can you confirm this?). In set_groups, we create datasets by looping over the channels. However, we never sort these channels. This can be easily seen in EK80 when setting the Platform group:

https://github.com/OSOceanAcoustics/echopype/blob/361d44ee319dbe25b610759cc8c6e42d4f469135/echopype/convert/set_groups_ek80.py#L218-L228

For the new combine routine, it is imperative that all Datasets being combined have the channel coordinate in the same order.

To rectify this situation, I propose that we review set_groups and parsed_to_zarr and make sure that channels are always sorted in ascending order.

Note: I don't think we can order the class object self.parser_obj.ch_ids because of how we utilize it in various situations (see example above).

As I need this to be fixed to continue with the combine routine, I will go ahead and work on this.

leewujung commented 1 year ago

From my understanding, this is purely caused by how we handle the parsed data

I think it is probably fair to say that it is determined by what the order is in the first set of datagrams for all channels. This order can change depending on which one is sent by the instrument first.

For the new combine routine, it is imperative that all Datasets being combined have the channel coordinate in the same order.

I think it is possible to short circuit this by figuring out the channel sequence when you have the individual datasets/groups, and just write that in the "right" order, since you can control which part of the final zarr array you are filling in and the coordinates are loaded for the individual datasets/groups whether lazy-loaded or not.

That said, the sorting shouldn't take too much time, and it is a good thing to have.

Note: I don't think we can order the class object self.parser_obj.ch_ids because of how we utilize it in various situations (see example above).

For the example you referenced or the other cases when you only save the broadband or narrowband channels into different groups, I think you can sort the final set of ch_ids under each group and assemble the dataset according to that ~, and use xr.Dataset.transpose to sort the already assembled dataset under that group (which is according to the channel sequence of that particular .raw file) into the correct order~ before saving.

As you said, we should not change whatever the order in self.parser_obj -- that will likely break many things!

b-reyes commented 1 year ago

I think it is possible to short circuit this by figuring out the channel sequence when you have the individual datasets/groups, and just write that in the "right" order, since you can control which part of the final zarr array you are filling in and the coordinates are loaded for the individual datasets/groups whether lazy-loaded or not.

Yes, I agree that it is not absolutely necessary and we could do what you are suggesting here. The downside is this may not be efficient because it could introduce more IO. I suggest we avoid this situation unless a user specifically asks for this type of functionality to be added (I could see something like this being useful).

That said, the sorting shouldn't take too much time, and it is a good thing to have.

I agree, hopefully it will not take too much time and it will improve the consistency of the EchoData creation. I will go ahead and work on this.

leewujung commented 1 year ago

This has been addressed in #818.