OSOceanAcoustics / echopype

Enabling interoperability and scalability in ocean sonar data analysis
https://echopype.readthedocs.io/
Apache License 2.0
95 stars 73 forks source link

echodata.compute_range() returns full matrix while some backscatter entries are NaN-padded #483

Closed gavinmacaulay closed 2 years ago

gavinmacaulay commented 2 years ago

The EK80 test file, D20170912-T234910.raw, has two channels (70 and 200 kHz) with different sample rates (about a x3 difference). When using the echodata.compute_range() function, the returned range vector for the 70 kHz channel is incorrect - it has the same number of range values as the 200 kHz channel, when it should have about a third less. Due to how the range vector is calculated, the actual ranges in the 70 kHz vector are, per-sample, a third of what they should be. The dropna() method used when extracting the Sv at 70 kHz doesn't work for the 70 kHz range vector due to this.

Code, derived from test_calibrate.py, that shows this is:

from pathlib import Path
import echopype as ep

ek80_path = Path('./test_data/ek80')
ek80_raw_path = ek80_path.joinpath('D20170912-T234910.raw')

echodata = ep.open_raw(ek80_raw_path, sonar_model='EK80')
ds_Sv = ep.calibrate.compute_Sv(echodata, waveform_mode='BB', encode_mode='complex')

Sv_70k = ds_Sv.Sv.isel(frequency=0, ping_time=0).dropna('range_bin').values

Sv_range = echodata.compute_range(env_params=[], ek_waveform_mode='BB')
Sv_70k_range = Sv_range.isel(frequency=0, ping_time=0).values
Sv_200k_range = Sv_range.isel(frequency=1, ping_time=0).values

print(f'Backscatter data at 70 kHz has {Sv_70k.shape[0]} samples')
print(f'But range vector at 70 kHz has {Sv_70k_range.shape[0]} samples')
print(f'Range vector at 200 kHz has {Sv_200k_range.shape[0]} samples')

print(f'Range at max sample for 70 kHz is {Sv_70k_range[-1]:.2f} m')
print(f'Range at max sample for 200 kHz is {Sv_200k_range[-1]:.2f} m')
leewujung commented 2 years ago

@gavinmacaulay : I looked into this and think the calculation itself is correct. The actual range vector (in meters) in the computed Sv dataset is stored in the variable range, and it is calculated using echodata.compute_range() under the hood. When range is used in conjunction with the computed Sv data, only the first 8284 range samples will be used, since there is no data after that for the 70kHz channel (NaN-padded dataarray).

The number of samples in your Sv_70k_range is based on the NaN-padded dataarray, so have 25228 samples, just like Sv_200k_range. The difference is in the NaN-padded, actual Sv variable.

That said, I can see that we can avoid this confusion by setting all samples in range that corresponds to the NaN samples in the 70k Sv to also be NaN, so that even though the length of the Sv_70k_range length is still 25228, the values it contains will ends at the 8284th sample.

Thoughts?

leewujung commented 2 years ago

Another subtlety here is that each ping may not have the same number of non-NaN samples, even though for most datasets the max range (before the first NaN sample) is the same. This is why the output of echodata.compute_range() (or ds_Sv.range) has the same dimensions.