OSOceanAcoustics / echopype

Enabling interoperability and scalability in ocean sonar data analysis
https://echopype.readthedocs.io/
Apache License 2.0

`add_location` fails with duplicated timestamps in GPS data #1294

Closed ashtonflinders closed 2 months ago

ashtonflinders commented 3 months ago

Not sure if this is a code issue or my own misunderstanding of the required variables...

```python
import echopype as ep
from echopype import open_raw

raw_file_s3path = "s3://noaa-wcsd-pds/data/raw/Alaska_Knight/NEBS21AK/ES60/L0008-D20210531-T044412-ES60.raw"
ed = open_raw(
    raw_file_s3path, sonar_model='EK60',
    storage_options={'anon': True}
)

Sv = ep.calibrate.compute_Sv(ed)
ep.consolidate.add_location(Sv, ed)
```

```python-traceback
---------------------------------------------------------------------------
InvalidIndexError                         Traceback (most recent call last)
Cell In[25], line 1
----> 1 ep.consolidate.add_location(Sv, ed)

File ~/opt/anaconda3/envs/echopype/lib/python3.9/site-packages/echopype/utils/prov.py:237, in add_processing_level.<locals>.wrapper.<locals>.inner(*args, **kwargs)
    235 @functools.wraps(func)
    236 def inner(*args, **kwargs):
--> 237     dataobj = func(*args, **kwargs)
    238     if is_echodata:
    239         ed = dataobj

File ~/opt/anaconda3/envs/echopype/lib/python3.9/site-packages/echopype/consolidate/api.py:196, in add_location(ds, echodata, nmea_sentence)
    194 interp_ds = ds.copy()
    195 time_dim_name = list(echodata["Platform"]["longitude"].dims)[0]
--> 196 interp_ds["latitude"] = sel_interp("latitude", time_dim_name)
    197 interp_ds["longitude"] = sel_interp("longitude", time_dim_name)
    198 # Most attributes are attached automatically via interpolation
    199 # here we add the history

File ~/opt/anaconda3/envs/echopype/lib/python3.9/site-packages/echopype/consolidate/api.py:186, in add_location.<locals>.sel_interp(var, time_dim_name)
    179     return xr.DataArray(
    180         data=position_var.values[0] * np.ones(len(ds["ping_time"]), dtype=np.float64),
    181         dims=["ping_time"],
    182         attrs=position_var.attrs,
    183     )
    184 else:
    185     # Values may be nan if there are ping_time values outside the time_dim_name range
--> 186     return position_var.interp(**{time_dim_name: ds["ping_time"]})

File ~/opt/anaconda3/envs/echopype/lib/python3.9/site-packages/xarray/core/dataarray.py:2307, in DataArray.interp(self, coords, method, assume_sorted, kwargs, **coords_kwargs)
   2303 if self.dtype.kind not in "uifc":
   2304     raise TypeError(
   2305         f"interp only works for a numeric type array. Given {self.dtype}."
   2306     )
-> 2307 ds = self._to_temp_dataset().interp(
   2308     coords,
   2309     method=method,
   2310     kwargs=kwargs,
   2311     assume_sorted=assume_sorted,
   2312     **coords_kwargs,
   2313 )
   2314 return self._from_temp_dataset(ds)

File ~/opt/anaconda3/envs/echopype/lib/python3.9/site-packages/xarray/core/dataset.py:3976, in Dataset.interp(self, coords, method, assume_sorted, kwargs, method_non_numeric, **coords_kwargs)
   3974 if method in ["linear", "nearest"]:
   3975     for k, v in validated_indexers.items():
-> 3976         obj, newidx = missing._localize(obj, {k: v})
   3977         validated_indexers[k] = newidx[k]
   3979 # optimization: create dask coordinate arrays once per Dataset
   3980 # rather than once per Variable when dask.array.unify_chunks is called later
   3981 # GH4739

File ~/opt/anaconda3/envs/echopype/lib/python3.9/site-packages/xarray/core/missing.py:559, in _localize(var, indexes_coords)
    557 maxval = np.nanmax(new_x.values)
    558 index = x.to_index()
--> 559 imin = index.get_indexer([minval], method="nearest").item()
    560 imax = index.get_indexer([maxval], method="nearest").item()
    561 indexes[dim] = slice(max(imin - 2, 0), imax + 2)

File ~/opt/anaconda3/envs/echopype/lib/python3.9/site-packages/pandas/core/indexes/base.py:3885, in Index.get_indexer(self, target, method, limit, tolerance)
   3882 self._check_indexing_method(method, limit, tolerance)
   3884 if not self._index_as_unique:
-> 3885     raise InvalidIndexError(self._requires_unique_msg)
   3887 if len(target) == 0:
   3888     return np.array([], dtype=np.intp)

InvalidIndexError: Reindexing only valid with uniquely valued Index objects
```
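The failure can be reproduced outside echopype. A minimal sketch (synthetic timestamps and latitudes, not values from the file above) showing that xarray's default linear interpolation goes through the same pandas `get_indexer` path and fails on a duplicated time index:

```python
import numpy as np
import xarray as xr

# Two NMEA sentences logged per second produce duplicated time values,
# mimicking the Platform group's time1 coordinate in this issue.
time1 = np.array(
    ["2021-05-31T12:44:12", "2021-05-31T12:44:12",
     "2021-05-31T12:44:13", "2021-05-31T12:44:13"],
    dtype="datetime64[ns]",
)
latitude = xr.DataArray(
    np.array([56.1, 56.1, 56.2, 56.2]),
    dims=["time1"],
    coords={"time1": time1},
)

ping_time = np.array(["2021-05-31T12:44:12.500"], dtype="datetime64[ns]")
try:
    latitude.interp(time1=ping_time)
except Exception as err:
    # In the environment from the traceback above this is
    # pandas.errors.InvalidIndexError.
    print(type(err).__name__, err)
```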

ashtonflinders commented 3 months ago

I see that the issue is related to repeated pairs of entries in the `time1` coordinate:

```python
ed["Platform"]["time1"].data
Out[128]:
array(['2021-05-31T12:44:12.000000000', '2021-05-31T12:44:12.000000000',
       '2021-05-31T12:44:13.000000000', '2021-05-31T12:44:13.000000000',
       '2021-05-31T12:44:14.000000000', '2021-05-31T12:44:14.000000000',
       '2021-05-31T12:44:15.000000000', '2021-05-31T12:44:15.000000000',
       ...
```

This is due to two NMEA sentence strings being logged on the same timestamp:

```python
ed["Platform"]["sentence_type"].data
Out[130]:
array(['GGA', 'RMC', 'GGA', 'RMC', 'GGA', 'RMC', 'GGA', 'RMC', 'GGA',
       'RMC', 'GGA', 'RMC', 'GGA', 'RMC', 'GGA', 'RMC', 'GGA', 'RMC',
       ...
```

Using the `nmea_sentence` argument of `add_location` fixes the issue. Maybe there should be an error check?
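What selecting a single sentence type accomplishes can be sketched standalone: keeping only the GGA rows leaves a unique `time1` index, which interpolation accepts. The dataset below is synthetic, not the actual Platform group, and the filtering is done by hand rather than through `add_location`:

```python
import numpy as np
import xarray as xr

# Synthetic Platform-like dataset with GGA/RMC pairs on duplicated timestamps.
time1 = np.array(
    ["2021-05-31T12:44:12", "2021-05-31T12:44:12",
     "2021-05-31T12:44:13", "2021-05-31T12:44:13"],
    dtype="datetime64[ns]",
)
platform = xr.Dataset(
    {
        "latitude": ("time1", [56.1, 56.1, 56.2, 56.2]),
        "sentence_type": ("time1", ["GGA", "RMC", "GGA", "RMC"]),
    },
    coords={"time1": time1},
)

# Keep only GGA rows: the remaining time1 index is unique.
mask = platform["sentence_type"].values == "GGA"
gga = platform.isel(time1=np.flatnonzero(mask))

# Interpolation to a ping time between the two kept samples now succeeds.
ping_time = np.array(["2021-05-31T12:44:12.500"], dtype="datetime64[ns]")
lat = gga["latitude"].interp(time1=ping_time)
print(float(lat.values[0]))  # ~56.15
```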

leewujung commented 3 months ago

> Maybe there should be an error check?

@ashtonflinders: I think that's a good idea. Pandas provides the actual error check for this, but what we could do here is print a statement about which variables to check when the raised error is an `InvalidIndexError`.
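A sketch of what such an early check might look like. The function name, message text, and placement are hypothetical, not echopype's actual implementation; it only illustrates failing fast with a pointed hint instead of letting pandas raise:

```python
import numpy as np
import xarray as xr

def check_unique_time(da: xr.DataArray, time_dim: str) -> None:
    """Hypothetical pre-interpolation check: raise an informative error
    when the time coordinate contains duplicated values."""
    if not da.indexes[time_dim].is_unique:
        raise ValueError(
            f"Duplicated timestamps found in '{time_dim}'. "
            "Consider passing the `nmea_sentence` argument to "
            "`add_location` to select a single NMEA sentence type."
        )

# Example with duplicated time1 values, as in this issue:
time1 = np.array(
    ["2021-05-31T12:44:12", "2021-05-31T12:44:12"], dtype="datetime64[ns]"
)
lat = xr.DataArray([56.1, 56.1], dims=["time1"], coords={"time1": time1})
try:
    check_unique_time(lat, "time1")
except ValueError as err:
    print(err)
```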

@ctuguinay : Since you're working on adding a warning message for add_location for #1286, could you add this in the same PR? Thanks!

ctuguinay commented 3 months ago

@leewujung Will do and sounds good 👍

ctuguinay commented 3 months ago

Oh yeah, this also happened when organizing our own labels for the hake project, and in the same place too. Interesting... @leewujung

ctuguinay commented 2 months ago

Closed by #1296