OSOceanAcoustics / echopype

Enabling interoperability and scalability in ocean sonar data analysis
https://echopype.readthedocs.io/
Apache License 2.0
94 stars 73 forks source link

Improve and add documentation for `_clean_ping_time` #798

Closed b-reyes closed 1 year ago

b-reyes commented 2 years ago

In PR #297 we created the function _clean_ping_time, but did not include documentation. This function involves a recursive call, so it is not apparent what is happening. After discussing this with @leewujung, we decided that it is important to include some documentation for this function.

Additionally, @leewujung do you think it is possible to remove local_win_len as an input and determine it within _clean_ping_time? I think that would make using this function less error-prone.

b-reyes commented 2 years ago

@leewujung I was thinking about the _clean_ping_time function and I have constructed an alternative approach that avoids recursion, improves readability, and removes the need for a window.

The goal of this code is to create ascending values and preserve the step size (wherever possible). Below is my code/example to do this. If you think it is a good candidate to replace _clean_ping_time, I can create a PR that fully implements it. Note: I quickly implemented it, so if you see any issues, please let me know.

import numpy as np

arr = np.array([0,1,2,3,4,2,3,4,6,8,10,11,12,13,15,17,19,13,15,17,21])
arr_diff = np.diff(arr)

print(f"arr = {arr}")
print(f"arr_diff = {arr_diff} \n")

# get indices of arr_diff with negative values                                                        
neg_ind = np.argwhere(arr_diff < 0).flatten()

# TODO: account for neg_ind + 1 = len(arr_diff)                                                       
# TODO: do we need to account for neg diff next to eachother?                                         
# replace negative difference with difference one index after                                         
arr_diff[neg_ind] = arr_diff[neg_ind + 1]

# perform cummulative sum of differences after 1st neg index                                          
c_diff = np.cumsum(arr_diff[neg_ind[0]:], axis=0)

# create new array that preserves differences, but enforces increasing vals                           
new_arr = arr[:]
new_arr[neg_ind[0]+1:] = new_arr[neg_ind[0]] + c_diff
print(f"new_arr = {new_arr}")
print(f"np.diff(new_arr) = {np.diff(new_arr)}")

The output of the above example:

arr = [ 0  1  2  3  4  2  3  4  6  8 10 11 12 13 15 17 19 13 15 17 21]
arr_diff = [ 1  1  1  1 -2  1  1  2  2  2  1  1  1  2  2  2 -6  2  2  4] 

new_arr = [ 0  1  2  3  4  5  6  7  9 11 13 14 15 16 18 20 22 24 26 28 32]
np.diff(new_arr) = [1 1 1 1 1 1 1 2 2 2 1 1 1 2 2 2 2 2 2 4]
leewujung commented 2 years ago

Oh cool, thanks! Let me think through this and also read my own (undocumented) code...

leewujung commented 1 year ago

Linking to https://github.com/OSOceanAcoustics/echopype/pull/808#discussion_r984074790 so that we remember to add explanation of how we correct for the reversed time in the documentation and link in the combine_echodata docstring.

b-reyes commented 1 year ago

@leewujung I am not sure if this is the correct issue to bring this up in, but while combining some files I kept running into _clean_ping_time trying to access an index outside of bounds. To help us out, I have identified a single Dataset, where this is occurring. Please see the below code snippet, which triggers the out of bounds error:

ed = ep.open_raw("s3://noaa-wcsd-pds/data/raw/Bell_M._Shimada/SH2106/EK80/Hake-D20210701-T130426.raw", 
                 sonar_model="EK80", storage_options={'anon': True})

from echopype.qc import coerce_increasing_time
coerce_increasing_time(eds[0]["Platform"], "time1")
leewujung commented 1 year ago

@b-reyes and @leewujung discussed and identified the following to be resolved for _clean_ping_time:

leewujung commented 1 year ago

Hey @b-reyes : I finally read through the code you have above -- it is a wayyy better implementation than my recursive function, so I think we should use this.

For this part:

# TODO: account for neg_ind + 1 = len(arr_diff)
# replace negative difference with difference one index after                                         
arr_diff[neg_ind] = arr_diff[neg_ind + 1]

I think it may be better to use the index before the negative jump (neg_ind - 1) to fill in the value. This way the scenario of having neg_ind + 1 > len(arr_diff) also would not happen.

For this other question:

# TODO: do we need to account for neg diff next to eachother?

I agree it is safer to account for that. Seem that we can accomplish this using a simple while loop -- to check for whether there is still negative index, and if so, correct again using the same routine and issue the warning at the same time.

If things like doubly or triply negative diff happens, it may be that the files are actually problematic, like what we saw in the 2021 GPS data. What do you think if we do something like the following?

leewujung commented 1 year ago

Addressed in #1065.