OSOceanAcoustics / echoregions

Interfacing water column sonar data with annotations and labels
https://echoregions.readthedocs.io/
Apache License 2.0
6 stars 6 forks source link

Import Transect Checking from Hake-Labels #162

Closed ctuguinay closed 6 months ago

ctuguinay commented 10 months ago

This addresses #123.

It no longer checks that ST and ET exist. I made this change in the case where an EVR file doesn't contain all the information for a transect, like if it only contains RT-BT-RT-BT-ET or ST-BT-RT-BT-RT segments.

codecov[bot] commented 10 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 88.68%. Comparing base (95cb3e1) to head (fbd97b7). Report is 1 commits behind head on main.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #162 +/- ## ========================================== + Coverage 87.91% 88.68% +0.77% ========================================== Files 13 13 Lines 571 557 -14 ========================================== - Hits 502 494 -8 + Misses 69 63 -6 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

ctuguinay commented 9 months ago

@leewujung Thank you for the comments! I'll respond to your longer comments below:

"Dictionary for transect values" does not make sense to me. Do you mean annotation labels of "log" type? Could you re-word this to be more precise?

I wasn't quite sure what to call them since they denote what part of the transect we are on, and log is already a designated class name. Maybe I instead call them transect_sequence_type or something to denote where the EVR region is in the context of the transect map?

is very hard-coded, so I am not sure why the function wouldn't just take transect_strs and transect_next_allowable as inputs directly. Doing that may help users understand what these strings are used for.

Yeah, I can use that convention instead.

could be factored out to be a function to produce warnings (or not) so that the main flow of the transect_mask function is cleaner.

Maybe I separate the transect masking and the checking of the transect sequences?

The 2 cases (start_str followed by break_str or end_str; and resume_str followed by break_str or end_str) are exactly identical to what you have in the transect_next_allowable dictionary, under the keys start_str and resume_str -- and this is obviously should be the case, because that's why these are allowed combinations. How about just using the transect_next_allowable dict to set the within_transect values?

Yup, that sounds cleaner.

Since transect_df["within_transect"] is already a clean column of where in transect_df should be within or outside of a transect segment, and here we only want the "within_transect=True" case, I think you can take out the loop completely.

By the above, do you mean remove this?

 for transect_querying_file_name in transect_querying_list: 
     within_transect_df = transect_df.query( 
         f'file_name == "{transect_querying_file_name}" and within_transect == True' 
     ) 

So instead I just iterate through transect_df within_transect True values only?

leewujung commented 9 months ago

I added back the numbers so that I can reference these more easily:

1

"Dictionary for transect values" does not make sense to me. Do you mean annotation labels of "log" type? Could you re-word this to be more precise?

I wasn't quite sure what to call them since they denote what part of the transect we are on, and log is already a designated class name. Maybe I instead call them transect_sequence_type or something to denote where the EVR region is in the context of the transect map?

Sounds good

2

is very hard-coded, so I am not sure why the function wouldn't just take transect_strs and transect_next_allowable as inputs directly. Doing that may help users understand what these strings are used for.

Yeah, I can use that convention instead.

Sounds good.

3

could be factored out to be a function to produce warnings (or not) so that the main flow of the transect_mask function is cleaner.

Maybe I separate the transect masking and the checking of the transect sequences?

I think you can have a new function for checking the transect sequences to produce warnings (or not), but call this function only within the current transect_mask. The checking is very specific (and so is transect_mask) that I don't see a lot of values in separating these as 2 functions that can be used independently in general.

4

The 2 cases (start_str followed by break_str or end_str; and resume_str followed by break_str or end_str) are exactly identical to what you have in the transect_next_allowable dictionary, under the keys start_str and resume_str -- and this is obviously should be the case, because that's why these are allowed combinations. How about just using the transect_next_allowable dict to set the within_transect values?

Yup, that sounds cleaner.

👍

5

Since transect_df["within_transect"] is already a clean column of where in transect_df should be within or outside of a transect segment, and here we only want the "within_transect=True" case, I think you can take out the loop completely.

By the above, do you mean remove this?

 for transect_querying_file_name in transect_querying_list: 
     within_transect_df = transect_df.query( 
         f'file_name == "{transect_querying_file_name}" and within_transect == True' 
     ) 

So instead I just iterate through transect_df within_transect True values only?

Yeah, that's what I think. And for completeness I added back the second half of this part of comment below:

We've been operating under the assumption that it is expected that region_df may contain timestamp ranges that are larger than what's in da_Sv, and whatever within region_df that fits the "within transect" criteria would be 1s in the output mask. To make this more efficient, we can first filter transect_df bracketed by:

  • region_bbox_left that is immediately before the first timestamp in da_Sv
  • region_bbox_right that is immediately after the last timestamp in da_Sv

and then iterating all rows in transect_df.

ctuguinay commented 9 months ago

@leewujung I incorporated all your previous suggestions. This is ready to be reviewed again.

leewujung commented 9 months ago

Also, please get to #126 also once this one is merged, so that we can release v0.1.1 soon. It'll be better for the label organization stuff to be able to install from main, rather than specific branches.

ctuguinay commented 9 months ago

@leewujung Made the corrections and added a few tests.

And you were right, there was a bug where

transect_df.loc[ transect_df["region_bbox_right"] > max_timestamp, "region_bbox_right" ]

and its partner code is empty and so idx min/max cannot be run, and I dealt with this case by taking the min/max idx of the original region bbox values.