Improved input syntax for `frequency_differencing` operation

praneethratna commented 1 year ago

Addresses #967. Now input for frequency_differencing can be given directly in the form of frequency differencing criteria instead of separating out freq/chan from operators.

CC @leewujung

emiliom commented 1 year ago

These improvements are very cool!

Since you're adding sympy as a dependency, how "heavy" a library is it? In other words, is it likely to create dependency resolution issues? I see the dependency tree sympy > mpmath > gmpy2, where gmpy2 apparently uses a C library ("gmpy2 is an optimized, C-coded Python extension module ...").

Are there lighter-weight alternatives to sympy that would meet our needs? sympy has tons of capabilities, and here I think we'd be using only a tiny subset.

praneethratna commented 1 year ago

These improvements are very cool!

Since you're adding sympy as a dependency, how "heavy" a library is it? In other words, is it likely to create dependency resolution issues? I see the dependency tree sympy > mpmath > gmpy2, where gmpy2 apparently uses a C library ("gmpy2 is an optimized, C-coded Python extension module ...").

I feel SymPy is lightweight compared to other mathematical libraries found online. Also SymPy has a hard dependency only on mpmath and I don't think it would create dependency resolution issues.

Are there lighter-weight alternatives to sympy that would meet our needs? sympy has tons of capabilities, and here I think we'd be using only a tiny subset.

I'm using SymPy for parsing numbers from strings using it's inbuilt parser. I'm not sure of any other libraries which can be used to do this, but there might be some libraries to do this task exclusively?

emiliom commented 1 year ago

Also SymPy has a hard dependency only on mpmath but I don't it would create dependency resolution issues.

Right! Scratch my comment about dependency resolution per se. I did see that the dependency was limited to mpmath > gmpy2, and we don't currently use those libraries. My apologies.

praneethratna commented 1 year ago

Right! Scratch my comment about dependency resolution per se. I did see that the dependency was limited to mpmath > gmpy2, and we don't currently use those libraries. My apologies.

No worries! I understand the clarification now. Since we don't use mpmath or gmpy2 libraries, these dependencies doesn't affect us. But we need to look at alternatives for SymPy if that makes the package slow.

leewujung commented 1 year ago

Since the equations are pretty predictable, one option would be to just do re all the way?

praneethratna commented 1 year ago

Since the equations are pretty predictable, one option would be to just do re all the way?

@leewujung Do you mean we use a single regex expression to check for the validity of the whole frequency differencing criteria? Wouldn't that make the code a bit messy?

leewujung commented 1 year ago

@leewujung Do you mean we use a single regex expression to check for the validity of the whole frequency differencing criteria? Wouldn't that make the code a bit messy?

I think it would be ok? For the freqAB case, the Hz/kHz/MHz/GHz cases are fixed, so the number in front of that could be extracted. For the chanAB case, it is the string between the double quotes. The binary comparison operator >, <, >=, <=, == are the allowed cases. And then we need to extracted the number before dB. This doesn't seem overly complicated?

praneethratna commented 1 year ago

I think it would be ok? For the freqAB case, the Hz/kHz/MHz/GHz cases are fixed, so the number in front of that could be extracted. For the chanAB case, it is the string between the double quotes. The binary comparison operator >, <, >=, <=, == are the allowed cases. And then we need to extracted the number before dB. This doesn't seem overly complicated?

@leewujung I think it shouldn't be overly complicated, it's just that i split everything into individual parts for better code debugging. I'll make these changes by today.

praneethratna commented 1 year ago

@leewujung Even if use a single re expression to check for validity, we again need to use separate re expressions to extract freq/chan values, operators and diff or am i understanding your point in a wrong way?

leewujung commented 1 year ago

Hi @praneethratna : sorry for my super late response. What I am thinking in terms of using re to grab out the pattern is like the following:

use patterns like "(?P<freqA>\d+)(?P<unitA>\w?)Hz" to match freqA and freqB, for example:
```
a = re.match("(?P<freq>\d+)\s*(?P<unit>\w?)Hz", "120kHz") # also matches arbitrary space between "120" and "kHz"
a["freq"]  # get the number associated with the frequency
a["unit"]  # get the unit associated with the frequency
```
then you can use a["unit"] to determine the right multiplication factor to get to the numbers in frequency_nominal, which is with units Hz (i.e. 120kHz is 120000 in frequency_nominal).

you can use the pattern above to build up the whole pattern, something like the following:

freqA_pattern = "(?P<freqA>\d+)\s*(?P<unitA>\w?)Hz"
freqA_pattern = "(?P<freqB>\d+)\s*(?P<unitB>\w?)Hz"
db_pattern = "(?P<db>\d+)\s*dB"
cmp_pattern = "\s*(?P<cmp>\S+?)\s*"
DIFF_MATCHER = re.compile(freqA_pattern + "\s*-\s*" + freqB_pattern + cmp_pattern + db_pattern)

and then use outputs of the matched results with some error checking and handling (like converting str to float, k to 1000 etc) to combine with your other parts of the code.

What do you think?

praneethratna commented 1 year ago

Hi @praneethratna : sorry for my super late response. What I am thinking in terms of using re to grab out the pattern is like the following:
use patterns like "(?P<freqA>\d+)(?P<unitA>\w?)Hz" to match freqA and freqB, for example:
a = re.match("(?P<freq>\d+)\s*(?P<unit>\w?)Hz", "120kHz") # also matches arbitrary space between "120" and "kHz"
a["freq"]  # get the number associated with the frequency
a["unit"]  # get the unit associated with the frequency
then you can use a["unit"] to determine the right multiplication factor to get to the numbers in frequency_nominal, which is with units Hz (i.e. 120kHz is 120000 in frequency_nominal).
you can use the pattern above to build up the whole pattern, something like the following:
freqA_pattern = "(?P<freqA>\d+)\s*(?P<unitA>\w?)Hz"
freqA_pattern = "(?P<freqB>\d+)\s*(?P<unitB>\w?)Hz"
db_pattern = "(?P<db>\d+)\s*dB"
cmp_pattern = "\s*(?P<cmp>\S+?)\s*"
DIFF_MATCHER = re.compile(freqA_pattern + "\s*-\s*" + freqB_pattern + cmp_pattern + db_pattern)
and then use outputs of the matched results with some error checking and handling (like converting str to float, k to 1000 etc) to combine with your other parts of the code.
What do you think?

@leewujung No worries! I understand it now, these suggestions helped me to refactor the code in a much simpler and clean way.

leewujung commented 1 year ago

@praneethratna : there are still places where you use sympy in the code. Could you remove them completely and use re instead? Thanks!

praneethratna commented 1 year ago

@leewujung I have removed the usage of sympy completely and also squashed all the commits into one.

leewujung commented 1 year ago

Also, I caught a bug / undesired behavior in the current _check_source_Sv_freq_diff function. In the current implementation, the values of frequency_nominal is not allowed to be duplicated: https://github.com/OSOceanAcoustics/echopype/blob/254ee106eb580959ca85f2131168c4f17c0a285a/echopype/mask/api.py#L463-L467

However, it is possible that there are 2 channels (with different channel strings) that have the same frequency_nominal values (this is actually why we decided to use channel as the coordinate, instead of frequency_nominal, since coordinates cannot have duplicated values), and users may want to diff these 2 channels using the chanABEq. I think that potentially duplicated frequency_nominal values should only be checked if people use freqABEq to put in their equation.

Looking at the code again, what do you think if we take the following approach to change this function?

keep the first checks # check that channel and frequency nominal are in source_Sv intact
for the below three checks, make it such that only freqAB checks are executed when people use freqABEq, and only chanAB checks are executed when people use chanABEq
```
# make sure that the channel and frequency_nominal values are not repeated in source_Sv
# check that the elements of freqAB are in frequency_nominal
# check that the elements of chanAB are in channel
```
ie we only check for frequency_nominal to not be duplicated and elements of freqAB are in frequency_nominal when people use freqABEq, and only check for channel to not be duplicated and elements of chanAB are in channel when people use chanABEq.

leewujung commented 1 year ago

Ugh sorry just remember one thing. In here https://github.com/OSOceanAcoustics/echopype/blob/254ee106eb580959ca85f2131168c4f17c0a285a/echopype/tests/utils/test_processinglevels_integration.py#L106 frequency_differencing is called (this is something the CI does not catch, unless one uses [all tests ci] (see here). Could you quickly change it? Then I will merge. Thank you!

codecov-commenter commented 1 year ago

Codecov Report

Merging #1106 (d8b8b26) into dev (6fc5a1d) will decrease coverage by 9.44%. Report is 53 commits behind head on dev. The diff coverage is 93.75%.

@@            Coverage Diff             @@
##              dev    #1106      +/-   ##
==========================================
- Coverage   78.12%   68.69%   -9.44%     
==========================================
  Files          65        8      -57     
  Lines        6227      658    -5569     
==========================================
- Hits         4865      452    -4413     
+ Misses       1362      206    -1156

Flag	Coverage Δ
unittests	`68.69% <93.75%> (-9.44%)`	:arrow_down:

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed	Coverage Δ
echopype/mask/freq_diff.py	`93.44% <93.44%> (ø)`
echopype/mask/api.py	`91.66% <100.00%> (+4.16%)`	:arrow_up:

... and 63 files with indirect coverage changes

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

OSOceanAcoustics / echopype

Improved input syntax for `frequency_differencing` operation #1106

Codecov Report