Feature/calculate fft energy blocks

anates commented 1 year ago

This feature allows the computation of the frequency energy content of the signal, distributed over n frequency bins. This simplifies the detection of dominant frequencies for different process stages.

This pull request is meant as initial step, to receive feedback (both wrt. code, implementation and intention of the feature) and to improve the code, before it can be merged.

codecov-commenter commented 1 year ago

Codecov Report

Patch coverage: 87.50% and project coverage change: -0.03 :warning:

Comparison is base (e91e2a3) 93.42% compared to head (55696eb) 93.39%.

Additional details and impacted files

```diff @@ Coverage Diff @@ ## main #1024 +/- ## ========================================== - Coverage 93.42% 93.39% -0.03% ========================================== Files 20 20 Lines 1886 1909 +23 Branches 371 375 +4 ========================================== + Hits 1762 1783 +21 Misses 85 85 - Partials 39 41 +2 ``` | [Impacted Files](https://app.codecov.io/gh/blue-yonder/tsfresh/pull/1024?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=blue-yonder) | Coverage Δ | | |---|---|---| | [tsfresh/feature\_extraction/settings.py](https://app.codecov.io/gh/blue-yonder/tsfresh/pull/1024?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=blue-yonder#diff-dHNmcmVzaC9mZWF0dXJlX2V4dHJhY3Rpb24vc2V0dGluZ3MucHk=) | `100.00% <ø> (ø)` | | | [tsfresh/feature\_extraction/feature\_calculators.py](https://app.codecov.io/gh/blue-yonder/tsfresh/pull/1024?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=blue-yonder#diff-dHNmcmVzaC9mZWF0dXJlX2V4dHJhY3Rpb24vZmVhdHVyZV9jYWxjdWxhdG9ycy5weQ==) | `92.43% <87.50%> (-0.04%)` | :arrow_down: |

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

nils-braun commented 1 year ago

@anates - the combiner/simple calculator and your bin index issue makes it probably quite complex, so let me try to simplify the problem a bit: as a start, can you implement a function energy_content_frequency_brackets which gets all parmeters as keyword arguments and just produces a single return value (float)? Your function would look like

def energy_content_frequency_brackets(signal, with_normalization, bin_index):

Would that be possible? I think the function would look something like


def energy_content_frequency_brackets(signal, with_normalization, bin_index):
    freq_vec = np.fft.rfftfreq(n=len(signal), d=(time_vec[1] - time_vec[0]))
    fft_vec = np.abs(np.fft.rfft(signal.to_numpy()))
    split_freq_vec = np.array_split(freq_vec, number_of_bins)

    freq_entry = split_freq_vec[bin_index]

    lower_frequency_index = find_nearest(array=freq_vec, value=freq_entry[0])[0]
    upper_frequency_index: int = find_nearest(array=freq_vec, value=freq_entry[-1])[0]
    fft_sum: float = np.sum(fft_vec[lower_frequency_index:upper_frequency_index])
    if with_normalization:
        fft_sum /= len(fft_vec)

    return fft_sum

When I call this function for different bin_index (0, ...) I should basically get the list of values you originally had.

If this works, we can go one step further and make the implementation efficient: instead of doing the FFT for every bin independently we do it for all bins once. This is how a "combiner" works (it is just for making the calculation more efficient). But first, lets make sure the implementation with a "simple" type works!

Hope this helps :) Sorry for all the delayed-ping-pong!

Note: in your current implementation, you also return the "low" and "high" in the feature name. As described above, this is not possible. However we can think of adding yet another feature calculator which just returns the high and one which just returns the low. But this can be a separate PR

anates commented 1 year ago

Hei, I addressed the other points, but have some comments regarding the main implementation (dvs. your last comment):

The "low" and "high"-part of my return string intend to inform the user where the boundaries of that bin are. At least for my understanding those values are quite important, to understand my signal better. Are there other ways of returning those boundaries to the user, or is that on their side to identify them (outside the feature-extractor)?
I agree with you that having a variable feature name might not be the best approach, but the number of bins will always be the same, independently of the data length, as it is based on a user-defined input. This corresponds to (for example) the coefficient-approach which always returns the first 100 coefficients, independently of the data itself. Would that still be ok?

Thanks!

nils-braun commented 1 year ago

Hi! To your first point: Would it work to add them as a second feature calculator? One that returns the values and one that returns the high (and one the low)?

To your second point: Yes indeed, that is the only option in the current way tsfresh is implemented. Except returning a statistics over the values as described above. Not saying this is the best way, but it is just the way tsfresh is currently working...

anates commented 1 year ago

Hei, thanks for the feedback! Based on that, would it be a solution to add an additional input parameter to this extraction function, to either return the bin value (default) for bin n, labelled with _bin_n, or return the high-/low-values for bin n? For typical operation (i.e. feature selection) it would not make a difference, but if the user is interested in the exact boundaries, they can still call this function separately with the same inputs as before, and obtain the bin limits.

This would also reduce the amount of code duplication, else I would have to re-use the same code in two different extractors.

Concerning the second point: Assumed I understood your point correctly, the current implementation of returning n bins is already working as intended? For it being a combiner I then would just have to re-write handling of the input params, to allow it being a list of dicts.

Thanks!

nils-braun commented 1 year ago

Based on that, would it be a solution to add an additional input parameter to this extraction function Good idea!

Very close to what you described is possible (and preferred as you said!). It would be ok to write a function which gets an additional parameter, but if it is a simple function it should only return a single number (the feature name will be set automatically). Something like

def energy_content_frequency_brackets(signal, with_normalization, bin_index, return_mode) -> float:
   if return_mode == "value":
      return fft_sum # single float!
   elif return_mode == "high":
      return 42 # replace with actual number
   elif return_mode == "low":
      return 47 # replace with actual number

Feel free to use different names, this was just an example!

Assumed I understood your point correctly, the current implementation of returning n bins is already working as intended?

Maybe I missed your point, but the current solution as it is implemented right now in the code does not work as expected for tsfresh (again: it is not wrong, it just does not work with the way tsfresh expects feature calculators to work). Let me rephrase what I tried to explain in https://github.com/blue-yonder/tsfresh/pull/1024#issuecomment-1591926985. If you implement a simple calculator, which I would strongly encourage as the first implementation (as a combiner is harder to implement and only is for speeding up the implementation which can be the second step), your calculator is only allowed to return a single number. Not a list and not a feature name - but really just a single number. In the example I gave in the thread above, I only return a single of the bins, while the bin_index is an additional parameter.

def energy_content_frequency_brackets(signal, with_normalization, bin_index) -> float:

(plus the return mode from the question above maybe)

Let me try to give another (maybe simpler) example: assuming you write a simple feature calculator which always returns the three largest number of a time series. You can not write the feature calculator like this

def my_calculator(data) -> List[float]:
   data = list(reversed(sorted(data)))
   return [data[0], data[1], data[2]]

and also not like this

def my_calculator(data) -> List[Tuple[str, float]]:
   data = list(reversed(sorted(data)))
   return [("largest_1", data[0]), ...]

but you need to write it like this

def my_calculator(data, which_largest) -> float:
   data = list(reversed(sorted(data)))
   return data[which_largest]

anates commented 1 year ago

Work on this feature is on hold for the next weeks, but I'll take it up again after my break, if that's ok? Thanks!

anates commented 1 year ago

Hei, I reverted the feature calculator to being a "simple" calculator as an initial proof of concept. I have not adjusted the testing-functions yet, but will do that as soon as the implementation gets closer to being final. However, is the current implementation closer to being useful for tsfresh? Thanks!

blue-yonder / tsfresh

Feature/calculate fft energy blocks #1024

Codecov Report