Open anates opened 1 year ago
Patch coverage: 87.50
% and project coverage change: -0.03
:warning:
Comparison is base (
e91e2a3
) 93.42% compared to head (55696eb
) 93.39%.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.
@anates - the combiner/simple calculator and your bin index issue makes it probably quite complex, so let me try to simplify the problem a bit:
as a start, can you implement a function energy_content_frequency_brackets
which gets all parmeters as keyword arguments and just produces a single return value (float)? Your function would look like
def energy_content_frequency_brackets(signal, with_normalization, bin_index):
Would that be possible? I think the function would look something like
def energy_content_frequency_brackets(signal, with_normalization, bin_index):
freq_vec = np.fft.rfftfreq(n=len(signal), d=(time_vec[1] - time_vec[0]))
fft_vec = np.abs(np.fft.rfft(signal.to_numpy()))
split_freq_vec = np.array_split(freq_vec, number_of_bins)
freq_entry = split_freq_vec[bin_index]
lower_frequency_index = find_nearest(array=freq_vec, value=freq_entry[0])[0]
upper_frequency_index: int = find_nearest(array=freq_vec, value=freq_entry[-1])[0]
fft_sum: float = np.sum(fft_vec[lower_frequency_index:upper_frequency_index])
if with_normalization:
fft_sum /= len(fft_vec)
return fft_sum
When I call this function for different bin_index
(0, ...) I should basically get the list of values you originally had.
If this works, we can go one step further and make the implementation efficient: instead of doing the FFT for every bin independently we do it for all bins once. This is how a "combiner" works (it is just for making the calculation more efficient). But first, lets make sure the implementation with a "simple" type works!
Hope this helps :) Sorry for all the delayed-ping-pong!
Note: in your current implementation, you also return the "low" and "high" in the feature name. As described above, this is not possible. However we can think of adding yet another feature calculator which just returns the high and one which just returns the low. But this can be a separate PR
Hei, I addressed the other points, but have some comments regarding the main implementation (dvs. your last comment):
Thanks!
Hi! To your first point: Would it work to add them as a second feature calculator? One that returns the values and one that returns the high (and one the low)?
To your second point: Yes indeed, that is the only option in the current way tsfresh is implemented. Except returning a statistics over the values as described above. Not saying this is the best way, but it is just the way tsfresh is currently working...
Hei,
thanks for the feedback! Based on that, would it be a solution to add an additional input parameter to this extraction function, to either return the bin value (default) for bin n
, labelled with _bin_n
, or return the high-/low-values for bin n
? For typical operation (i.e. feature selection) it would not make a difference, but if the user is interested in the exact boundaries, they can still call this function separately with the same inputs as before, and obtain the bin limits.
This would also reduce the amount of code duplication, else I would have to re-use the same code in two different extractors.
Concerning the second point: Assumed I understood your point correctly, the current implementation of returning n
bins is already working as intended? For it being a combiner I then would just have to re-write handling of the input params, to allow it being a list of dicts.
Thanks!
Based on that, would it be a solution to add an additional input parameter to this extraction function Good idea!
Very close to what you described is possible (and preferred as you said!). It would be ok to write a function which gets an additional parameter, but if it is a simple
function it should only return a single number (the feature name will be set automatically). Something like
def energy_content_frequency_brackets(signal, with_normalization, bin_index, return_mode) -> float:
if return_mode == "value":
return fft_sum # single float!
elif return_mode == "high":
return 42 # replace with actual number
elif return_mode == "low":
return 47 # replace with actual number
Feel free to use different names, this was just an example!
Assumed I understood your point correctly, the current implementation of returning n bins is already working as intended?
Maybe I missed your point, but the current solution as it is implemented right now in the code does not work as expected for tsfresh (again: it is not wrong, it just does not work with the way tsfresh expects feature calculators to work). Let me rephrase what I tried to explain in https://github.com/blue-yonder/tsfresh/pull/1024#issuecomment-1591926985.
If you implement a simple
calculator, which I would strongly encourage as the first implementation (as a combiner
is harder to implement and only is for speeding up the implementation which can be the second step), your calculator is only allowed to return a single number. Not a list and not a feature name - but really just a single number.
In the example I gave in the thread above, I only return a single of the bins, while the bin_index
is an additional parameter.
def energy_content_frequency_brackets(signal, with_normalization, bin_index) -> float:
(plus the return mode from the question above maybe)
Let me try to give another (maybe simpler) example: assuming you write a simple feature calculator which always returns the three largest number of a time series. You can not write the feature calculator like this
def my_calculator(data) -> List[float]:
data = list(reversed(sorted(data)))
return [data[0], data[1], data[2]]
and also not like this
def my_calculator(data) -> List[Tuple[str, float]]:
data = list(reversed(sorted(data)))
return [("largest_1", data[0]), ...]
but you need to write it like this
def my_calculator(data, which_largest) -> float:
data = list(reversed(sorted(data)))
return data[which_largest]
Work on this feature is on hold for the next weeks, but I'll take it up again after my break, if that's ok? Thanks!
Hei, I reverted the feature calculator to being a "simple" calculator as an initial proof of concept. I have not adjusted the testing-functions yet, but will do that as soon as the implementation gets closer to being final. However, is the current implementation closer to being useful for tsfresh? Thanks!
This feature allows the computation of the frequency energy content of the signal, distributed over n frequency bins. This simplifies the detection of dominant frequencies for different process stages.
This pull request is meant as initial step, to receive feedback (both wrt. code, implementation and intention of the feature) and to improve the code, before it can be merged.