feature-engine / feature_engine

Feature engineering package with sklearn like functionality
https://feature-engine.trainindata.com/
BSD 3-Clause "New" or "Revised" License
1.79k stars 304 forks source link

cyclical features may not always get the right period #765

Open solegalli opened 1 month ago

solegalli commented 1 month ago

CyclicalFeatures obtains the period by looking for the maximum value. So if months are coded from 1-12, the period will be 12 and that'll be fine.

For hours for example, if the hrs are coded from 1-24, it will get the period as 24 and that will also be fine, but if hrs are coded as 0-23, then the period will be set to 23 automatically, but in reality it is still 24.

Maybe we could use n_unique instead of max value?

In both cases there is still an issue to resolve: if the training set does not have the maximum value, or does not contain all elements, then the period will still be wrongly estimated, but I guess we could make that clear in the docs, I am not sure we can sort that out from our end.

solegalli commented 1 month ago

related to #578

solegalli commented 1 month ago

@glevv any thoughts on this? and would you be interested in trying a fix?

VascoSch92 commented 1 month ago

Hey,

We had more or less the same issue.

What we did is: you give an optional parameter period where you specify the period you want to use. If the period=None, you use as period the max of the unique values.

In general, a periodic function doesn't have always a period. But in the case encountered in real life it has always (example of periodic function without period are non-real world friendly).

glevv commented 1 month ago

The idea of adding period parameter seems OK to me. Should give more control to the user. Implementation could be a bit messy, though

solegalli commented 1 month ago

We have that option already. All happens with the same parameter max_values. If None, we pick the period with max values, alternatively, the user can pass a dictionary with the periods per variable. Maybe we should have named it periods instead of max_values thinking retrospectively, but it's probably a bit late for that.

Irrespective of that, I'd like to improve the auto detection of the period, if possible at all. Not sure it is. Because one user just used the default parameters, got different results between feature-engine and sklearn's implementation, and got confused. So trying to cater for the less experienced or those who not always read the docs. Plus I am not sure if the docs are clear enough, maybe we should explain better how the auto period is calculated. Thinking out loud.

VascoSch92 commented 1 month ago

Irrespective of that, I'd like to improve the auto detection of the period, if possible at all. Not sure it is. Because one user just used the default parameters, got different results between feature-engine and sklearn's implementation, and got confused. So trying to cater for the less experienced or those who not always read the docs. Plus I am not sure if the docs are clear enough, maybe we should explain better how the auto period is calculated. Thinking out loud.

I was reading the docstring of the Cyclical features transformer and I agree with you. The docs could be improved for sure :-)

glevv commented 1 month ago

We can add the following logic: When max_values is None set it to max(n_unique, max_val). If n_unique is lower than max_value - warn user that there is an inconsistency in the data and someone should look into it.

solegalli commented 1 month ago

That's a good idea. It would also work for features with negative values. Would you like to give it a go?

solegalli commented 1 month ago

Well, I am not sure if it takes care of negative values as well on second thoughts.

solegalli commented 1 month ago

Thinking further out loud here, and based on what @VascoSch92 discussed in #578:

if we have time from 0-23, 0 should map to 0, and 23 should map to 2*pi radians, because those are the limits of the sine and cosine function.

if the variable has 1-24 instead, then 1 should map to 0 and 24 to 2*radian. So, just dividing by 24 won't do it. We'd need to rescale as well.

Now, executing a few examples, there is not a huge difference between the output of the function when period is 23 or 24. But I think this needs to be clarified somewhere, particularly because sklearn docs just divides by 24, even when the variable is 0-23, and people will most likely trust sklearn over feature engine.

What's your thoughts on this? maybe we can resolve something before making changes. They might not be necessary.

VascoSch92 commented 1 month ago

Let take in consideration just the sin_transformer of sklearn given by the formula

\sin( \frac{x}{p}*2*\pi), \quad \text{where p is the period}

If you look you are doing two things here:

About the two examples:

if we have time from 0-23, 0 should map to 0, and 23 should map to 2pi radians, because those are the limits of the sine and cosine function. if the variable has 1-24 instead, then 1 should map to 0 and 24 to 2radian. So, just dividing by 24 won't do it. We'd need to rescale as well.

Actually, you are representing hours using different values. In the first example, midnight is 0 but in the second is 24. Now, they have the same period p=24, and you can compute that the sin_transformation give the same output for both (here sin_transformation(0) = sin_transformation(24)).

Why I'm saying that they have the same period? Because if you are at midnight (0 or 24) and you add 24, then you are still at midnight.

just divides by 24, even when the variable is 0-23, and people will most likely trust sklearn over feature engine. Yes because the period is 24 for both actually.

The big questions here is: how to find the period of the columns in the best correct way if the user don't give this information?

I hope my answer was clear :-)

solegalli commented 1 month ago

I am not sure I agree with that. sin_transfomer(0) will be 0, because x/24 is 0. And sin_transformer(24) will be something else, because 24/24 is one. In fact, I don't think it is about finding the period any more. I think it is about finding the best way to rescale the variable between 0 and 1, so that when we multiply by 2*pi radian, the variable max and min coincide with the limits of the sine and cosine functions

solegalli commented 1 month ago

I forgot to apply the sine to my calculation, maybe that's what you meant.

VascoSch92 commented 1 month ago

I forgot to apply the sine to my calculation, maybe that's what you meant.

Yes ;-) $sin(0)=sin(2*\pi)$

VascoSch92 commented 1 month ago

Actually to fix the transformer i believe you should take the number of unique values in the column (you can check with the above examples). With the supposition that every value in the range is at least one time present in the column. I think this is the best that we can make.

glevv commented 1 month ago

I don't think that we need to account for negative values in our calculations. The only thing we can do is to raise an error if we encounter a negative value in Cyclic Transformer. Other than that heuristic with max(n_unique, max_val) should work.

VascoSch92 commented 1 month ago

I don't think that we need to account for negative values in our calculations. The only thing we can do is to raise an error if we encounter a negative value in Cyclic Transformer. Other than that heuristic with max(n_unique, max_val) should work.

For the part of negative values, it is not a problem as sin and cos are defined on the real line. Just the period as to be positive. But i could not find periodic function with negative values in a real-life example (perhaps i lack of imaginations ;))

About your solution, i think it doesn't work. Take as an example the hours of the day represented in the two way

  1. From 0 to 23
  2. From 1 to 24

For 1 you will have period 23 but for 2 period 24, i.e, two different encoding for the same information.

VascoSch92 commented 1 month ago

Hey :-)

what is the status of that?

I can open a PR to fix the issue if you are interested ;-)