feature-engine / feature_engine

Feature engineering package with sklearn like functionality
https://feature-engine.trainindata.com/
BSD 3-Clause "New" or "Revised" License

GaussianDistributionDiscretizer #385

Open Morgan-Sell opened 2 years ago

Morgan-Sell commented 2 years ago

Is your feature request related to a problem? Please describe. I'm developing a DigitalCommerceIndex class (in Python) that can rank U.S. counties on any continuous numeric variable. To do so, the class lets the user discretize the continuous variables through params such as index_scale (e.g., 1 to 5 or 1 to 10) and strategy, which references feature-engine's EqualFrequencyDiscretiser and EqualWidthDiscretiser.

I believe that EqualFrequencyDiscretiser is the most appropriate choice for 99.9% of problems.

What if I want to rank the U.S. counties on a Gaussian-like distribution? I presume about 68% of the counties will fall within one standard deviation of the average, so it would be most appropriate to assign them an average score.

Describe the solution you'd like. feature-engine offers a GaussianDistributionDiscretizer!

The number of bins can be 6 or 8, limiting the binning to 3 or 4 standard deviations on either side of the mean.

Describe alternatives you've considered. I thought about using feature-engine's ArbitraryDiscretiser. However, the ArbitraryDiscretiser requires a predefined dictionary, whereas the GaussianDistributionDiscretizer would adjust dynamically to the continuous variables.

Additional context It's kind of like StandardScaler discretization.
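One possible reading of this request, sketched under assumptions: standardize the variable and cut it at whole standard deviations, so that 6 bins cover ±3σ and 8 bins cover ±4σ. The function name `sigma_bins`, the illustrative data, and the choice to clip tail values into the outermost bins are all hypothetical; none of this is part of feature-engine.

```python
import math
from statistics import mean, stdev


def sigma_bins(values, n_bins=6):
    """Label each value 1..n_bins by the whole standard deviation it falls in.

    Hypothetical helper: with n_bins=6 the edges sit at -2, -1, 0, 1, 2 sigma,
    and anything beyond +/-3 sigma is clipped into the first or last bin.
    """
    half = n_bins // 2
    mu, sigma = mean(values), stdev(values)  # sample standard deviation
    labels = []
    for x in values:
        z = (x - mu) / sigma                 # z-score, as StandardScaler would give
        k = math.floor(z)                    # e.g. -1 for z in [-1, 0)
        k = max(-half, min(half - 1, k))     # clip extreme tails
        labels.append(k + half + 1)          # shift to labels 1..n_bins
    return labels


# Illustrative, evenly spaced data (an assumption, not real county data)
values = list(range(1, 25))
labels = sigma_bins(values, n_bins=6)
counts = {b: labels.count(b) for b in range(1, 7)}
print(counts)
```

With evenly spaced data like this, no value lies more than 2σ from the mean, so the two outermost bins stay empty. Cutting at fixed σ edges only yields bell-shaped bin counts when the variable itself is roughly Gaussian.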

solegalli commented 2 years ago

Hi @Morgan-Sell

I am being a bit slow perhaps, but I don't understand the logic that the new transformer should have.

Could you give an example with a toy input and a toy output df?

Or could you explain a bit more about what the input variables are and what an output variable should look like?

Morgan-Sell commented 2 years ago

Hi @solegalli,

No problem. Imagine the following pandas series:

var_A = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]

If we apply the EqualFrequencyDiscretiser with 6 bins, the transformer (conceptually) groups the values as in the following dictionary:

dct = {
    1: [1, 2, 3, 4],
    2: [5, 6, 7, 8],
    3: [9, 10, 11, 12],
    4: [13, 14, 15, 16],
    5: [17, 18, 19, 20],
    6: [21, 22, 23, 24],
}
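The grouping above can be reproduced with a minimal rank-based sketch. This is not feature-engine's implementation (the library works with quantile boundaries rather than ranks), and the helper name `equal_frequency_bins` is hypothetical.

```python
def equal_frequency_bins(values, q):
    """Label each value 1..q so that every bin holds about len(values) / q items.

    Hypothetical helper: ties are broken by sort order, which a
    boundary-based discretiser would handle differently.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # indices from smallest to largest
    labels = [0] * n
    for rank, idx in enumerate(order):
        labels[idx] = rank * q // n + 1                # integer arithmetic, 1-based label
    return labels


var_A = list(range(1, 25))
labels = equal_frequency_bins(var_A, q=6)

# Group the values by label to compare with the dictionary in the comment
dct = {}
for value, label in zip(var_A, labels):
    dct.setdefault(label, []).append(value)
print(dct)
```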

In contrast, the proposed GaussianDistributionDiscretizer would generate the following dictionary:

dct = {
    1: [1],
    2: [2, 3, 4],
    3: [5, 6, 7, 8, 9, 10, 11, 12],
    4: [13, 14, 15, 16, 17, 18, 19, 20],
    5: [21, 22, 23],
    6: [24],
}
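One way to arrive at this grouping, offered as a sketch rather than a settled design: size each bin by the share of a standard normal distribution that lies between consecutive whole standard deviations, then assign values to bins by rank. The function name `gaussian_rank_bins`, the default edges at -2, -1, 0, 1, 2σ, and the rounding rule are assumptions made for illustration.

```python
from statistics import NormalDist


def gaussian_rank_bins(values, z_edges=(-2, -1, 0, 1, 2)):
    """Label values 1..len(z_edges)+1 so that bin sizes follow Gaussian mass.

    Hypothetical helper: the values are ranked, and the rank cutoffs are the
    standard normal CDF at each edge multiplied by the number of values.
    Ties are broken by sort order.
    """
    n = len(values)
    std_normal = NormalDist()                          # mean 0, standard deviation 1
    cutoffs = [round(std_normal.cdf(z) * n) for z in z_edges]
    order = sorted(range(n), key=lambda i: values[i])  # indices from smallest to largest
    labels = [0] * n
    for rank, idx in enumerate(order):
        # The label is 1 plus the number of cutoffs this rank has reached
        labels[idx] = 1 + sum(rank >= c for c in cutoffs)
    return labels


var_A = list(range(1, 25))
labels = gaussian_rank_bins(var_A)

# Group the values by label to compare with the dictionary in the comment
dct = {}
for value, label in zip(var_A, labels):
    dct.setdefault(label, []).append(value)
print(dct)
```

Because the assignment is rank-based, the bin sizes (1, 3, 8, 8, 3, 1 for 24 values) depend only on the number of observations, not on how the underlying variable is distributed.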

My use case is to rank U.S. towns based on different continuous variables. I'll do so by applying the EqualFrequencyDiscretiser; therefore, the ranking follows a uniform distribution. Another approach is to design the ranking so that it follows a Gaussian distribution.

Given that many values follow a Gaussian distribution, I thought such an approach could be useful when discretizing continuous variables in which the order has significance.

solegalli commented 2 years ago

That clears things up, thank you!

By any chance, do you have a reference to share about where or how this method has been used?

If we incorporated this method, we would kind of need to point users to scenarios where using this procedure would be appropriate.

Morgan-Sell commented 2 years ago

I see this method being used when designing a normalized index ranking. It's similar to when teachers normalize students' grades, so the majority of students receive Cs (average) and the number of students who have more "extreme" scores decreases.

As mentioned, I'm considering developing this discretizer for an index. I would like the majority of U.S. towns to receive an average score.

I'm thinking one way to perform this transformation is to first make the variable approximately Gaussian with a log or Box-Cox transformation and then apply the EqualWidthDiscretiser. If the number of bins equals 6 or 8, wouldn't this more or less produce the desired outcome, i.e., a normalized index?
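A rough sketch of that two-step idea, using only the standard library: a log transform followed by equal-width binning. The synthetic right-skewed data and the helper name `equal_width_bins` are assumptions for illustration; in practice the two steps would be feature-engine transformers applied in sequence.

```python
import math
from statistics import NormalDist


def equal_width_bins(values, n_bins):
    """Label each value 1..n_bins using bins of equal width over [min, max].

    Hypothetical helper; assumes the values are not all identical.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum would fall just past the last bin, so clip it back in
    return [min(int((x - lo) / width), n_bins - 1) + 1 for x in values]


# Synthetic right-skewed data: evenly spaced quantiles of a log-normal distribution
n = 200
skewed = [math.exp(NormalDist().inv_cdf((i + 0.5) / n)) for i in range(n)]

logged = [math.log(x) for x in skewed]           # step 1: make it roughly Gaussian
labels = equal_width_bins(logged, n_bins=6)      # step 2: equal-width bins
counts = [labels.count(b) for b in range(1, 7)]
print(counts)
```

On this synthetic data, the counts rise toward the two middle bins and fall off toward the tails, which is the bell-shaped index described above. How well this works on real data depends on how close the transformed variable is to Gaussian.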

solegalli commented 2 years ago

Thank you @Morgan-Sell

Excuse my ignorance here: would this feature be intended for analysis, or would you train a machine learning model with a feature like this?

Morgan-Sell commented 2 years ago

My use case is for analysis. Does feature-engine support such transformations? ;)

I'm unsure how others would use it. I've never used it in an ML scenario. At the same time, I rarely discretize variables when designing ML models. I should probably fix that habit ;)

solegalli commented 2 years ago

At the moment, we have tailored Feature-engine's transformations to those that you would / could use to train machine learning models.

Having said this, I am considering adding a module for analysis. So, I'd say, let's wait a bit and see if this issue gathers support from the community. When we are ready to incorporate the analysis module, we can see how to make it fit.