Trusted-AI / AIX360

Interpretability and explainability of data and machine learning models
https://aix360.res.ibm.com/
Apache License 2.0

Ripper rule induction algorithm treats timestamp type features as categorical #164

Open kmyusk opened 1 year ago

kmyusk commented 1 year ago

The Ripper algorithm recognizes timestamp features (e.g. 2022-06-14-19.39.35.929641) as integers and therefore encodes them as categorical features. The resulting rules are expressed as equality predicates (e.g. timestamp == 2022-06-14-19.39.35.929641) instead of the intervals/inequalities one would expect.

Proper timestamp type support for Ripper would be nice.

wucahngxi commented 7 months ago

The RIPPER (Repeated Incremental Pruning to Produce Error Reduction) algorithm is a rule-based machine learning algorithm used for classification tasks. It extends the earlier IREP (Incremental Reduced Error Pruning) algorithm and is particularly useful for dealing with noisy data.

Regarding timestamp features, the behavior described above follows from how the column is typed. RIPPER builds equality conditions on nominal (categorical) attributes and threshold conditions (<=, >=) on numeric ones, so a timestamp that is not presented to the learner as a number ends up treated as categorical, with one category per distinct value.

Here's a typical process when using RIPPER with timestamp features:

1. Discretization: Continuous features like timestamps are often discretized into intervals or categories before rule induction, which gives the rule learner a small set of values to build conditions on.
2. Rule Generation: RIPPER generates rules based on the prepared features. Each rule consists of a condition and a class label associated with that condition. The conditions are typically based on ranges or specific values of those features.
3. Rule Pruning: The algorithm then goes through a pruning process, where conditions and rules that don't contribute significantly to classification accuracy are removed. This helps prevent overfitting.

It's important to note that the choice of discretization method for timestamps can impact the performance of the algorithm. Common methods include binning timestamps into intervals, such as days of the week, time of day, or specific date ranges.
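As a rough illustration of the binning idea, here is a minimal pandas sketch. It assumes the timestamps are strings in the format quoted in the issue; the sample values, the column names, and the four time-of-day buckets are made up for the example and are not part of AIX360.

```python
import pandas as pd

# Hypothetical sample values in the timestamp format quoted in the issue.
raw = pd.Series(["2022-06-14-19.39.35.929641", "2022-06-18-08.05.00.000000"])

# Parse the strings into real datetimes using an explicit format.
ts = pd.to_datetime(raw, format="%Y-%m-%d-%H.%M.%S.%f")

# Discretize into a handful of categories a rule learner can test with equality.
binned = pd.DataFrame({
    "day_of_week": ts.dt.day_name(),
    # Illustrative buckets: [0, 6) night, [6, 12) morning,
    # [12, 18) afternoon, [18, 24) evening.
    "time_of_day": pd.cut(
        ts.dt.hour,
        bins=[0, 6, 12, 18, 24],
        right=False,
        labels=["night", "morning", "afternoon", "evening"],
    ).astype(str),
})

print(binned)
```

With bins like these, an equality rule such as time_of_day == evening covers a whole range of timestamps rather than a single instant.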

If you have timestamps in your dataset and you want to treat them differently (e.g., capturing temporal patterns or trends), you might need to preprocess the data accordingly. This could involve feature engineering to extract relevant information from timestamps or using a different algorithm that can better handle temporal patterns.
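One possible form of that preprocessing is to convert the timestamp into numeric columns before training, so that a learner with numeric support can form inequality conditions on them. The sketch below uses only pandas; the helper name `timestamp_to_numeric`, the derived column names, and the sample data are hypothetical, and the format string assumes the timestamp layout quoted in the issue. Passing the result to the AIX360 Ripper implementation is not shown here.

```python
import pandas as pd

# Assumed layout of the timestamps quoted in the issue.
ISSUE_FORMAT = "%Y-%m-%d-%H.%M.%S.%f"


def timestamp_to_numeric(df, column, fmt=ISSUE_FORMAT):
    """Replace a string timestamp column with numeric features (hypothetical helper)."""
    out = df.copy()
    ts = pd.to_datetime(out[column], format=fmt)
    # Seconds since the Unix epoch as a float, so thresholds are meaningful.
    out[column + "_epoch_s"] = (ts - pd.Timestamp("1970-01-01")) / pd.Timedelta(seconds=1)
    # Cyclical components that often carry temporal patterns.
    out[column + "_hour"] = ts.dt.hour
    out[column + "_dayofweek"] = ts.dt.dayofweek  # Monday=0 ... Sunday=6
    # Drop the original string column so it is not encoded as categorical.
    return out.drop(columns=[column])


# Hypothetical sample data.
df = pd.DataFrame({
    "timestamp": ["2022-06-14-19.39.35.929641", "2022-06-18-08.05.00.000000"],
    "amount": [10.0, 25.5],
})
features = timestamp_to_numeric(df, "timestamp")
print(features.dtypes)
```

Keeping the epoch value alongside the hour and day-of-week columns lets rules express both absolute cutoffs ("after a given date") and recurring patterns ("in the evening").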

In summary, RIPPER treats timestamp features as categorical unless they are converted or discretized beforehand, and if you need to capture temporal information more effectively, additional preprocessing or the use of other algorithms may be necessary.