csinva / imodels

Interpretable ML package 🔍 for concise, transparent, and accurate predictive modeling (sklearn-compatible).
https://csinva.io/imodels
MIT License
1.35k stars, 120 forks

Rulefit with categories + multi-column problems #117

Closed. avraam-inside closed this issue 2 years ago

avraam-inside commented 2 years ago

Hi!

Introduction

While exploring what RuleFit can do (via imodels), some questions came up. I would be happy to discuss them or get hints.

First I will describe the dataframe and the code, then show the results, followed by my questions about them.


Input data

A data.csv file has been generated in which the bulk of the cases (oil supply processes) last 4-6 hours (case_duration ~20,000), and there are abnormal cases that last about 1.5 days (case_duration > 100,000).

The goal is to find out what causes this anomaly.

Here are the conditions that cause the long duration: image (this can be seen even by inspecting the table by eye).

So the input is an event log (it contains only the columns that had at least a minimal impact, which was calculated separately earlier), the conditions that cause the anomaly are known, and it is clear what the output should look like.


CODE: Sending an eventlog to rulefit

Here is the Jupyter notebook with the code (change the file extension to .ipynb), and here is data.csv again.

Here is the result: image

As you can see, it does not quite meet expectations.

Questions:

1) The first thing that catches the eye is the 0.5 thresholds with inequality signs pointing in both directions. Why do they appear in the output? After one-hot encoding, the columns contain only 0 and 1, a pure category. The algorithm can do without them, by the way. Example of a rule from a simple eventlog: image (everything is right here, there is nothing to find fault with).

Is there any way to tell the algorithm not to generate these numbers for category columns?

2) While the true rules contain 5 to 9 conditions, the algorithm returns rules with only 3-4 conditions, with an incorrect answer and a large coefficient...

3) Based on the points above, are there any ideas on how to configure the algorithm so that it returns the correct set of rules?
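A note on where the 0.5 in question 1 probably comes from. The sketch below assumes the rules are extracted from decision trees that put each split threshold at the midpoint between two adjacent distinct feature values, which is how scikit-learn's trees behave. It is a toy split search in plain Python, not the imodels implementation, and the column name and values are hypothetical.

```python
def candidate_thresholds(values):
    """Midpoints between adjacent distinct values (CART-style candidates)."""
    distinct = sorted(set(values))
    return [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]


def sum_squared_error(targets):
    """Squared error of the targets around their own mean."""
    if not targets:
        return 0.0
    mean = sum(targets) / len(targets)
    return sum((y - mean) ** 2 for y in targets)


def best_split(values, targets):
    """Threshold that minimises the summed squared error of both sides."""
    best = None
    for threshold in candidate_thresholds(values):
        left = [y for x, y in zip(values, targets) if x <= threshold]
        right = [y for x, y in zip(values, targets) if x > threshold]
        score = sum_squared_error(left) + sum_squared_error(right)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best[0] if best else None


# Hypothetical one-hot column and durations, for illustration only.
is_pump_b = [0, 0, 1, 1, 0, 1]
case_duration = [20000, 21000, 120000, 118000, 19500, 125000]

print(candidate_thresholds(is_pump_b))       # [0.5]
print(best_split(is_pump_b, case_duration))  # 0.5
```

Under that assumption, a column holding only 0 and 1 has exactly one candidate threshold, 0.5, so "column <= 0.5" is just another way of writing "column == 0" and "column > 0.5" of writing "column == 1".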


P.S.:

1) About the documentation: the examples mostly use multi-format arrays. Why? In my experience, almost all practical cases are based on tables (pandas), because they are more convenient.

2) About the ConvergenceWarning message: "Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 0.000e+00, tolerance: 0.000e+00". There are two points. First, why is this message printed n times, clogging up the console, rather than once? Second, which specific parameters of the RuleFitRegressor I submitted should be increased?
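For the console noise only, one possible workaround is to silence the warning by its text using Python's standard warnings module. This is a minimal sketch: fit_quietly and fake_fit are hypothetical names, and the filter hides the message without addressing why the solver did not converge. The message argument is a regular expression that must match the start of the warning text.

```python
import warnings


def fit_quietly(fit, *args, **kwargs):
    """Call fit(*args, **kwargs) with 'did not converge' warnings hidden."""
    with warnings.catch_warnings():
        # Filters added here are discarded when the with-block exits.
        warnings.filterwarnings("ignore", message="Objective did not converge")
        return fit(*args, **kwargs)


def fake_fit():
    """Hypothetical stand-in for a model's fit method that warns repeatedly."""
    for _ in range(3):
        warnings.warn("Objective did not converge. Duality gap: 0.000e+00")
    return "fitted"


result = fit_quietly(fake_fit)  # nothing is printed to the console
```

With a real model, the call would look like fit_quietly(model.fit, X, y). Other warnings still pass through, because the filter matches only this message.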

avraam-inside commented 2 years ago

Hi!

We found out that RuleFitRegressor returns a minimal working set of rules that is enough to filter the DataFrame and get the same set of abnormal cases. My expectation of a complete set of columns with their values was mistaken.
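To illustrate what "filter with a rule" means here, below is a small sketch in plain Python. It assumes a rule is a string of simple comparisons joined by " and ", such as "is_pump_b > 0.5 and shift_night <= 0.5". The rule text, column names and rows are hypothetical, and the exact rule format produced by the library may differ.

```python
import operator
import re

OPERATORS = {"<=": operator.le, ">=": operator.ge,
             "<": operator.lt, ">": operator.gt}
CONDITION = re.compile(r"^\s*(\w+)\s*(<=|>=|<|>)\s*(-?\d+(?:\.\d+)?)\s*$")


def rule_matches(rule, row):
    """True if the row (a dict) satisfies every condition in the rule."""
    for condition in rule.split(" and "):
        match = CONDITION.match(condition)
        if match is None:
            raise ValueError(f"Unsupported condition: {condition!r}")
        name, op, threshold = match.groups()
        if not OPERATORS[op](row[name], float(threshold)):
            return False
    return True


def filter_rows(rule, rows):
    """Keep only the rows that satisfy the rule."""
    return [row for row in rows if rule_matches(rule, row)]


# Hypothetical data, for illustration only.
rows = [
    {"is_pump_b": 1, "shift_night": 0, "case_duration": 120000},
    {"is_pump_b": 0, "shift_night": 0, "case_duration": 20000},
    {"is_pump_b": 1, "shift_night": 1, "case_duration": 21000},
]
abnormal = filter_rows("is_pump_b > 0.5 and shift_night <= 0.5", rows)
print([row["case_duration"] for row in abnormal])  # [120000]
```

With a pandas DataFrame and column names that are valid Python identifiers, DataFrame.query can usually evaluate the same kind of expression directly.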

The question of 0.5 for categorical columns remains, but I can live with it in the current implementation. I will write a regular expression that removes these <= 0.5 and > 0.5 thresholds for categories.
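Such a regular expression could look like the sketch below. It assumes rules are plain strings like "is_pump_b > 0.5 and shift_night <= 0.5" and that column names consist only of letters, digits and underscores. simplify_rule and the column names are hypothetical, not part of imodels.

```python
import re

LE_HALF = re.compile(r"(\w+)\s*<=\s*0\.5\b")
GT_HALF = re.compile(r"(\w+)\s*>\s*0\.5\b")


def simplify_rule(rule, binary_columns):
    """Rewrite 0.5 thresholds on known 0/1 columns as equality tests."""

    def replace_with(value):
        def substitute(match):
            name = match.group(1)
            if name in binary_columns:
                return f"{name} == {value}"
            return match.group(0)  # leave non-binary columns untouched
        return substitute

    rule = LE_HALF.sub(replace_with(0), rule)  # x <= 0.5 means x == 0
    return GT_HALF.sub(replace_with(1), rule)  # x > 0.5 means x == 1


rule = "is_pump_b > 0.5 and shift_night <= 0.5 and pressure > 0.5"
print(simplify_rule(rule, {"is_pump_b", "shift_night"}))
# is_pump_b == 1 and shift_night == 0 and pressure > 0.5
```

Passing the set of one-hot columns explicitly keeps a genuine numeric threshold of 0.5 on a continuous column from being rewritten by mistake.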

At the moment the question is not very pressing. I may close it later if we don't dig up something else.

csinva commented 2 years ago

Duplicate of #77 (supporting categorical features).