david-cortes / isotree

(Python, R, C/C++) Isolation Forest and variations such as SCiForest and EIF, with some additions (outlier detection + similarity + NA imputation)
https://isotree.readthedocs.io
BSD 2-Clause "Simplified" License

How to re-train when a few values are marked as anomalies when they should not have been? #47

Closed. seperman closed this issue 1 year ago.

seperman commented 1 year ago

Hello. Let's say we have an array of size n. A few items are marked as anomalies that should not have been. How do you recommend refitting the model so that those items are not marked as anomalies in the future? I considered extending the array with X copies of those items and re-training. Is that the right approach? If so, what is the optimal value for X?

Example (columnar data):

array = [1, 4, 3, 15, ...]
15 is marked as an anomaly.
We copy 15 multiple times.
array = [1, 4, 3, 15, 15, 15 ...]
And fit the model again.

If it matters, here are the parameters I'm using:

IsolationForest(
    ndim=1, ntrees=100,
    penalize_range=False,
    prob_pick_pooled_gain=0,
    missing_action="impute",    # dealing with None values
    new_categ_action="impute",  # dealing with new categories
)
seperman commented 1 year ago

I went ahead and used the Titanic dataset, then added the ages 232, 222, 199 as non-anomalies: I repeated those numbers and appended them to the age column. Now it gives age 20000000000 the same score as ages 10 and 232. Why?

[Screenshot from 2023-03-21 23-27-08]

seperman commented 1 year ago

Instead of repeating those new numbers exactly, I sampled them from a normal distribution. It still recognizes 100 as an anomaly, but not 20000000000: [Screenshot from 2023-03-22 00-06-25]

david-cortes commented 1 year ago
  1. This is an unsupervised method, so it has no concept of labels. The only things you can do in that regard are adjusting sample weights, as you mention, and adjusting the fitting parameters to be more suitable for isolating the anomalies in your data.

  2. and 3. This software is based on decision trees. You can read more about the algorithm in the references.