david-cortes / isotree

(Python, R, C/C++) Isolation Forest and variations such as SCiForest and EIF, with some additions (outlier detection + similarity + NA imputation)
https://isotree.readthedocs.io
BSD 2-Clause "Simplified" License

R crashes when running isolation.forest; ndim possibly the issue #3

Closed tararae7 closed 4 years ago

tararae7 commented 4 years ago

Hi, I believe I have narrowed it down to the ndim parameter. Each time I run isolation.forest with the ndim parameter > 1, the R session aborts immediately. Can you please help me understand why this happens?

david-cortes commented 4 years ago

Thanks for the bug report. I just realized that the R version doesn't make any checks on whether ndim is less than or equal to the number of columns in the data, which I'll add now.

Are you trying to use an ndim larger than the number of columns? If not, could you please provide more information:

tararae7 commented 4 years ago

Hi David,

No, I am not trying to use ndim larger than the number of columns.

Script I tried to run (it works if I change ndim from 2 to 1):

test100 <- isolation.forest(df[,c("variable1","variable2")], ntrees=100, sample_size=256, ndim=2, nthreads=1, prob_pick_pooled_gain=1, prob_pick_avg_gain=0)

tararae7 commented 4 years ago

Not sure if this is the best place to ask this question. I am trying to compare this package to another isolation forest package. Is this comparable to other isolation forest packages? I would rather use this package because it has many more features, but I want to find a way to do a check.

david-cortes commented 4 years ago

Ok, managed to reproduce the crash when both of the columns are factors and there are no numerical columns. Will investigate it. But could you please also add some lines to generate the data, or describe what kind of columns it has?

Also no, that's not comparable to other isolation forest implementations - with those hyperparameters it corresponds to the ~~SCiForest~~ Fair-Cut forest variant (which is not standard and is only implemented in this package). For others (the regular isolation forest), you'd have to pass the following:

test100 <- isolation.forest(df[,c("variable1","variable2")],
    ntrees =100,
    sample_size=256,
    ndim=1,
    prob_pick_pooled_gain=0,
    prob_pick_avg_gain=0,
    missing_action = "fail",
    penalize_range = FALSE)

But nevertheless, you are likely to obtain better results with some non-random splits.

david-cortes commented 4 years ago

Fixed now. Could you please install from github (devtools::install_github("david-cortes/isotree") or remotes::install_github("david-cortes/isotree")) and confirm if it works for you now?

tararae7 commented 4 years ago

Got an error when trying to install from both methods. /isotree_0.1.13.tar.gz’ had non-zero exit status

david-cortes commented 4 years ago

Could you post the full output? Please use three backticks (```) above and below the pasted output text so that it doesn't get formatted as markdown text.

david-cortes commented 4 years ago

Just realized that both remotes and devtools have broken functionality on Windows when installing their latest versions from CRAN. Here's the source file - it can be installed from RStudio's Tools -> Install Packages... -> Install from -> Package Archive File (.tar.gz): isotree_0.1.13.tar.gz
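
From the R console, the equivalent would be something along these lines (assuming the attached file has been downloaded to the working directory):

## assuming the downloaded file is in the current working directory
install.packages("isotree_0.1.13.tar.gz", repos = NULL, type = "source")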

tararae7 commented 4 years ago

Thank you I will give it a try.

tararae7 commented 4 years ago

It did work this time (the original parameter configuration); however, the one you sent me is now failing when I use ndim=2:

isotree_mdl <- isolation.forest(isotree_features,
    ntrees=100,
    sample_size=256,
    ndim=2,
    prob_pick_pooled_gain=0,
    prob_pick_avg_gain=0,
    missing_action = "fail",
    penalize_range = FALSE)

david-cortes commented 4 years ago

Ok, thanks for the info, will fix that one too. In the meantime, you can try it without passing missing_action - the end result is the same, but it won't crash.

tararae7 commented 4 years ago

Yes, that worked. Thanks! Just a note: I am trying to compare to isofor and am not able to get the same results, but maybe that is because of the randomness in the algorithm.

david-cortes commented 4 years ago

Fixed the issue with missing_action="fail" - could you give it one last try?

About comparing it against isofor: that package is also a bit non-standard, and it handles categorical variables differently, so results will be different. More so if you're using ndim>1, which is the extended isolation forest model (different from what isofor does).

If you want to get comparable results, you'd have to use only numeric variables, the regular model (ndim=1), and many isolation trees in order to reduce the effect of randomness - the results should then be roughly the same as with isofor and IsolationForest (the latter uses the equivalent of penalize_range=TRUE).

isotree_0.1.13.tar.gz

tararae7 commented 4 years ago

I just tested with missing_action="fail" and it works. Thanks.

About comparing isotree and isofor: right now I am using isofor and converting categorical data to factors, but it is limited to 32 levels, which is why I am looking to use something more flexible.

I'd like to test out the extended version (ndim>1), but right now I am just trying to compare the two. I understand how isofor handles categorical variables, but I don't understand how isotree handles them. Right now it looks like it's doing some sort of grouping and giving the same score to groups of records. Can you please let me know if I am missing a parameter or something? Also, can you explain how categorical variables are handled in the algorithm? Thanks!

david-cortes commented 4 years ago

Thanks, good to know that it's working now.

As for categorical variables, it's explained in the documentation for the parameters that control them, but roughly: each split sends a random subset of the categories to one branch and the rest to the other, so the less frequent categories within a node tend to get isolated faster.

tararae7 commented 4 years ago

That is helpful thanks!

tararae7 commented 4 years ago

One more quick question, please. In your explanation above of the random subset selection when using ndim=1: is there any particular criterion used when doing the random splits on factors? Is this where entropy comes in? Since a category is represented as a factor and there isn't really any meaningful numeric value, I am having a hard time understanding how the anomalies get separated.

david-cortes commented 4 years ago

Anomalies for categorical variables get separated according to the frequency of each category within a tree node - thus, if all observations in a node have the same category except for 1 or 2, those 1 or 2 will be outliers and will get isolated quicker in expectation.

In the simplest case, if you have only 1 categorical variable with 2 levels, and you get these counts:

col1
 a | b
---|---
10 | 2

Then the data will get split just by that, and the observations having category "b" will end up in a node with fewer observations, thus will get assigned a higher outlier score.

If you have two categorical columns which are related, it's the same thing - for example, suppose you have the following cross-tabulation:

  |  a |   b
--|----|-----
p | 50 |   5
q |  3 | 100

Then you know the two columns are related, and some combinations are more likely to occur together. Since there are only 2 columns with 2 values each, it will get partitioned with each cell in that table ending up in its own terminal node, and the rare cells (a,q) and (b,p) will get a more anomalous score than the others.

When there are more categories, it's a similar process, only some trees will get opposite results; those results then get averaged, and you'll get something which, in expectation, does the same thing.

You can also check it in R through some toy example:

library(isotree)
library(dplyr)
### 'col1' is random uniform, 'col2' has a 90% chance of having a
### value that depends on 'col1'.
set.seed(123)
df <- data.frame(col1 = sample(c("a", "b"), 1000, replace=TRUE))
df %>%
    mutate(prop = ifelse(col1 == "a", .9, .1),
           col2 = ifelse(runif(1000) < prop, "p", "q")) %>%
    select(-prop) ->
    df

iso <- isolation.forest(df, ndim=1)
df$score <- predict(iso, df)

df %>%
    group_by(col1, col2) %>%
    summarize(cnt = n(), avg_score = mean(score)) %>%
    arrange(avg_score)
# A tibble: 4 x 4
# Groups:   col1 [2]
  col1  col2    cnt avg_score
  <chr> <chr> <int>     <dbl>
1 a     p       457     0.504
2 b     q       443     0.506
3 b     p        51     0.617
4 a     q        49     0.620

tararae7 commented 4 years ago

Thanks for the response. I will review to make sure I understand and run the example. There may be a few more questions. Thanks again

tararae7 commented 4 years ago

You said "Anomalies for categorical variables get separated according to the frequency of each category within a tree node" - where do the "random subsets" come in, then? Thanks!

david-cortes commented 4 years ago

Perhaps that was a bad explanation. I meant to say that the less frequent categories within a given node will tend to have higher outlier scores in the end - in the most extreme case, you could think of what happens if you have a single column with all the rows having the same value except for 1 row.

But the process of arriving at those outlier scores is different and that's where the randomization comes in: each possible grouping of categories has the same chance of being chosen regardless of the frequencies (if you are using random splits). If you have only 2 categories, there's only 1 possible way to split them, but with 3 or more it's different.

If you have only 1 column with 3 possible values (e.g. a, b, c), and you build random trees on it, then on average the outlier scores will be proportional to the frequency of each category in the data, but each individual tree might assign different scores to them (e.g. one tree will make the first split as ab|c while another will choose ac|b).

In R:

library(isotree)
set.seed(123)
df <- data.frame(col1=sample(c("a","b","c"),1000,T,c(0.8,0.15,0.5)))
df$score <- isolation.forest(df, output_score=T, ndim=1)$score
library(dplyr)
df %>%
    group_by(col1) %>%
    summarize(avg_score=mean(score), cnt=n()) %>%
    arrange(avg_score)
# A tibble: 3 x 3
  col1  avg_score   cnt
  <chr>     <dbl> <int>
1 a         0.502   560
2 c         0.528   345
3 b         0.588    95

Now, when you have more columns, the other splits will also be random and potentially on other columns, which will form groups whose frequency distributions within a given node differ from those in the whole data - that's where the relationship of one categorical variable with another variable comes in.

For example, if you have a numerical variable and a categorical variable which is just the numerical one discretized, then changing the value of the categorical variable to something else will give it a high outlier score, regardless of whether that other category is frequent in the data as a whole:

set.seed(123)
df <- data.frame(col1 = rgamma(1000, 10))
df$col2 <- cut(df$col1, breaks=4)
row.w.val.low <- (1:1000)[df$col1 <= 5][1]
row.w.val.high <- (1:1000)[df$col1 >= 7 & df$col1 <= 12][1]
df$col2[row.w.val.low] <- df$col2[row.w.val.high]
iso <- isolation.forest(df, ndim=1, ntrees=100)
df$score <- predict(iso, df)
df %>% arrange(desc(score)) %>% head()
       col1        col2     score
1  4.922060 (7.43,12.4] 0.7917227
2 22.414278 (17.4,22.4] 0.7879238
3  2.429620 (2.41,7.43] 0.7646923
4 21.513327 (17.4,22.4] 0.7622442
5  2.813401 (2.41,7.43] 0.7319049
6 20.305465 (17.4,22.4] 0.7287824

There you'll see that the row with the swapped category comes near the top of the scores, as it is an infrequent category conditional on some of the splits by the numeric column, even though the category itself is common in the data when not looking at the numeric column.

tararae7 commented 4 years ago

Thank you! I really appreciate your responsiveness. In your extreme example of all rows having the same value except for 1 row (a: 10 / b: 1), this is what I understood:

Since there are only two categories, the 'a' category with 10 records would be one terminal node and the 'b' category with 1 record would be another terminal node. That is where the splitting ends, and they essentially get the same anomaly score?

david-cortes commented 4 years ago

Not exactly - you can check the reference papers to see how the score is determined from the node size.

tararae7 commented 4 years ago

Can you point me in the direction of where the reference papers talk about the scoring calculation? Is it in the isofor pdf or somewhere else? Thanks

tararae7 commented 4 years ago

I found your paper written on Distance approximation using Isolation Forests and I am assuming that explains it. I will take a look. Thanks

david-cortes commented 4 years ago

It's explained in the original isolation forest paper, on page 3: Liu, Fei Tony, Kai Ming Ting, and Zhi-Hua Zhou. "Isolation forest." 2008 Eighth IEEE International Conference on Data Mining. IEEE, 2008.
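
For reference, here's a rough sketch of that formula (just for illustration - not this package's internal code): the score is s(x, n) = 2^(-E[h(x)] / c(n)), where h(x) is the depth at which x gets isolated and c(n) is the average path length of an unsuccessful binary search tree lookup among n points; terminal nodes holding more than one row add c(node size) to the depth.

## Scoring formula from the Liu et al. (2008) paper, for illustration only
## (not the package's internal implementation).
c_n <- function(n) {
    ## average path length of an unsuccessful BST search among n points,
    ## using the harmonic-number approximation H(i) ~ ln(i) + 0.5772156649
    if (n <= 1) return(0)
    2 * (log(n - 1) + 0.5772156649) - 2 * (n - 1) / n
}
iso_score <- function(avg_depth, n) 2 ^ (-avg_depth / c_n(n))

## Example with a sample of 11 rows (10 'a', 1 'b', single possible split):
iso_score(1, 11)            # the lone 'b', isolated at depth 1 -> ~0.84
iso_score(1 + c_n(10), 11)  # the 10 'a' rows, depth 1 + c(10)  -> ~0.43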

tararae7 commented 4 years ago

Thanks David. I am working with 2 million records initially using just 2 features in the model. The model fitting runs really fast but the predict never finishes. Any suggestions?

david-cortes commented 4 years ago

The predict functionality is optimized for predicting scores on a single or few rows, not for predicting scores on large numbers of observations, so it can actually take longer to call predict on many rows than to build the model itself – in this case, you can use output_score=TRUE, which will calculate the scores as it fits the model.
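
A minimal sketch of that, reusing the $score element shown in the toy examples above (parameters here are illustrative):

## Scores are computed while the model is being fit, so no separate
## predict() pass over the full data is needed.
res <- isolation.forest(isotree_features, ntrees=100, sample_size=256, ndim=1,
                        output_score=TRUE)
scores <- res$score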

However, if you have 2M rows and you are building models with the default parameters, it should take a very long time to build the model - way longer than to call predict - so I'm guessing that you're not using the default parameters; and nevertheless, making predictions on only 2M rows should take at most a few seconds on a modern computer.

This might be another crash-like error that gets stuck in an infinite loop or something like that - please provide some info on how you are building the model (the exact function call) and what type of data you are passing (column types, number of factor levels if you have factors, and better yet, a sample data-generation code snippet).

tararae7 commented 4 years ago

This is the way i am building the model:

The two features I am using are both character columns, one with 1,216 levels and one with 137 levels.

isotree_mdl <- isolation.forest(isotree_features,
                                ntrees=100, sample_size=256, ndim=1,
                                prob_pick_pooled_gain=1,
                                prob_pick_avg_gain=0)

david-cortes commented 4 years ago

I'm not able to reproduce the issue:

library(isotree)

m <- 2e6
df <- data.frame(
    col1 = as.character(sample(1216, size=m, replace=TRUE)),
    col2 = as.character(sample(137, size=m, replace=TRUE))
)

isotree_mdl <- isolation.forest(df, ntrees =100,sample_size=256,ndim=1,
                                prob_pick_pooled_gain=1,
                                prob_pick_avg_gain=0)
microbenchmark::microbenchmark({p <- predict(isotree_mdl, df)}, times=10L)
Unit: milliseconds
                                  expr      min       lq     mean   median       uq      max neval
 {     p <- predict(isotree_mdl, df) } 830.5357 854.8574 941.1273 993.4023 1007.743 1013.983    10

(i.e. median and maximum time are around 1s when I run them)

Does the code above also take long to run?

tararae7 commented 4 years ago

That only took about 30 seconds to run

Unit: seconds
                                  expr      min       lq     mean   median       uq      max neval
 {     p <- predict(isotree_mdl, df) } 3.770995 3.815796 4.061963 3.955422 4.224178 4.786653    10

I wonder why that would work but not the model I ran...

tararae7 commented 4 years ago

Oh, I see one difference: you are using factors. Let me convert to factors.

tararae7 commented 4 years ago

Still not working after I converted to factors. That seems really weird to me.

david-cortes commented 4 years ago

Ok, I was able to reproduce the issue, but only when also passing new_categ_action="smallest" and categ_split_type="single_categ" - will investigate it.

tararae7 commented 4 years ago

Ok so no way around it at this time?

david-cortes commented 4 years ago

No, and it will take a while to fix. But basically the issue is that, if it ends up with some sample in which all rows have a different category, this will happen - I hadn't foreseen such situations happening, and really, this type of model is not meant to tackle such data samples.

Also, on top of that, there's another (fixable) issue with the gain calculations, which I'll try to fix.

tararae7 commented 4 years ago

Thanks. Are you saying this model is not appropriate for my data?

david-cortes commented 4 years ago

Indeed - if you have only 2 categorical variables, then what you'll get in the end is simply a score which depends on their cross-frequency. If you had more columns and tried very large sample sizes, you might get reasonable results, but it would take a long time. Also, you might want to convert those categories into something else (e.g. represent them by their frequency).

tararae7 commented 4 years ago

Oh, I have many other features I could use. I thought that starting with only two would be helpful. I'll try more features in my model.

david-cortes commented 4 years ago

Fixed both issues and did some extra tests when fitting the model to categorical-only variables - please give it one more try: isotree_0.1.13.tar.gz

david-cortes commented 4 years ago

Fixed yet another crash when there are edge cases in both numeric and categorical variables, in case you also encounter it - guess this one will go into CRAN later this week: isotree_0.1.13.tar.gz

tararae7 commented 4 years ago

Still taking a long time. I went to lunch while it ran and stopped it when I got back.

It's weird that all of the records have a score of at least 0.5 and nothing above 0.6. I was expecting many records below 0.5 and fewer closer to 1. This is when I add more features to the model and it finishes.

tararae7 commented 4 years ago

David,

I am curious why my dataset isn't appropriate for this model? I haven't read anything so far that indicates I wouldn't have success finding anomalies in my data set, which is Pcard transactions. Do you suggest a different model type?

david-cortes commented 4 years ago

Well, if you are passing only a few columns with hundreds or thousands of categorical levels, and you are using small sample sizes such as 256, then the trees will pretty much have each row with a different value. Each tree will pick different levels, and each tree will treat the rest of the categories as new/unseen when you apply it to the rest of the data (i.e. all other observations get the same score from that tree), which leads to uniform scores all very close to 0.5 (unless you use tens of thousands of trees with very large sample sizes each, which is likely to take too much time).

Also if the categories are more or less independent of each other, you'll have a hard time drawing a threshold for considering anomalies regardless of the method.

I don't think I could recommend any method for this type of data - pretty much all algorithms and criteria are oriented towards numeric-only data. You might try 1-d methods, or association rules (e.g. the arules package), or the outliertree package, but the latter is quite slow and not aimed at large datasets; it also depends on the kind of outliers that you expect to find.

So you'd likely have to reduce the number of categories beforehand - e.g. by merging categories, or by swapping categories for aggregated indicators by category or something like that (you might want to ask in some place like stack exchange or similar).
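
A rough sketch of the frequency idea, using dplyr (the column name merchant_id is just a placeholder for one of your categorical columns):

library(dplyr)

## Replace a high-cardinality categorical column with the relative frequency
## of each category, so the model gets a numeric column instead.
isotree_features %>%
    add_count(merchant_id, name = "merchant_freq") %>%
    mutate(merchant_freq = merchant_freq / n()) %>%
    select(-merchant_id) ->
    isotree_features_encoded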


About the predict that still doesn't finish: does it work if you try to make predictions on only 5 rows or so? If not, could you again please give some info about what parameters you used and what kind of columns you have?

tararae7 commented 4 years ago

Everything seems to be working fine now. I did upgrade R for other reasons.

tararae7 commented 4 years ago

Hi David,

I just wanted to confirm with you: if I have 2 million records with 24 features, all except 1 categorical, you are saying that using isotree will not give good results for finding anomalies? I just want to make sure you understand that I have more than 2 features; I was only using 2 as an example to compare against other isolation forest packages.

I have been researching anomaly detection algorithms for a while, and before I move on to a different solution, I want to make sure I understand that this will not work because of my data set.

david-cortes commented 4 years ago

I'm not saying it won't give good results, but I am saying that it won't give good results with the hyperparameters that you want to use, because in practice it will just get samples of data in which every row has a different value for every category, and too few trees to reflect the expected values from random choices. If you were to use it instead with no sub-sampling and 50k trees, chances are you'd get good results, but that's not feasible to run on a desktop computer. Perhaps you might get good results with something intermediate (e.g. 100k sample size and 5k trees), but I cannot tell you for sure, and in any case it will be very slow and use a lot of memory.
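
For concreteness, that intermediate setting would look something like this as a call (not a tested recommendation - expect long runtimes and high memory usage):

isotree_mdl <- isolation.forest(isotree_features,
                                ndim=1,
                                ntrees=5000,
                                sample_size=100000)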

tararae7 commented 4 years ago

Thanks David. I am trying frequency encoding for the categorical variables. The issue I am having now is that the records whose categories have higher frequencies are ending up being flagged as the unusual ones - basically the opposite of what I am trying to do.

david-cortes commented 4 years ago

Next issue there is: you're using the pooled gain option, which tends to find "rare" cases only within the distribution of the data from each tree, but generalizes poorly to new data. In your case, if you take a sample size of 256, pretty much all data is new data for a given tree, and such data will very likely be flagged as non-outlier - you can check the plots in the example in the documentation for an idea of what that option does.
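
If you want to see the effect yourself, here's a rough sketch on synthetic numeric data (not your data - just a way to compare how each setting scores rows that the trees never saw):

library(isotree)

set.seed(1)
train <- data.frame(x = rnorm(1000), y = rnorm(1000))
test  <- data.frame(x = rnorm(1000), y = rnorm(1000))

## Identical settings except for the pooled-gain option.
m_pooled <- isolation.forest(train, ndim=1, sample_size=256,
                             prob_pick_pooled_gain=1, prob_pick_avg_gain=0)
m_random <- isolation.forest(train, ndim=1, sample_size=256,
                             prob_pick_pooled_gain=0, prob_pick_avg_gain=0)

## Score distributions on rows none of the trees were fit on.
summary(predict(m_pooled, test))
summary(predict(m_random, test))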

david-cortes commented 4 years ago

All updates pushed to CRAN now.