david-cortes / isotree

(Python, R, C/C++) Isolation Forest and variations such as SCiForest and EIF, with some additions (outlier detection + similarity + NA imputation)
https://isotree.readthedocs.io
BSD 2-Clause "Simplified" License

isolation.forest() is not reproducible whenever `nthreads > 1` #26

Closed bgreenwell closed 3 years ago

bgreenwell commented 3 years ago

Hi @david-cortes, thanks for a great package. I'm writing a book on tree-based methods and am including a section on isolation forests using your package (which works really well). I've noticed, however, that the anomaly scores are not reproducible (at least for me) when specifying the seed via set.seed() or the random_seed argument. Reproducible example below:

library(isotree)

# Generate fake data (no anomalies)
set.seed(101)
X <- as.data.frame(matrix(rnorm(5 * 100), ncol = 5))

# Fit an isolation forest
ifo <- isolation.forest(X, random_seed = 102)

# Compute anomaly scores
head(scores <- predict(ifo, newdata = X))
# [1] 0.4002608 0.4996714 0.5253563 0.4303659 0.4204118 0.4323855

#
# Run again, but notice different scores with same seed
#

# Generate fake data (no anomalies)
set.seed(101)
X <- as.data.frame(matrix(rnorm(5 * 100), ncol = 5))

# Fit an isolation forest
ifo <- isolation.forest(X, random_seed = 102)

# Compute anomaly scores
head(scores <- predict(ifo, newdata = X))
# [1] 0.3950409 0.4929140 0.5302152 0.4239435 0.4225947 0.4325836

Is this a bug, or am I missing something?

bgreenwell commented 3 years ago

Just realized this is noted in the README (apologies for not noticing that first), and the issue seems fixed when setting nthreads = 1. Nonetheless, would it be appropriate to leave this as an open issue? This is especially problematic since nthreads > 1 by default in most cases. Perhaps consider changing the default to nthreads = 1.
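
For reference, a minimal sketch of the single-thread workaround described above. It reuses the data and seed from the first post; `fit_and_score` is a hypothetical helper introduced only for this illustration, and the final comparison is expected (not guaranteed) to be TRUE when `nthreads = 1`.

```r
library(isotree)

# Same fake data as in the original report
set.seed(101)
X <- as.data.frame(matrix(rnorm(5 * 100), ncol = 5))

# Hypothetical helper: fit with a fixed seed on a single thread, then score
fit_and_score <- function(data) {
  ifo <- isolation.forest(data, random_seed = 102, nthreads = 1)
  predict(ifo, newdata = data)
}

scores_1 <- fit_and_score(X)
scores_2 <- fit_and_score(X)

# Exact comparison of the two runs
same_scores <- identical(scores_1, scores_2)
print(same_scores)
```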

david-cortes commented 3 years ago

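It should be reproducible, but there was a bug in previous versions which would make it irreproducible, especially when ndim is odd. Some questions:

  • Are you using the latest version?
  • Is it also irreproducible if you use coefs="uniform"?
  • Is it also irreproducible if you use ndim=1 or ndim=2?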

bgreenwell commented 3 years ago

Ahh, I was indeed using an older version. Seems to work fine in the latest CRAN release (v0.2.7)! Thanks for the quick response @david-cortes .
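
A quick way to check for an outdated install before reporting, sketched with base R only. `is_at_least` is a hypothetical helper name; 0.2.7 is the CRAN release mentioned above as working for the original report.

```r
# Hypothetical helper: TRUE if the installed version of `pkg` is at least `minimum`
is_at_least <- function(pkg, minimum) {
  utils::packageVersion(pkg) >= minimum
}

# Only query isotree if it is actually installed
if (requireNamespace("isotree", quietly = TRUE)) {
  print(utils::packageVersion("isotree"))
  print(is_at_least("isotree", "0.2.7"))
}
```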

ThomasZhang717 commented 3 years ago

> It should be reproducible, but there was a bug in previous versions which would make it irreproducible, especially when ndim is odd. Some questions:
>
>   • Are you using the latest version?
>   • Is it also irreproducible if you use coefs="uniform"?
>   • Is it also irreproducible if you use ndim=1 or ndim=2?

Hi David, I know this issue is closed, but I think there is no need to open a similar one. I have the same issue. I believe it should be reproducible. However, when ndim = 1, it is not.

david-cortes commented 3 years ago

@ThomasZhang717 Thanks for the information. I'm however unable to find non-reproducibility.

The following snippet always gives the same result for me:

library(isotree)
set.seed(1)
X <- matrix(rnorm(100 * 5), ncol=5)
rnd_ix <- matrix(c(sample(100, size=20, replace=TRUE),
                   sample(5, size=20, replace=TRUE)), ncol=2)
X[rnd_ix] <- NA
model <- isolation.forest(X, ndim=1, missing_action="divide",
                          random_seed=123, nthreads=3)
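predict(model, X)

  • What kind of input data are you passing? (e.g. df, matrix, types of columns, etc.)
  • Are you able to make a small example with random data?

ThomasZhang717 commented 3 years ago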

> @ThomasZhang717 Thanks for the information. I'm however unable to find non-reproducibility.
>
> The following snippet always gives the same result for me:
>
> library(isotree)
> set.seed(1)
> X <- matrix(rnorm(100 * 5), ncol=5)
> rnd_ix <- matrix(c(sample(100, size=20, replace=TRUE),
>                    sample(5, size=20, replace=TRUE)), ncol=2)
> X[rnd_ix] <- NA
> model <- isolation.forest(X, ndim=1, missing_action="divide",
>                           random_seed=123, nthreads=3)
> predict(model, X)
>
>   • What kind of input data are you passing? (e.g. df, matrix, types of columns, etc.)
>   • Are you able to make a small example with random data?

Hi, David. I have tried the same code from your reply, and yes, it gives me the same results. However, if I add sample_size = 50 in isolation.forest, the results differ slightly between runs, as in A and B below. I left this out of my last reply, my bad.

A.

  [1] 0.4268346 0.4931192 0.5216219 0.4354687 0.5143806 0.5548655 0.4078764 0.4211939 0.4374525
 [10] 0.4108939 0.4422375 0.3918356 0.4751745 0.5103656 0.4215653 0.4399065 0.4014527 0.4310644
 [19] 0.4747911 0.4199024 0.5115887 0.4848320 0.3998537 0.4907188 0.4206371 0.4873190 0.4159650
 [28] 0.4799154 0.4195693 0.3867677 0.4945219 0.5163964 0.4270563 0.4516901 0.4399254 0.4307343
 [37] 0.4139854 0.4142398 0.4335430 0.4079231 0.4539450 0.5146048 0.4380907 0.4149262 0.5220550
 [46] 0.5205168 0.4969488 0.3905622 0.4176541 0.5524954 0.4501555 0.3969415 0.3859549 0.4474579
 [55] 0.5252444 0.4851237 0.5147711 0.4598579 0.4973965 0.4574082 0.5514348 0.4821773 0.4498562
 [64] 0.4313696 0.4457145 0.4796373 0.4570787 0.4943975 0.4146116 0.4703239 0.4833987 0.4250820
 [73] 0.4212746 0.4949767 0.4767075 0.4213967 0.5084094 0.4753492 0.4625709 0.4879854 0.4181288
 [82] 0.4450953 0.4618560 0.4974802 0.4780901 0.4630327 0.5233225 0.4190874 0.4849765 0.4408392
 [91] 0.4337941 0.4856868 0.4475284 0.4498174 0.5686211 0.4389855 0.5160242 0.4328694 0.4403108
[100] 0.4185283

and

B.

  [1] 0.4265170 0.4931192 0.5217908 0.4353271 0.5138443 0.5547556 0.4077856 0.4211001 0.4374525
 [10] 0.4107204 0.4417625 0.3916397 0.4751519 0.5110519 0.4213545 0.4401813 0.4013222 0.4309243
 [19] 0.4747284 0.4197659 0.5112202 0.4844610 0.3996554 0.4907188 0.4205003 0.4876234 0.4156499
 [28] 0.4797594 0.4197446 0.3865743 0.4942121 0.5163964 0.4268513 0.4511156 0.4399044 0.4305637
 [37] 0.4138018 0.4141051 0.4333262 0.4078067 0.4538551 0.5145708 0.4380907 0.4147913 0.5220550
 [46] 0.5205168 0.4967287 0.3904352 0.4171229 0.5523859 0.4498859 0.3966461 0.3858294 0.4471190
 [55] 0.5246195 0.4848811 0.5147711 0.4595157 0.4973965 0.4573865 0.5513256 0.4818207 0.4501879
 [64] 0.4312980 0.4458587 0.4791602 0.4575123 0.4937687 0.4144768 0.4703239 0.4833757 0.4247656
 [73] 0.4215005 0.4951402 0.4767075 0.4212098 0.5080533 0.4748763 0.4625489 0.4879044 0.4175143
 [82] 0.4448981 0.4621184 0.4975787 0.4780350 0.4628822 0.5233225 0.4189015 0.4849765 0.4410233
 [91] 0.4334816 0.4855787 0.4471363 0.4497960 0.5685048 0.4391440 0.5157833 0.4326775 0.4405469
[100] 0.4182583

I ran the entire code 10 times. Mostly it gives me A; sometimes it gives B. A and B differ only slightly, for example in the first element.

In my case, the format is almost the same as in your example; the difference is that I use a data frame.
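
To put a number on how far apart two runs are, a small base-R sketch using only the first three scores of A and B as posted above. The variable names and the 1e-2 tolerance are arbitrary choices for illustration, not anything prescribed by isotree.

```r
# First three scores of outputs A and B, copied from the post above
a <- c(0.4268346, 0.4931192, 0.5216219)
b <- c(0.4265170, 0.4931192, 0.5217908)

# Exact comparison: FALSE as soon as any element differs
exactly_equal <- identical(a, b)

# Positions where the two runs disagree
differing <- which(a != b)

# Largest absolute discrepancy between the two runs
max_diff <- max(abs(a - b))

# Approximate comparison with an (arbitrary) relative tolerance
roughly_equal <- isTRUE(all.equal(a, b, tolerance = 1e-2))

print(list(exactly_equal = exactly_equal, differing = differing,
           max_diff = max_diff, roughly_equal = roughly_equal))
```

An exact check is the right one for a reproducibility bug like this; the tolerance-based check only shows that the discrepancy is small, not that it is absent.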

david-cortes commented 3 years ago

@ThomasZhang717 I'm still unable to find any irreproducibility.

This code runs without problems on my setup:

library(isotree)
set.seed(1)
X <- matrix(rnorm(100 * 5), ncol=5)
rnd_ix <- matrix(c(sample(100, size=20, replace=TRUE),
                   sample(5, size=20, replace=TRUE)), ncol=2)
X[rnd_ix] <- NA
set.seed(1)
for (i in 1:100) {
    model <- isolation.forest(X, ndim=1, missing_action="divide",
                              sample_size=50, random_seed=123,
                              nthreads=sample(3, size=1)+1)
    pred <- predict(model, X)

    if (i > 1) {
        diff = abs(pred - last_pred)
        if (any(diff != 0))
            stop("Different results")
    }
    last_pred <- pred
}

And same if I change it to as.data.frame(X) or to a more uneven distribution like rgamma.

Some more questions:

david-cortes commented 3 years ago

I was now able to reproduce the issue, but only on Windows. Will investigate.

ThomasZhang717 commented 3 years ago

> I was now able to reproduce the issue, but only on Windows. Will investigate.

Yes, I was trying this on my old Mac, where there is no issue and it works perfectly. However, when I move to Windows, it happens.

For some more information:

Thanks, David. ;)

david-cortes commented 3 years ago

@ThomasZhang717 I've pushed a small update which should fix the problem. Could you try the latest version from GitHub and see if you still experience this bug?

remotes::install_github("david-cortes/isotree")

ThomasZhang717 commented 3 years ago

> @ThomasZhang717 I've pushed a small update which should fix the problem. Could try the latest version from GitHub and see if you still experience this bug?
>
> remotes::install_github("david-cortes/isotree")

I tried to install the package, but it gives me an error. I tried both remotes and devtools, and both give the same error during installation.

Error message:

mult.hpp:959:13: warning: enumeration value 'Divide' not handled in switch [-Wswitch]
make: *** [C:/PROGRA~1/R/R-41~1.0/etc/i386/Makeconf:245: Rwrapper.o] Error 1

Warning message:

In i.p(...) : installation of package ‘C:/Users/Thomas/AppData/Local/Temp/RtmpgTgzmR/file319c4452737e/isotree_0.2.10.tar.gz’ had non-zero exit status

david-cortes commented 3 years ago

Fixed again - could you give it another try now?

ThomasZhang717 commented 3 years ago

I tried the example code and my own code. It seems that the issue is solved. Congrats. ;)