david-cortes / isotree

(Python, R, C/C++) Isolation Forest and variations such as SCiForest and EIF, with some additions (outlier detection + similarity + NA imputation)
https://isotree.readthedocs.io
BSD 2-Clause "Simplified" License

Error in fit_model #8

Closed · tararae7 closed this issue 4 years ago

tararae7 commented 4 years ago

I get this error when running the following model, but I don't always get the error. Sometimes it runs fine.

Error in fit_model(pdata$X_num, pdata$X_cat, unname(pdata$ncat), pdata$Xc, : negative length vectors are not allowed

isotree_mdl2 <- isolation.forest(df,

david-cortes commented 4 years ago

Thanks for the bug report. Could you provide some more information:

david-cortes commented 4 years ago

Also the line number from the error and/or the full error trace would be helpful.

david-cortes commented 4 years ago

Come to think of it, I imagine this might be an issue with integer overflow. Do the inputs that you pass have a number of entries (rows × columns) greater than 2^31 (~2.1 billion)? Are you using an R version < 4.0.0?
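
A quick sketch to check that, assuming the input data frame is the `df` from your call above (counting the entries as doubles so the product itself cannot overflow):

# total entries, computed as a double to avoid integer overflow in the product
n_entries <- as.numeric(nrow(df)) * as.numeric(ncol(df))
print(n_entries)
print(n_entries > 2^31)   # TRUE means the input exceeds the 2^31-entries limit mentioned above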

tararae7 commented 4 years ago

Hi David,

Thanks for responding. I am passing 3 categorical variables with many levels and 2 categorical variables that are 0 or 1, so 5 features total, essentially all categorical. I am not converting them to factors; they are all 'chr'. Would it help to convert to factors? The data frame has about 1.5 million records, but as you saw I am using 256 in the subsets. I noticed while testing that, with everything else staying the same, changing ntrees from say 200 to 600 makes the difference: 200 works and 600 gives me the error. I am passing nthreads because I was trying to figure out how to use parallel processing. Does that not work that way?

R version we are on is 3.4.2

tararae7 commented 4 years ago

I tried changing all my features to factors and got a different error.

Error in fit_model(pdata$X_num, pdata$X_cat, unname(pdata$ncat), pdata$Xc, : std::bad_alloc

david-cortes commented 4 years ago

Well, from what you got there:

negative length vectors are not allowed

Means one of the following: (a) some input has more than 2^31 entries (you can get around this limitation by updating R to 4.0.0); (b) some size calculation returned a negative number, which would be a bug. This error happens before the library C++ code even starts.

std::bad_alloc

Means one of the following: (a) you ran out of memory; (b) somewhere, an attempt is made to create an array with more than 2^64 entries; (c) some size calculation returned a negative number. This error happens already inside the library's C++ code.

Most likely scenario: you are sub-setting the input that you pass, and are doing so incorrectly, with the actual number of rows x columns being between 2^31 and 2^64.

Try the following: assign the input data to some temporary object before calling isolation.forest, and run diagnostics on it. Please post the results of this:

the_model_input <- df
print(class(the_model_input))
print(NROW(the_model_input))
print(NCOL(the_model_input))
print(parallel::detectCores()-9)
isolation.forest(
    the_model_input,
    ntrees = 600,
    sample_size = 256,
    ndim = 1,
    prob_pick_pooled_gain = 0,
    prob_pick_avg_gain = 0,
    penalize_range = FALSE,
    missing_action = "fail",
    nthreads = parallel::detectCores() - 9
)

If that still looks OK, please try updating to R 4.0.0 or higher.

tararae7 commented 4 years ago

Not sure I understand how i would be sub-setting wrong, but i still get the error.

the_model_input <- isotree_features_sub
print(class(the_model_input))
[1] "data.frame"
print(NROW(the_model_input))
[1] 1462476
print(NCOL(the_model_input))
[1] 5
print(parallel::detectCores()-9)
[1] 7
isotree_mdl <- isolation.forest(

david-cortes commented 4 years ago

Did you try reinstalling Rcpp after you updated the compiler?

tararae7 commented 4 years ago

Not sure, but the version of Rcpp is 1.0.4.6.

david-cortes commented 4 years ago

Please try reinstalling both Rcpp (first) and isotree (after reinstalling Rcpp) with the new compiler. It might be some issue with C++ headers not matching compiled code.
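
For reference, a minimal sketch of that (just the standard install.packages calls, building both from source so they get recompiled with the current compiler):

# reinstall Rcpp first, then isotree, both built from source
install.packages("Rcpp", type = "source")
install.packages("isotree", type = "source")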

tararae7 commented 4 years ago

I can go request that to be done. Will that potentially fix both problems?

I tested this locally where I have R 4.0.2 and I am having the same problem, so I don't think that is the issue. Please let me know if you have any other suggestions.

david-cortes commented 4 years ago

Oh well, then it won't help.

tararae7 commented 4 years ago

Hi David,

I am trying to understand what parameter values are causing these errors; my plan is to create a grid search over ntrees and sample_size. Can I ask at what point you know you have a sufficient number of trees and a large enough sample size for a good model? I am assuming the scores will converge; is that accurate?

david-cortes commented 4 years ago

Yes, the scores should converge as the number of trees increases, but at which point that happens is hard to predict and will depend a lot on the inputs that you pass and the hyperparameters that you use.
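
Once the errors are sorted out, a rough way to check convergence empirically (a sketch only, reusing the parameters from earlier in this thread; predict returns the outlier scores by default) is to fit models with increasing numbers of trees and compare their scores:

# fit the same model with more trees and compare the outlier scores;
# if they have converged, the rank correlation should be close to 1
mdl_small <- isolation.forest(isotree_features_sub, ntrees = 200, sample_size = 256, ndim = 1)
mdl_large <- isolation.forest(isotree_features_sub, ntrees = 600, sample_size = 256, ndim = 1)
scores_small <- predict(mdl_small, isotree_features_sub)
scores_large <- predict(mdl_large, isotree_features_sub)
print(cor(scores_small, scores_large, method = "spearman"))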

Nevertheless, if you are getting these errors, chances are that the model is not doing what it should do and the scores will probably not make sense.

This time, I'm unable to reproduce the error, so I'll ask you to install and run this modified version of the library on your setup with the parameters that make it crash (ideally running it with both char and factor columns), and paste here the full log that it produces when you call isolation.forest. Please use 3 backticks (`) before and after the output to have it formatted as plain text.

Also, if possible, I'd like to ask you to upload some anonymized sample data that could reproduce the error.

isotree_0.1.18.tar.gz

tararae7 commented 4 years ago

I am happy to do this test, but how do I know it's not just a memory limitation?

david-cortes commented 4 years ago

Aside from running out of RAM, the only memory limitations there can be are on the input data size (your data is way below that limit of 2^31 entries), on the model size (it cannot be bigger than 2GB, but there's no way it can reach that with the parameters you're using; you can see the size in the environment pane in RStudio), and on the intermediate model objects (on an x86_64 computer, that limit is not going to be reached), so it is likely a bug somewhere.

So please test it, it's just about printing the sizes and types of the inputs and parameters at different points in the code.

If you want to make sure it's not a memory limitation, you can call the python version of the package through reticulate, which is not constrained by the 2^31/2GB limits from R. Should be something like this, assuming you have a configured python environment with the package installed:

library(reticulate)
isotree_lib <- import("isotree")
iso <- isotree_lib$IsolationForest(<your parameters go here>)
iso$fit(df)

tararae7 commented 4 years ago

I just tried to install the attached isotree package you included and received the following error. Can I use the version from install.packages("isotree")?

install.packages("C:/isotree_0.1.18.tar.gz", repos = NULL, type="source") WARNING: Rtools is required to build R packages but is not currently installed. Please download and install the appropriate version of Rtools before proceeding:


Installing package into ‘C:/R/Library’
(as ‘lib’ is unspecified)
* installing *source* package 'isotree' ...
** using staged installation
** libs

*** arch - i386
Warning in system(cmd) : 'make' not found
ERROR: compilation failed for package 'isotree'
* removing 'C:/R/Library/isotree'
Warning in install.packages :
  installation of package ‘C:/isotree_0.1.18.tar.gz’ had non-zero exit status

david-cortes commented 4 years ago

In Windows, you need RTools to install packages like this from source.

tararae7 commented 4 years ago

OK. I was able to install the 0.1.18 version from CRAN. Can I use that for the test?

david-cortes commented 4 years ago

No, that one doesn't have print statements. The one I uploaded here is the same, but it has print statements at different points to see exactly where it fails.

tararae7 commented 4 years ago

I was able to get RTools to be recognized, but now I am getting this error.


Installing package into ‘C:/R/Library’
(as ‘lib’ is unspecified)
* installing *source* package 'isotree' ...
** using staged installation
** libs

*** arch - i386
"C:/R/Library/rtools40/mingw32/bin/"g++  -std=gnu++11 -I"C:/Program Files/R/R-40~1.2/include" -DNDEBUG -D_FOR_R -D_USE_MERSENNE_TWISTER -D_ENABLE_CEREAL -I'C:/R/Library/Rcpp/include' -I'C:/R/Library/Rcereal/include'     -fopenmp   -O2 -Wall  -mfpmath=sse -msse2 -mstackrealign -c RcppExports.cpp -o RcppExports.o
sh: C:/R/Library/rtools40/mingw32/bin/g++: No such file or directory
make: *** [C:/Program Files/R/R-40~1.2/etc/i386/Makeconf:229: RcppExports.o] Error 127
ERROR: compilation failed for package 'isotree'
* removing 'C:/R/Library/isotree'
* restoring previous 'C:/R/Library/isotree'
Warning in install.packages :
  installation of package ‘C:/isotree_0.1.18.tar.gz’ had non-zero exit status

david-cortes commented 4 years ago

Looks like an issue with RTools not having all the tools it needs. Perhaps you installed the wrong one? I see it says arch i386 (32-bit), rather than x86-64 or similar (64-bit).

Can you try to install it in the RServer version that you were using?

tararae7 commented 4 years ago

The version of RTools I installed installs both 32-bit and 64-bit. For some reason it's looking for the 32-bit one. I am trying to figure out why.

david-cortes commented 4 years ago

That's probably because you have a 32-bit R install. Also, RTools is not an R library, so it shouldn't be under C:/R/Library/rtools40 - did you manually install it there? If not, you might need to change that configuration to point to the folder in which RTools is installed - see this post for example: https://stackoverflow.com/questions/47539125/how-to-add-rtools-bin-to-the-system-path-in-r
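
For rtools40, the usual fix (a sketch; adjust it if your installation differs) is to add the rtools40 tools to the PATH that R uses for builds by writing this line to ~/.Renviron and then restarting R:

# make rtools40's make/compilers visible to R's build system
# (RTOOLS40_HOME is an environment variable set by the rtools40 installer)
writeLines('PATH="${RTOOLS40_HOME}\\usr\\bin;${PATH}"', con = "~/.Renviron")
# after restarting R, this should point inside the rtools40 folder:
Sys.which("make")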

david-cortes commented 4 years ago

I think I managed to make a version that would be installable without RTools, please try this one (will only work in Windows): isotree_0.1.18.zip

tararae7 commented 4 years ago

Thank you. That looks to have worked. I'll send the details shortly.

tararae7 commented 4 years ago

Not sure if I know what you mean by the full log, so let me know if this is not what you were referring to. This was run with factors.

isotree_mdl <- isolation.forest(isotree_features_sub,

david-cortes commented 4 years ago

Thanks. So the package code is working fine after all.

What happens there is: the resulting model object is too big for R to handle (on your server) and/or too big for your RAM (in the example you posted). And it is too big because you have a categorical column with 250k levels. Each tree that uses that column has to keep a vector indicating whether each category belongs to one branch or the other, and that adds up quickly. Potential solutions: (a) decrease the number of categories per column; (b) call the Python version through reticulate.
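
As a rough back-of-the-envelope sketch of why that adds up (illustrative figures only, not the library's exact internal layout):

# assume each split on the 250k-level column stores ~1 entry per category,
# ~8 bytes per entry, a handful of such splits per tree, and many trees
n_levels        <- 250000
bytes_per_entry <- 8
splits_per_tree <- 5
n_trees         <- 600
est_gb <- n_levels * bytes_per_entry * splits_per_tree * n_trees / 1024^3
print(est_gb)   # ~5.6 GB under these illustrative assumptions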

Also again: you're not going to get good results out of that with such sample sizes.

tararae7 commented 4 years ago

Is the 'negative length vectors are not allowed' error the same issue? I think you are saying I won't get good results because of the large number of levels, but that is why I am trying to increase the sample size and trees. What is an average number of levels that you would suggest? Can you explain what you mean by "you're not going to get good results out of that with such sample sizes"? Sorry if you mentioned this before and I missed it.

david-cortes commented 4 years ago

Yes, it's the same issue (R limitation) - you can google "integer overflow". In your desktop case you additionally ran out of RAM (std::bad_alloc).

About the number of categories: I don't know. You probably shouldn't expect good performance with more than 5-10 categories per column unless you create much bigger models (bigger by many orders of magnitude). That's because whatever averaged random patterns are determined from such small samples will deviate a lot from the expected value [the value you'd get if the number of trees were infinite and the sample per tree equaled the number of rows], and the model will likely just pick some random different category at each tree split, with little overlap across such a limited number of trees (that is, the scores will look closer to random uniform).

If you're not convinced, you can try a small simulation as follows: try to estimate the proportion of each category level in that column with 250k levels by drawing 400 random samples of 1k observations each, then average the obtained proportions over those 400 samples and see how much they differ from the real proportions.
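
A rough sketch of that simulation (with a synthetic, roughly uniform 250k-level column standing in for your data, just for illustration):

# synthetic column with 250k levels and ~1.46M rows
set.seed(1)
n_levels <- 250000
x <- sample(factor(seq_len(n_levels)), size = 1462476, replace = TRUE)
true_props <- as.numeric(prop.table(table(x)))

# estimate the per-level proportions from 400 subsamples of 1k rows each
acc <- numeric(n_levels)
for (i in 1:400) {
    s <- sample(x, size = 1000)          # factor subsets keep all levels
    acc <- acc + as.numeric(prop.table(table(s)))
}
est_props <- acc / 400

# relative error of the estimates vs. the true proportions (over levels that appear)
keep <- true_props > 0
print(summary(abs(est_props - true_props)[keep] / true_props[keep]))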

tararae7 commented 4 years ago

Thank you.