OutlierDetectionJL / OutlierDetection.jl

Fast, scalable and flexible Outlier Detection with Julia
https://outlierdetectionjl.github.io/OutlierDetection.jl/dev/
MIT License

[BUG] MLJ pipelines do not work with outlier detectors #31

Open hpaldan opened 1 year ago

hpaldan commented 1 year ago

Describe the bug

I have a problem using UnsupervisedDetector models in a pipeline. I have tried two different simple linear pipelines, one with a Standardizer and LOFDetector and one with a Standardizer and IForestDetector. It seems the fit! function doesn't work properly on the detector models when they are in a pipeline: no training seems to take place, and when I try to transform new data with the machine I get an error message: "ERROR: MethodError: objects of type OutlierDetectionPython.IForestDetector are not callable"

To Reproduce

Hopefully the code example isn't too long.

using Pkg

Pkg.add("MLJ")
Pkg.add("OutlierDetection")
Pkg.add("DataFrames")
using MLJ
using OutlierDetection
using DataFrames

fake_dataframe = DataFrame(A=rand(100) .- 10 .* 10, B=rand(100) .+ 10 .* 10)

#Load models
LOF = @iload LOFDetector() pkg = OutlierDetectionNeighbors
IForest = @iload IForestDetector() pkg = OutlierDetectionPython

#Instantiate models
model_standardizer = Standardizer();
model_IForest = IForest();
model_LOF = LOF();

#Create pipelines
pipe_standardized_LOF = model_standardizer |> model_LOF
pipe_standardized_Iforest = model_standardizer |> model_IForest

#Create machines
mach_standardizer_LOF = machine(pipe_standardized_LOF,fake_dataframe)
mach_standardizer_Iforest = machine(pipe_standardized_Iforest,fake_dataframe)
mach_LOF = machine(model_LOF,fake_dataframe)

#fit machines
fit!(mach_standardizer_LOF);
fit!(mach_standardizer_Iforest);
fit!(mach_LOF);

#=
Here the transformation gives an error for pipelines but not for a single machine.
=#
fake_dataframe_1 = MLJ.transform(mach_standardizer_LOF,fake_dataframe)
fake_dataframe_2 = MLJ.transform(mach_standardizer_Iforest,fake_dataframe)
fake_dataframe_3 = MLJ.transform(mach_LOF,fake_dataframe)

# Try another unsupervised model to rule out the possibility that
# unsupervised models in general don't work in pipelines:

KMeans = @iload KMeans pkg=ParallelKMeans
model_KMeans = KMeans();
pipe_standardized_KMeans = model_standardizer |> model_KMeans
mach_standardizer_KMeans = machine(pipe_standardized_KMeans,fake_dataframe);
fit!(mach_standardizer_KMeans);

fake_dataframe_4 = MLJ.transform(mach_standardizer_KMeans,fake_dataframe)

Expected behavior

I expect the transform function to output an anomaly score from a machine that first standardizes the data and then applies some kind of detector model to it.

Additional context

I have tried the same thing (as in the code above) with other unsupervised models and it seems to work fine with them, so the problem is probably isolated to the OutlierDetection package. I've also tried a PCA model instead of a standardizer, together with an outlier detection model in a pipeline, and got the same problem.

Versions

Please run the following code snippet and paste the output here: from sktime import show_versions; show_versions() <--- I couldn't get this one to work at all, so I am sending the output of versioninfo instead.

From versioninfo:

Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS =

hpaldan commented 1 year ago

I totally didn't understand that the arrows were for comments. Rookie mistake.

hpaldan commented 1 year ago

I still had to make some minor fixes to my bad description.

davnn commented 1 year ago

Hey! The reason might be that pipelines only support

const SUPPORTED_TYPES_FOR_PIPELINES = [
    :Deterministic,
    :Probabilistic,
    :Interval,
    :Unsupervised,
    :Static]

models, but outlier detection algorithms are currently modeled as a separate entity (Annotator <: Model) in MLJ.

  1. I'm not sure if that's really the reason for the mentioned error
  2. I'm not sure if it would make sense to add support for annotators to pipelines because then we would also have to add support to a lot of other scattered places all over MLJ. I would prefer to subtype Detector directly from Unsupervised or Supervised, but that too would require some major changes.
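
One way to test the first point is to inspect the detector's type directly. The following is a minimal sketch, assuming that the abstract types Model and Unsupervised are reachable as MLJ.Model and MLJ.Unsupervised and that OutlierDetectionNeighbors is installed; type_chain is a hypothetical helper written for this illustration, not part of MLJ. It prints the results instead of assuming them.

```julia
using MLJ
using OutlierDetection

# Load the detector type (assumes OutlierDetectionNeighbors is installed)
LOF = @iload LOFDetector pkg=OutlierDetectionNeighbors
detector = LOF()

# Is the detector one of the model types that pipelines support?
is_unsupervised = detector isa MLJ.Unsupervised
is_model = detector isa MLJ.Model
println("isa Unsupervised: ", is_unsupervised)
println("isa Model: ", is_model)

# Hypothetical helper: collect a type and all of its supertypes up to Any
function type_chain(T::Type)
    chain = Type[T]
    while chain[end] !== Any
        push!(chain, supertype(chain[end]))
    end
    return chain
end

# Print where the detector sits in the type hierarchy
foreach(println, type_chain(typeof(detector)))
```

If the first line prints false while the second prints true, the detector is a model that falls outside the pipeline-supported types listed above, which would be consistent with the explanation.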

In the meantime, however, you could directly use the learning networks API to achieve your desired pipeline:

using MLJ
using OutlierDetection
using DataFrames

fake_dataframe = DataFrame(A=rand(100) .- 10 .* 10, B=rand(100) .+ 10 .* 10)

#Load models
LOF = @iload LOFDetector() pkg = OutlierDetectionNeighbors
IForest = @iload IForestDetector() pkg = OutlierDetectionPython

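#Learning networks
Xs = source(fake_dataframe)
Xstd = MLJ.transform(machine(Standardizer(), Xs), Xs)
#Note: despite their names, lof_mach and forest_mach are learning network nodes, not machines.
#Fitting a node trains every machine it depends on; calling it on data returns the scores.
lof_mach = MLJ.transform(machine(LOF(), Xstd), Xstd)
forest_mach = MLJ.transform(machine(IForest(), Xstd), Xstd)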

fit!(lof_mach)
lof_mach(fake_dataframe)

fit!(forest_mach)
forest_mach(fake_dataframe)
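
The original report was about transforming new data, so here is a hedged, self-contained sketch of the same learning network scored on rows that were not used for fitting. The names train, unseen and score_node are made up for this illustration, and only the LOF detector is used to avoid the Python dependency; it assumes OutlierDetectionNeighbors is installed.

```julia
using MLJ
using OutlierDetection
using DataFrames

# Load the detector type (assumes OutlierDetectionNeighbors is installed)
LOF = @iload LOFDetector pkg=OutlierDetectionNeighbors

# Illustrative data: 100 rows for fitting, 5 unseen rows for scoring
train = DataFrame(A=rand(100), B=rand(100))
unseen = DataFrame(A=rand(5), B=rand(5))

# Learning network: standardize, then score with LOF
Xs = source(train)
Xstd = MLJ.transform(machine(Standardizer(), Xs), Xs)
score_node = MLJ.transform(machine(LOF(), Xstd), Xstd)

# Fitting the final node trains the standardizer and the detector
fit!(score_node)

# Calling the node on new data applies both fitted steps in order
unseen_scores = score_node(unseen)
println(unseen_scores)
```

The unseen rows are standardized with the statistics learned from train before being scored, which is the behavior a linear pipeline would otherwise provide.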
hpaldan commented 1 year ago

All right, too bad that the fix would require that much work. Thank you for the fast reply and good guidance!