JuliaAI / MLJ.jl

A Julia machine learning framework
https://juliaai.github.io/MLJ.jl/

Discussion: Outlier Detection API in MLJ #780

Closed davnn closed 2 years ago

davnn commented 3 years ago

Hi everyone,

I would like to discuss how an outlier/anomaly detection API could look in MLJ.

Is your feature request related to a problem? Please describe.

I'm working on a package for outlier detection, OutlierDetection.jl, and its MLJ integration, but there are a couple of problems I stumbled into.

What do I mean when talking about Outlier Detection:

To differentiate between outlier detection and classification, I would rather consider (imbalanced) one-class classification a binary classification task. Binary classification looks quite straightforward to implement in MLJ, but working with scores is tricky. The most challenging aspect is accessing a fit result (the training scores) later on in a pipeline/network when deciding to convert outlier scores to binary labels.

Additionally, it should be possible to evaluate unsupervised (ranking) algorithms with evaluate, but I believe that should be relatively straightforward.

Describe the solution you'd like

Provide a beautiful API for outlier detection with MLJ.

Currently, transform/predict return train/test score tuples, which makes it possible to use learning networks / ensembles, but I would like to discuss whether that's a feasible API design, or whether something like a wrapped ensemble model makes more sense.

Let's see how the current API looks in practice with MLJ. KNN is an outlier detection model, Binarize transforms tuples of train/test scores to binary labels, X is a DataFrame, and train/test are row indices.

Basic usage:

# standardize the data
std = machine(Standardizer(), X)
fit!(std, rows=train)

# fit the detector on the standardized data
mach = machine(KNN(), transform(std, X))
fit!(mach, rows=train)

# train/test score tuple for the test rows, then binarized labels
scores = transform(mach, rows=test);
binarize = machine(Binarize())
transform(binarize, scores)

Linear pipeline:

pipe = @pipeline Standardizer() KNN(k=10) Binarize()
mach = machine(pipe, X);
fit!(mach, rows=train);
transform(mach, rows=test)

Learning networks:

nodemachine(model, X) = transform(machine(model, X), X)

Xs = source(X);
Xstd = nodemachine(Standardizer(), Xs)
scores = map(k -> nodemachine(KNN(k=k), Xstd), [5, 10, 15, 20, 25, 30])

network = transform(machine(Binarize()), scores...)
fit!(network, rows=train)
network(rows=test)

I would love to hear your feedback!

Describe alternatives you've considered

At first, I tried to use the fitted_params/report functionality to access the training scores (obtained when fitting the model), which works alright in toy examples with a single model, but doesn't work with learning networks / ensembles.

Additional context

Previous discussion about Anomaly detection integration: https://github.com/alan-turing-institute/MLJ.jl/issues/51

ablaom commented 3 years ago

@davnn Thanks indeed for looking into this. I should like to help out with this discussion but currently am on leave until 10th May. More then.

ablaom commented 3 years ago

Okay, OutlierDetection.jl looks like a substantial piece of work! We should definitely make a nice MLJ API to handle it.

Let me start by acknowledging a super effort here to wrap your head around the MLJ API. I realize this takes a bit of work, and I can see the evidence of this in your code.

I am not so familiar with OD so I will start with a couple of naive questions.

  1. In your approach you want to break up anomaly detection into two parts - a training step whose main purpose is to construct a scoring function, which can be applied to production data; and a second "static transformer" step that converts the scores of production data into outlier/inlier labels, but which also needs to see the scores of the training data (to construct a score histogram, needed to label on the basis of a specified outlier_fraction). Is this approximately correct?

  2. Given the extra complication presented by what I have italicised above, I wonder what the use case is for separating these two steps? Could you say more about this?

  3. Also, does it not make more sense for the first training step to take responsibility for constructing a probabilistic scoring function (on the basis of the observed training score histogram, for example) which would eliminate the need for the "static transformer" (binarizer) to have a complicated input. (Indeed, the existing MLJ.BinaryThresholdClassifier might then suffice).

  4. It appears you have constructed a good deal of infrastructure for evaluating these OD models. It might be good to explore the possibilities for re-using some of the existing MLJ apparatus for this - at the least the classification metrics, which include probabilistic metrics such as log_loss and brier_score, and growing. Have you already looked into this at all?

davnn commented 3 years ago
  1. In your approach you want to break up anomaly detection into two parts - a training step whose main purpose is to construct a scoring function, which can be applied to production data; and a second "static transformer" step that converts the scores of production data into outlier/inlier labels, but which also needs to see the scores of the training data (to construct a score histogram, needed to label on the basis of a specified outlier_fraction). Is this approximately correct?

Right, that's how I would differentiate outlier detection from rare class classification.

  1. Given the extra complication presented by what I have italicised above, I wonder what the use case is for separating these two steps? Could you say more about this?

Using scores enables interesting ensemble learning approaches. E.g. (1) being able to use different detectors on the same data and (2) being able to use different detectors on different data. I think it's best to show examples.

First prepare some example data.

using MLJ # or MLJBase
using OutlierDetection # installed from master
X, train, test = rand(1000, 10), 1:500, 501:1000;
nodemachine(model, X) = transform(machine(model, X), X)

Example for use case (1), multiple detectors, single data:

Xs = source(table(X));
scores_knn = map(k -> nodemachine(KNN(k=k), Xs), [5, 10, 15])
scores_lof = map(k -> nodemachine(LOF(k=k), Xs), [5, 10, 15])
network = transform(machine(Binarize()), scores_knn..., scores_lof...)
fit!(network, rows=train)
network(rows=test)

Example for use case (2) multiple detectors, multiple data:

Xs1 = source(table(X[:,1:5]))
Xs2 = source(table(X[:,6:10]))
scores_knn = map(k -> nodemachine(KNN(k=k), Xs1), [5, 10, 15])
scores_lof = map(k -> nodemachine(LOF(k=k), Xs2), [5, 10, 15])
network = transform(machine(Binarize()), scores_knn..., scores_lof...)
fit!(network, rows=train)
network(rows=test)

I would say both use cases are quite common, but as far as I know there are no libraries that support them nicely (in other programming languages either).

  1. Also, does it not make more sense for the first training step to take responsibility for constructing a probabilistic scoring function (on the basis of the observed training score histogram, for example) which would eliminate the need for the "static transformer" (binarizer) to have a complicated input. (Indeed, the existing MLJ.BinaryThresholdClassifier might then suffice).

That would certainly be a nice way to approach the problem. I have to read a bit more about the topic as there are many possible ways to create such a probabilistic scoring function, e.g. unify, all with their pros and cons.

Edit: What do you think about transform returning the raw scores and predict returning the probabilistic scores? Also, I just returned a UnivariateFinite from unsupervised predict and observed that predict_mode doesn't exist for unsupervised methods.

Edit2: Is BinaryThresholdClassifier still in MLJ?

  1. It appears you have constructed a good deal of infrastructure for evaluating these OD models. It might be good to explore the possibilities for re-using some of the existing MLJ apparatus for this - at the least the classification metrics, which include probabilistic metrics such as log_loss and brier_score, and growing. Have you already looked into this at all?

Not very deeply - I will definitely look into it more thoroughly. Generally, it appears that methods like evaluate or learning_curve are only designed for supervised methods. I don't plan to offer evaluation metrics in OutlierDetection.jl and will thus try to get it to work smoothly with MLJ's built-ins.

davnn commented 3 years ago

I experimented a bit with different strategies regarding scoring and evaluation.

Regarding (3): In general, I like the idea of using the first training step to take responsibility for constructing a probabilistic scoring function, but I’m not sure how to implement it properly. What I don’t like about that design currently:

  • You’d have to specify a normalization strategy and outlier threshold for each model in advance. Once you are working with learning networks you have a single threshold and (typically) a single normalization strategy for the network, meaning that you would have to override the individual settings and the API would differ between a single model and a network.
  • I’m not sure how you would get back to the raw scores, which are often useful because if you normalize on the training data, high outliers in prediction (normalized to a probability of 1) are not differentiated anymore from very high outliers.
  • Due to the fact that the scores represent a ranking and not a probability, the whole concept of a probabilistic scoring function feels a bit strange, because the decision if an instance is an inlier or outlier is a completely arbitrary threshold based on domain knowledge. This is also the reason why predict_proba only makes sense given a predefined outlier threshold.

Thus, I'm still not really sure how to proceed with this API.

Regarding (4): I investigated evaluation and everything works pretty well once you transform your unsupervised models into supervised surrogates.

ablaom commented 3 years ago

Thanks for these further explanations and for your continued patience.

I do think we are making progress and I have a few ideas about the API. However, allow me to press you on some details that I don't quite understand yet.

Regarding (3): In general, I like the idea of using the first training step to take responsibility for constructing a probabilistic scoring function, but I’m not sure how to implement it properly. What I don’t like about that design currently:

  • You’d have to specify a normalization strategy and outlier threshold for each model in advance. Once you are working with learning networks you have a single threshold and (typically) a single normalization strategy for the network, meaning that you would have to override the individual settings and the API would differ between a single model and a network.
  • I’m not sure how you would get back to the raw scores, which are often useful because if you normalize on the training data, high outliers in prediction (normalized to a probability of 1) are not differentiated anymore from very high outliers.
  • Due to the fact that the scores represent a ranking and not a probability, the whole concept of a probabilistic scoring function feels a bit strange, because the decision if an instance is an inlier or outlier is a completely arbitrary threshold based on domain knowledge. This is also the reason why predict_proba only makes sense given a predefined outlier threshold.

This is very helpful. I'm a little confused still why you need the threshold to construct the probability scoring rule. Indeed, in your vanilla implementation, it seems the threshold is not even part of the user input; rather the user specifies the fraction_train (i.e., the cut-off probability instead of the cut-off raw score), and for this purpose all you require is the histogram constructed from the raw training scores. Isn't this essentially what this code is doing?

If I'm missing the point here (quite likely) that's fine. If however this is easily explained, this would be helpful. More important is that I really understand your use cases for separating the raw-scoring from the classification. Is the idea in use case (1) that you want to pool the raw training scores for multiple detectors, and simultaneously generate vector-valued scores for each new observation - one element per detector? And that your Class object is designed to process not just scalar scores, but vector scores, according to some combine function?

Regarding just case (2) (multiple data) I'm not sure this can be immediately fit into the learning networks scheme, unless you begin with a step that splits some super data table X into the two pieces. Something to keep in mind is that while learning networks can be used in the raw, their main raison d'être is as blueprints for new stand-alone model types. So what kind of model has multiple data sets for the "input features"? Certainly nothing in the current design. Also, it seems that you want to output final classes (on production data) for just one of the datasets (one piece of the input data), which seems odd to me. Is this really what you want? Perhaps you can explain in words the overall task you are trying to realize here (as distinct from the implementation), separating the training part and the production data detection part?

Edit2: Is BinaryThresholdClassifier still in MLJ?

Very sorry, it's BinaryThresholdPredictor.

At first, I tried to use the fitted_params/report functionality to access training scores (achieved when fitting the model), which works alright in toy examples with a single model, but doesn't work with learning networks / ensembles.

Whether or not this becomes an interface we use, it is possible to make this work in the learning network context. The idea is that instead of report(mach) you need to do r = node(report, mach) to define a Node object r such that r() returns the machine's report. Ditto for fitted_params. (It would be nice to overload the methods to make this redundant, but type ambiguities make this impossible.) Here's a contrived example of training an EvoTreeRegressor on just the two best features, with the appropriate feature names extracted from the report's feature_importances key:

using MLJ

EvoTreeRegressor = @iload EvoTreeRegressor

# helper to extract the names of the 2 most important features from a
# `report`:
function important_features(report)
    importance_given_feature = report.feature_importances
    names = first.(sort(collect(importance_given_feature), by=last, rev=true))
    return Symbol.(names[1:2])
end

# model and dummy data:
rgs = EvoTreeRegressor()
X, y = @load_boston

# # LEARNING NETWORK

Xs = source(X)
ys = source(y)

# define a node returning best feature names:
mach = machine(rgs, Xs, ys)
r = node(report, mach)  # now `r()` returns a report
features = node(important_features, r)

# define a node returning the reduced data:
Xreduced = node((X, features) -> selectcols(X, features), Xs, features)

# build a new machine based on reduced data:
mach2 = machine(rgs, Xreduced, ys)

# define the prediction node for model trained on reduced data:
yhat = predict(mach2, Xreduced)
fit!(yhat)
yhat()

ablaom commented 3 years ago

Mmm. Now that I think about it, pooling scores from different detectors doesn't make sense. I'm pretty confused about what it is you want here. How do I interpret this signature: network = transform(machine(Binarize()), scores_knn..., scores_lof...)?

davnn commented 3 years ago

Hi again, sorry for the late reply, life is quite busy at the moment.

This is very helpful. I'm a little confused still why you need the threshold to construct the probability scoring rule. Indeed, in your vanilla implementation, it seems the threshold is not even part of the user input; rather the user specifies the fraction_train (i.e., the cut-off probability instead of the cut-off raw score), and for this purpose all you require is the histogram constructed from the raw training scores. Isn't this essentially what this code is doing?

You're right, you would only have to specify a normalization strategy in advance; the threshold would only be necessary when converting the scores to classes, and the mentioned BinaryThresholdPredictor would likely be a good fit. We would probably have to warn the user if they use predict_mode, or disable it, though, as a naive user might expect this to represent the inlier/outlier classes.

If I'm missing the point here (quite likely) that's fine. If however this is easily explained, this would be helpful. More important is that I really understand your use cases for separating the raw-scoring from the classification. Is the idea in use case (1) that you want to pool the raw training scores for multiple detectors, and simultaneously generate vector-valued scores for each new observation - one element per detector? And that your Class object is designed to process not just scalar scores, but vector scores, according to some combine function?

In the current API that's the case, yes. Class takes responsibility for converting one or more raw score vectors: it normalizes them, combines them through some function, e.g. mean, and applies the threshold to the final score vector. The Score class simply normalizes and combines one or more scores into a final vector of probabilities. If we push down the responsibility for creating a probabilistic scoring function to the individual detectors and use the BinaryThresholdPredictor, these classes would become unnecessary and only the combination of scores would require some helper function.
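To make the division of labour concrete, here is a rough sketch (not the actual Class/Score code; the function name and defaults are just placeholders) of what normalize-combine-threshold boils down to for already-normalized score vectors:

using Statistics: mean

# rough sketch, not the package implementation: combine several normalized score
# vectors (values in [0, 1]) observation-wise and threshold the result into labels
function combine_and_classify(score_vectors::AbstractVector{<:Real}...;
                              combine=mean, threshold=0.9)
    combined = [combine(scores) for scores in zip(score_vectors...)]
    return [s >= threshold ? "outlier" : "inlier" for s in combined]
end

combine_and_classify([0.1, 0.95], [0.2, 0.9])   # ["inlier", "outlier"]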

Regarding just case (2) (multiple data) I'm not sure this can be immediately fit into the learning networks scheme, unless you begin with a step that splits some super data table X into the two pieces. Some thing to keep in mind is that while learning networks can be used in the raw, their main raison d'etre is as blueprints for new stand-alone model types. So what kind of model has multiple data sets for the "input features"? Certainly nothing in the current design. Also, it seems that you want to output final classes (on production data) for just one of the datasets, (one piece of the input data) which seems odd to me. Is this really what you want? Perhaps you can explain in words the overall task you are trying to realize here (as distinct from the implementation) separating the training part and the production data detection part?

I think multiple data was the wrong formulation; in the end it's about splitting the features of a single dataset across multiple detectors, which admittedly is mainly useful in specific use cases. In the code example, I learn multiple KNN detectors on features 1 to 5, and multiple LOF detectors on features 6 to 10, so the output should be for the whole dataset, not one feature subset. I think this feature, however, does not need any special API, as it appears to be working quite well already.

Whether or not this becomes an interface we use, it is possible to make this work in the learning network context. The idea is that instead of report(mach) you need to do r = node(report, mach) to define a Node object r such that r() returns the machine's report. Ditto for fitted_params. (It would be nice to overload the methods to make this redundant, but type ambiguities make this impossible.) Here's a contrived example of training an EvoTreeRegressor on just the two best features, with the appropriate feature names extracted from the report's feature_importances key:

Very interesting, thanks for explaining how this works! I will have a look if this could improve the API.


I think the main open point is how to implement the probabilistic normalization such that it works for individual models and for learning networks consisting of multiple combined models. I did a proof of concept enriching each detector with a normalize::Function field, which receives the training scores and test scores and by default uses min-max normalization. However, if you want to change all your normalization functions in a learning network it's repetitive to set them for all detectors individually, but I guess that's not too problematic.
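For illustration, here is a minimal sketch of such a default (the function name is mine, not the package's): min-max normalization fitted on the training scores and applied to both score vectors, which also illustrates the saturation problem mentioned below:

# sketch of a default normalize function: min-max scaling based on the training
# scores, clamped to [0, 1]; every test score above the training maximum maps to 1.0
function minmax_normalize(scores_train::AbstractVector{<:Real},
                          scores_test::AbstractVector{<:Real})
    lo, hi = extrema(scores_train)
    scale(s) = clamp((s - lo) / (hi - lo), 0, 1)
    return scale.(scores_train), scale.(scores_test)
end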

One more open point from before is how one could access the raw scores:

I’m not sure how you would get back to the raw scores, which are often useful because if you normalize on the training data, high outliers in prediction (normalized to a probability of 1) are not differentiated anymore from very high outliers.

ablaom commented 3 years ago

@davnn Thanks again for your patience. I think I understand the goals sufficiently now and a design is coalescing in my mind. Before getting your feedback, I am playing around with a POC to make sure of some details. I hope to get back to you soon.

davnn commented 3 years ago

@davnn Thanks again for your patience. I think I understand the goals sufficiently now and a design is coalescing in my mind. Before getting your feedback, I am playing around with a POC to make sure of some details. I hope to get back to you soon.

Thank you! FYI: I'm currently working on what I think is a nice API for MLJ; however, I think this API is mainly suited to single-model use cases, not the more flexible use cases described above. The design is as follows (for both supervised and unsupervised models):

Evaluation is still a bit problematic as you have to write a wrapper to turn unsupervised models into supervised surrogates (not possible e.g. for a library only depending on the model interface). Something like

function MLJ.evaluate(model::T, X, y, measure; args...) where {T <: Unsupervised}
    ptype = prediction_type(measure)
    @assert ptype in (:probabilistic, :deterministic)
    Xs, ys = source(X), source(y)
    # transform unsupervised model to supervised surrogate
    mach = ptype == :probabilistic ?
        machine(Probabilistic(), Xs, ys, predict=predict(machine(model, Xs), Xs)) :
        machine(Deterministic(), Xs, ys, predict=predict_mode(machine(model, Xs), Xs))
    evaluate!(mach; measure=measure, args...)
end
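
A hypothetical usage of this wrapper (assuming KNN() is a probabilistic detector and y holds ground-truth outlier/inlier labels used only for scoring):

evaluate(KNN(), X, y, auc; resampling=CV(nfolds=5))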

Unfortunately for the use cases above with multiple models, this API is not nice to work with, and would likely require some ensemble / meta model wrappers.

EDIT: If you would like to check the API out, I pushed the changes to the normalization-extended branch. https://github.com/davnn/OutlierDetection.jl/tree/normalization-extended

ablaom commented 3 years ago

Okay, the POC is taking a little longer, so in the meantime here is a sketch of my proposal. I will continue my work on the POC after getting your feedback. If you are able to provide this soon, I should be able to push the POC along soon also.

There are three types of model objects I think we should cater for in a design:

I propose introducing two new types:

Bare detectors

BareDetector <: Model

The fit signature

I recommend that we combine the unsupervised/supervised/semi-supervised cases by always demanding that data X be paired with outlier/inlier labels y, where y is a two-class ordered categorical vector (element scitype OrderedFactor{2}) with the same number of rows, but which is allowed to have any number (even zero) of missing values. This has three advantages:

The downside may be a bit of confusion on the part of users: "Why do I need to specify y for unsupervised detection?" This could possibly be mitigated by helpful error messages and a convenient shortcut for categorical(missings(String, 100), levels=["outlier", "inlier"], ordered=true)?
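For illustration, such a shortcut might look something like this (the function name fake_labels is hypothetical):

using CategoricalArrays, Missings

# hypothetical convenience constructor: an ordered two-class target in which
# every entry is missing
fake_labels(n) = categorical(missings(String, n), levels=["outlier", "inlier"], ordered=true)

y = fake_labels(100)    # pass alongside X to fit, even in the unsupervised case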

methods

traits

Integrated detectors

Detector <: Model

These have the same fit signature as bare detectors

methods

There is a question of whether to allow integrated detectors that can only predict the class (no probability). Then I would probably suggest two types, ProbabilisticDetector and DeterministicDetector. Are there use cases for this?

traits

I've thought about viewing Detector as a subtype of Probabilistic (one with a hard-wired target_scitype), but this implies they are also Supervised, which feels weird. It would reduce the API extensions.


A detail I will note in case it is relevant to your own experiments: the training_scores should be wrapped to prevent them being subsampled in evaluation. By default, any table, matrix or vector gets subsampled by the evaluation apparatus; anything else is left alone.
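A minimal sketch of the kind of wrapper meant here (the struct name is an assumption):

# anything that is not a table, matrix or vector is left alone by the resampling
# machinery, so wrapping the training scores in a plain struct protects them
struct TrainingScores{T<:AbstractVector{<:Real}}
    scores::T
end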

davnn commented 3 years ago

First of all, I'm very grateful for your input @ablaom, thank you!

There are three types of model objects I think we should cater for in a design:

I would merge the bare detectors and integrated detectors by adding a transform method to the integrated detectors that returns the raw scores (see https://github.com/alan-turing-institute/MLJ.jl/issues/780#issuecomment-893186557). Are there other reasons for the split that I did not think of?

predict(model, fitresult, Xnew) always returns a UnivariateFiniteVector of probabilities, using the labels inferred from the data y passed to fit

I'm not sure about this. Do you propose to use the order (OrderedFactor{2}) to determine which of the labels is inlier/outlier, e.g. the first has to be inlier and the last outlier? I'm not sure how user-friendly that would be. Also, it feels unexpected to use OrderedFactor for something that's not ordered.

There is a question of whether to allow integrated detectors that can only predict the class (no probability).

I would start with probabilistic only as I'm not aware of a model that doesn't fit into this schema.

target_scitype is hard-wired to be OrderedFactor{2}

Does the target scitype always refer to the output of predict? It would then be AbstractVector{<:Finite}, right?

I'm actually quite happy with the following API (check out the normalization-extended branch if you would like to try it out):

using MLJ
using OutlierDetection # installed from normalization-extended branch
X, train, test = rand(1000, 10), 1:500, 501:1000;
Xs = source(table(X));

# this helper function should probably be provided
scores(mach) = node((mach, Xs) -> (report(mach).scores, transform(mach, Xs)), mach, Xs)

# basic use case (integrated detectors)
mach = machine(KNN(), X)
fit!(mach, rows=train)
transform(mach, rows=test)      # raw scores
predict(mach, rows=test)        # normalized scores / probas
predict_mode(mach, rows=test)   # labels

# (1), multiple detectors, single data:
machines_knn = map(k -> machine(KNN(k=k), Xs), [5, 10, 15])
machines_lof = map(k -> machine(LOF(k=k), Xs), [5, 10, 15])
knn_scores, lof_scores = scores.(machines_knn), scores.(machines_lof)
network = transform(machine(Scores()), knn_scores..., lof_scores...)
fit!(network, rows=train)
network(rows=test)

# (2) multiple detectors, multiple data:
Xs1 = source(table(X[:,1:5]))
Xs2 = source(table(X[:,6:10]))
machines_knn = map(k -> machine(KNN(k=k), Xs1), [5, 10, 15])
machines_lof = map(k -> machine(LOF(k=k), Xs2), [5, 10, 15])
# inline the `scores` helper per feature subset, as each has its own source
knn_scores = [node((m, x) -> (report(m).scores, transform(m, x)), m, Xs1) for m in machines_knn]
lof_scores = [node((m, x) -> (report(m).scores, transform(m, x)), m, Xs2) for m in machines_lof]
network = transform(machine(Scores()), knn_scores..., lof_scores...)
fit!(network, rows=train)
network(rows=test)

Remaining "warts" that come to my mind:

By the way, I'm on vacation the next two weeks.

davnn commented 3 years ago

Hi @ablaom,

Sorry for my rushed previous answer; I wanted to make the proposal before my vacation. I just read https://alan-turing-institute.github.io/MLJ.jl/dev/working_with_categorical_data/ and now understand your idea behind using OrderedFactor{2} as a target_scitype, as this seems to be the way to do things in MLJ.

Regarding the output_scitype you wrote

output_scitype is the type of the scores

Would you fix the type of the raw scores to Float64 or keep them at any <:Real? In all cases, output_scitype of a Detector.transform would be <:MMI.Continuous? For the static helpers, the output_scitype would be MMI.OrderedFactor{2} for the case of predicting labels, right? What would the output_scitype of predicting UnivariateFiniteVector{OrderedFactor{2}} be?

Edit: Regarding the static helpers; it doesn't seem possible to infer the levels because they only access the raw score vectors. Do you think it would be problematic to fix the levels for OutlierDetection and have the users transform their data to fit?


Regarding "treating unsupervised models as probabilistic", I have to experiment with this approach in the next days.

ablaom commented 3 years ago

@davnn Thanks for the update. I am close to finishing a POC which should accommodate the evaluation business without wrapping detectors as supervised models. I am now convinced that: (i) detectors should get their own new abstract types, and (ii) normalised scoring should be viewed as just a special case of raw score prediction, rather than "optional" functionality with a distinct interface (for composability). I hope you will be patient a little longer while I finish this more detailed proposal, which I am confident will meet all your use-cases and be extensible.

ablaom commented 3 years ago

Hi @davnn,

Okay, I have finished my POC. Thanks for all the feedback, the new branch, and the script. This has been very helpful!

Please ignore part (ii) of my previous comment. I think we are nearly on the same page now.

Quick overview of the proposal implemented by the POC.

To give you a quick flavor of some of this, here is your excellent learning network (1) re-factored to mesh with the proposal:

# data:
X, train, test = rand(1000, 10), 1:500, 501:1000;

# detectors:
knn_detectors = [KNN(k=k) for k in [5, 10, 15]]
lof_detectors = [LOF(k=k) for k in [5, 10, 15]]
detectors = vcat(knn_detectors, lof_detectors)

# network:
Xs = source(table(X));
machs = map(d -> machine(d, Xs), detectors)
augmented_scores = map(m -> augmented_transform(m, Xs), machs)
score_transformer = OutlierDetection.Scores()   # modified version! see Note below
augmented_probs = transform(machine(score_transformer), augmented_scores...)
probs = last(augmented_probs)
fit!(probs, rows=train)

julia> probs(rows=test)
500-element UnivariateFiniteVector{OrderedFactor{2}, String, UInt8, Float64}:
  UnivariateFinite{OrderedFactor{2}}(inlier=>0.153, outlier=>0.847)
  UnivariateFinite{OrderedFactor{2}}(inlier=>0.147, outlier=>0.853)
  UnivariateFinite{OrderedFactor{2}}(inlier=>0.235, outlier=>0.765)
  UnivariateFinite{OrderedFactor{2}}(inlier=>0.513, outlier=>0.487)
  UnivariateFinite{OrderedFactor{2}}(inlier=>0.35, outlier=>0.65)

You won't be able to run this without KNN and LOF buying into the enhanced API, but I have tested this with dummy detectors that do.

Trying the proposal out

You can see how the wrappers ProbabilisticDetector and BinaryThresholdPredictor are to be used here, which also demonstrates evaluation. The first wrapper is implemented by essentially generalizing the above learning network and exporting as a stand-alone model type.

You can see an example of implementing the basic UnsupervisedDetector API here.

Note. To make my wrapper work, I needed to make one hack to your code that has the Scores transformer return augmented probabilities, not just the test ones. I think this is pretty natural. Also in that hack, I redefine your CLASS_NORMAL, CLASS_OUTLIER to have values defined in MLJModelInterface.

Points for further discussion

I do hope this proposal can work for you with little modification, as I don't really have a lot more time to spend on this, I'm afraid. Still, I value your feedback and there are still smaller details to be decided on.

1. Although the new interface does not require OutlierDetection.jl to supply "bare" versions of the detectors (supertypes UnsupervisedDetector, SupervisedDetector), the only reason I can see for not doing so is to save the user having to wrap using ProbabilisticDetector/BinaryThresholdPredictor. (And, in a more flexible compositional design, one expects there to be some price to pay in terms of convenience, no?) Here's an example of how forcing numerical scoring models to also provide probabilistic/classifying functionality is, to me, a bit weird: Suppose I want to combine two detectors using the combine/normalize apparatus. Then the composite model will have a parameter normalize, but so will the component detectors. But these nested normalize hyperparameters are essentially dead. They have no effect on the behavior of the composite, and so are confusing:

julia> ProbabilisticDetector(knn=KNN(), lof=LOF())
ProbabilisticUnsupervisedDetector(
    normalize = OutlierDetection.normalize,
    combine = OutlierDetection.combine_mean,
    knn = KNN(
              k = 5,
              metric = Distances.Euclidean(0.0),
              algorithm = :kdtree,
              leafsize = 10,
              reorder = true,
              parallel = false,
              reduction = :maximum,
              normalize = OutlierDetection.normalize, <--- dead
              classify = OutlierDetection.classify,   <--- dead
              threshold = 0.9)
    lof = LOF(
              k = 5,
              metric = Distances.Euclidean(0.0),
              algorithm = :kdtree,
              leafsize = 10,
              reorder = true,
              parallel = false,
              normalize = OutlierDetection.normalize, <--- dead
              classify = OutlierDetection.classify,   <--- dead
              threshold = 0.9)) @163

I suppose you could keep the strapped-on functionality in those models for use independently of MLJ (and suppress the extra parameters from the MLJ-facing structs) but that seems a bit drastic seeing how well integrated we would otherwise be.

2. Pipelines. Given the possibilities provided by the wrappers, do you think there's going to be a big use case for including detectors in linear pipelines? In any case, this is probably going to have to wait for https://github.com/alan-turing-institute/MLJ.jl/issues/594

3. Where should code go? Assuming you are happy with the design, we may want to put all the "meta-functionality" in a separate repo, so others can re-use it. This would include both your Scores transformer and the new ProbabilisticDetector wrapper, which depends on Scores and MLJBase (because it is implemented using learning networks). Any objections to this idea? (I expect your plan is to exclude MLJBase as a dependency of OutlierDetection.jl, right?)

4. Do you have objections to the changes to the Scores transformer output indicated above?

5. Since we are essentially introducing a new task, we could easily rename transform everywhere above as score. I'm inclined to do that, but what do you think? We would retain the name predict to emphasize the commonality with Supervised model evaluation.

6. What about transformers that drop observations in data (or substitute with missing values) according to the output of a deterministic detector?

davnn commented 3 years ago

You can see how the wrappers ProbabilisticDetector and BinaryThresholdPredictor are to be used here,

It looks like the repo doesn't exist yet?

ablaom commented 3 years ago

My apologies. Private repo. I've invited you.

davnn commented 3 years ago

My apologies. Private repo. I've invited you.

Sorry but I didn't get an invite

ablaom commented 3 years ago

Oh bother. An unfortunate slip on my part. There is a davnnn as well. You should have it now, and I have copied the invite to someone with your name in the julia slack channel.

davnn commented 3 years ago

Thanks a lot for your input! I'm still working on it. How would you split the packages for outlier detection models? Each model in an individual package? Bundle packages with a similar "backend", e.g. OutlierDetectionNeighbors, OutlierDetectionFlux, OutlierDetectionPy...? Or keep all models in one repo?

ablaom commented 3 years ago

How would you split the packages for outlier detection models?

Of course that's up to you. I've spent a lot of time breaking MLJ functionality up into different packages and wish I had started out more modular. From the point of view of the MLJ user, splitting them won't make them less discoverable, so long as all the packages are "registered" with MLJ. The user will be able to do, e.g., models(m -> m.abstract_type <: UnsupervisedDetector) to find all unsupervised detectors, and the names of the packages that provide them (and so where they might look for documentation, which MLJ does not typically provide). It does sound like a good idea to split the models according to backends. For example, I'd rather not load Zygote if I don't need it. Further fragmentation sounds unnecessary, but you are in a better position than I to decide this.

As I say, I would put any meta-models (eg, the Probabilistic wrapper I'm proposing) in their own package (MLJOutlierDetection). And it would be great if you would be happy to maintain this over the longer term. If you're happy with the proposal, I can set MLJOutlierDetection up as a package providing the wrapper if you need me to.

davnn commented 3 years ago
  • transform is for returning numerical scores.

👍

  • We add a new compulsory augmented_transform method, which augments scores with training scores (eliminates previously awkward interface point, and inspired by your scores helper function).

I don't think it's necessary to add this new method. If we standardize the report key (e.g. .scores), we can simply use scores(mach) = node((mach, Xs) -> (report(mach).scores, transform(mach, Xs)), mach, Xs).

  • Additionally, training scores should be part of the fitresult to expose them in composite detector types.

👍

  • Models implementing the above have type UnsupervisedDetector or SupervisedDetector.

👍

  • predict, if implemented, returns either probabilistic scores (subtype AbstractProbabilisticUnsupervised or AbstractProbabilisticSupervised) or actual "inlier"/"outlier" labels (subtype AbstractDeterministicUnsupervised or AbstractDeterministicSupervised). The new abstract types are subtypes of either UnsupervisedDetector or SupervisedDetector above.

👍

  • We provide a wrapper ProbabilisticDetector(detector(s), combine=..., normalize=...) for constructing a probabilistic detector from one or more generic detectors.

👍

  • We use BinaryThresholdPredictor from MLJModels to convert probabilistic detectors to deterministic ones

👍 Additionally we provide a DeterministicDetector(detector(s), combine=..., normalize=..., classify=....) to unify/simplify the API.

  • A deterministic detector can be evaluated using MLJ deterministic measures, such as accuracy, while any probabilistic detector can be evaluated using MLJ probabilistic measures, such as auc and log_loss ~(or, a mixture of such measures, after this pending PR).~ or a mixture of probabilistic and deterministic measures (where a deterministic measure implies a threshold of 0.5).

👍

1. Although the new interface does not require OutlierDetection.jl to supply "bare" versions of the detectors (supertypes UnsupervisedDetector, SupervisedDetector), the only reason I can see for not doing so is to save the user having to wrap using ProbabilisticDetector/BinaryThresholdPredictor. (And, in a more flexible compositional design, one expects there to be some price to pay in terms of convenience, no?) Here's an example of how forcing numerical scoring models to also provide probabilistic/classifying functionality is, to me, a bit weird: Suppose I want to combine two detectors using the combine/normalize apparatus. Then the composite model will have a parameter normalize, but so will the component detectors. But these nested normalize hyperparameters are essentially dead. They have no effect on the behavior of the composite, and so are confusing:

I suppose you could keep the strapped-on functionality in those models for use independently of MLJ (and suppress the extra parameters from the MLJ-facing structs) but that seems a bit drastic seeing how well integrated we would otherwise be.

👍 Removed the hyperparameter again.

2. Pipelines. Given the possibilities provided by the wrappers, do you think there's going to be a big use case for including detectors in linear pipelines? In any case, this is probably going to have to wait for #594

Don't really know to be honest.

3. Where should code go? Assuming you are happy with the design, we may want to put all the "meta-functionality" in a separate repo, so others can re-use it. This would include both your Scores transformer and the new ProbabilisticDetector wrapper, which depends on Scores and MLJBase (because it is implemented using learning networks). Any objections to this idea? (I expect your plan is to exclude MLJBase as a dependency of OutlierDetection.jl, right?)

I explain the repo structure below.

4. Do you have objections to the changes to the Scores transformer output indicated above?

I would return only the test scores (and not add the augmented_transform, as described above).

5. Since we are essentially introducing a new task, we could easily rename transform everywhere above as score. I'm inclined to do that, but what do you think? We would retain the name predict to emphasize the commonality with Supervised model evaluation.

I don't think a special name for scoring is necessary if it does not provide any additional functionality compared to transform/predict.

6. What about transformers that drop observations in data (or substitute with missing values) according to the output of a deterministic detector?

Not sure yet.


Explanation of changes

I have split OutlierDetection.jl into multiple repos. OutlierDetection.jl now only includes some helper functions and the probabilistic and deterministic wrappers. We could move that package to MLJOutlierDetection.jl, but I'm not sure about it as it contains all the docs for the sub-repos (as in MLJ).

OutlierDetectionInterface.jl defines a unified interface for outlier detection algorithms in the OutlierDetectionJL org, which simply re-exports the detector definitions from MMI, defines default class names, a unified data front-end, and default types to work with. Of course one could also define a detector directly with MMI.

Basic testing and benchmarking now lives in OutlierDetectionTest.jl and OutlierDetectionBenchmark.jl. All the algorithms live in their corresponding "backend" repository, e.g. OutlierDetectionNeighbors.jl.

I have created 3 pull requests from your proposal:

  1. MLJModelInterface https://github.com/JuliaAI/MLJModelInterface.jl/pull/113
  2. MLJBase https://github.com/JuliaAI/MLJBase.jl/pull/628
  3. MLJModels https://github.com/JuliaAI/MLJModels.jl/pull/399

The only missing pull request is the addition of the current outlier detection models into the MLJ registry. I have already tested it locally and @load-ing the models worked fine.

Let me know what you think about that structure!

ablaom commented 3 years ago

I've not looked at this in detail but at first glance this all looks good and I'll get to it soon.

I don't think it's necessary to add this new method. If we standardize the report key (e.g. .scores), we can simply use scores(mach) = node((mach, Xs) -> (report(mach).scores, transform(mach, Xs)), mach, Xs).

I'm a bit surprised you don't like this. I thought augmented_transform was a nicer alternative to having a helper (but your helper did give me the idea for it). The problem with the helper is it does require a bit of an explanation and you kind of have to explain why it's needed. On the other hand, the augmented_transform is immediately explained, and with first and last overloaded for nodes (which is totally natural), the learning network syntax is straightforward. I think it's good to have a method for accessing the raw scores, rather than just burying them in fitted_params, as they're kind of central to composition, as you ultimately convinced me!

Could you say what you don't like about augmented_transform ?

davnn commented 3 years ago

I just thought it's not necessary to add a new method to the API (just because it's always easier to add than to remove something from the API imho). Generally, I like the idea of an augmented_transform, but I think then there should maybe also be an augmented_predict, augmented_predict_mode and so forth, which apply the corresponding transformation / prediction to the training and test data. Regarding the current proposal, it's a trivial change to use augmented_transform instead of the helper, so I think we can continue until we are sure about how to best proceed with this.

Edit: You could even define a default implementation that caches the training data prediction results for those methods if the result is not available directly from fit.
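For example, a possible default for augmented_transform when the training scores are already exposed under a standardized report key (a sketch under that assumption):

augmented_transform(mach, Xnew) = (report(mach).scores, transform(mach, Xnew))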

ablaom commented 3 years ago

I just thought it's not necessary to add a new method to the API (just because it's always easier to add than to remove something from the API imho).

Fair point.

so I think we can continue until we are sure about how to best proceed with this.

okay, let's put this off for now

davnn commented 3 years ago

Hi @ablaom,

Once all packages are registered, I will add the models to the MLJ registry. Currently, the detectors are named KNNDetector, LOFDetector, ... The Python versions are named PyKNNDetector, PyLOFDetector, ... It would probably be more idiomatic to drop the Py prefix, right?

Best David

ablaom commented 3 years ago

once all packages are registered, I will add the models to the MLJ registry.

Yes, registering the models would be the next step. You're welcome to have a go, but I usually do this, as it is still a bit of a quirky operation which sometimes requires compatibility updates elsewhere in the eco-system.

The Python Versions are named PyKNNDetector, PyLOFDetector,... it would probably be more idiomatic to drop the Py-prefix, right?

Right. Make sure you have the is_pure_julia trait set to true for the pure Julia models, and users can distinguish the Python ones in model queries.

Have you thought about where the ProbabilisticDetector wrapper should live? I was thinking MLJOutlierDetection, since any outlier detection model satisfying our new MLJModelInterface API would be supported. The code is already there but under /examples/. The Supervised version is not written out and tests are needed. Is this something you are interested in fleshing out?

davnn commented 3 years ago

Yes, registering the models would be the next step. You're welcome to have a go, but I usually do this, as it is still a bit of a quirky operation which sometimes requires compatibility updates elsewhere in the eco-system.

alright, I tested it locally (only the OD packages) and it lgtm.

Right. Make sure you have the is_pure_julia trait set to true for the pure Julia models, and users can distinguish the Python ones in model queries.

👍

Have you thought about where the ProbabilisticDetector wrapper should live? I was thinking MLJOutlierDetection, since any outlier detection model satisfying our new MLJModelInterface API would be supported. The code is already there but under /examples/. The Supervised version is not written out and tests are needed. Is this something you are interested in fleshing out?

Currently, the code resides in OutlierDetection.jl, which can be seen as the main entry point to the OutlierDetectionJL org, which also bundles all the docs from the sub-packages (as MLJ does). The package contains the wrappers and helper functions (normalization, combination and classification approaches). I'm not sure yet if it makes sense to put this package outside of the OutlierDetectionJL org, but let me know what you think.

Edit: I would also not have registered the wrappers as they are only really useful when importing OutlierDetection.jl and the corresponding helper functions.

ablaom commented 3 years ago

alright, I tested it locally (only the OD packages) and it lgtm.

Okay, then do make a PR to MLJModels and I will take a look!

Currently, the code resides in OutlierDetection.jl, which can be seen as the main entry point to the OutlierDetectionJL org,

All good. I hadn't realised you had already found a home. Actually quite happy for another org to take responsibility for the code. :-)

Edit: I would also not have registered the wrappers as they are only really useful when importing OutlierDetection.jl and the corresponding helper functions.

Makes sense. If you do have wrappers in registered packages, just set MMI.is_wrapper(::Type{<:MyWrapper}) = true and it won't appear in ordinary model searches, which would be confusing.

davnn commented 3 years ago

Hi @ablaom,

I have now been working with the API for a bit, and one major shortcoming appears to be that you need to re-learn the models each time you want a different output. For example:

using MLJ, OutlierDetection
X = rand(10, 100)

KNN = @iload KNNDetector pkg=OutlierDetectionNeighbors
knn = KNN()
knn_raw = machine(knn, X) |> fit!
knn_probabilistic = machine(ProbabilisticDetector(knn), X) |> fit!
knn_deterministic = machine(DeterministicDetector(knn), X) |> fit!

would learn the exact same model three times with different outputs. Do you see any way of improving on this?

ablaom commented 3 years ago

Well, a model can implement more than one operation, eg, predict and transform. So, a partial solution would be to have transform return raw scores in the case of the wrappers (as it does for the raw detectors, yes?). To export multiple operation nodes from a learning network you do something like

function MMI.fit(...)
    ...
    network_mach = machine(ProbabilisticUnsupervisedDetector(), Xs, predict=X_probs, transform=X_raw)
    return!(network_mach, model, verbosity)
end

where X_raw is the node delivering raw predictions.

In the case of the DeterministicDetector (I guess this is essentially the BinaryThresholdPredictor?) we could add a new operation predict_probs to predict the probabilities??

davnn commented 3 years ago

Well, a model can implement more than one operation, eg, predict and transform. So, a partial solution would be to have transform return raw scores in the case of the wrappers (as it does for the raw detectors, yes?). To export multiple operation nodes from a learning network you do something like

I know that this is possible, but that doesn't improve the situation imho. In reality the score transformations are static operations, which should not be bound to the training of the model. I have to say that the first implementation, with the ugly tuple transform output consisting of train/test scores, was the most composable API.

Maybe we should add the proposed augmented_transform and default to it in linear pipelines and the network pipe syntax (if it still exists). The output type of the pipeline/network would depend on the last node (e.g. a static probabilistic transformation would lead to :probabilistic). It would also be necessary for such a pipeline/network to be evaluatable. This would also enable us to be more integrated with MLJ and more easily use other transformations, e.g. Standardizer().

Edit: From MLJ's viewpoint I would not implement the detector-specific changes and like to keep the API as general as possible, such that it fits detectors naturally; see the proposed type hierarchy mentioned below.

By the way, the ProbabilisticDetector would become a CompositeDetector, which combines multiple detectors but produces raw scores and also implements augmented_transform and thus would also be usable in a pipeline. It's already implemented, will add a link once pushed.

I'm happy to help adding the necessary features to make things work.

In the case of the DeterministicDetector (I guess this is essentially the BinaryThresholdPredictor?) we could add a new operation predict_probs to predict the probabilities??

Yes, with helpers to identify the threshold, which you don't know without looking at your training results. The DeterministicDetector would not be necessary if we manage the above API changes; it would become a deterministic static transformation.

Edit: In a more general setting, the type hierarchy could look something like: https://github.com/JuliaAI/MLJBase.jl/issues/656#issuecomment-934181891

The downside of such a change would be that we lose custom functionality for detectors, there is no default augmented_transform, and we would have to live with transform returning the augmented results all the time. From my experiments with this kind of API, the usability and composability are great and I wouldn't mind such a transform result. I think such a change would greatly improve the generality of MLJ as a tool for various machine learning tasks.

ablaom commented 3 years ago

Edit: From MLJ's viewpoint I would not implement the detector-specific changes and like to keep the API as general as possible, such that it fits detectors naturally; see the proposed type hierarchy mentioned below.

That might be so, but for now at least, I prefer to avoid breaking changes to MLJModelInterface for the near future. See my comments at your suggestion. I suggest you work the best you can with the status quo.

I'm a bit confused by your comments, possibly because of the edits. Could you more succinctly summarize the proposed changes (consistent with current API)?

davnn commented 3 years ago

That might be so, but for now at least, I prefer to avoid breaking changes to MLJModelInterface for the near future. See my comments at your suggestion. I suggest you work the best you can with the status quo.

Alright, I think we can close this issue now as we have found an initial, workable API.

I'm a bit confused by your comments, possibly because of the edits. Could you more succinctly summarize the proposed changes (consistent with current API)?

I will open another issue to discuss possible future MLJ changes in more detail.

ablaom commented 3 years ago

Re-opening so we can track some outstanding doc issues:

ablaom commented 2 years ago

@davnn I wonder if you would be able to help me complete the above checklist, especially the last point. I think enough time has passed to consider the API as "stable for now", yes?

The essential points are to explain the contract a model enters into when it implements a subtype of one of the following six abstract types:

abstract type UnsupervisedDetector <: UnsupervisedAnnotator end
abstract type SupervisedDetector <: SupervisedAnnotator end

abstract type ProbabilisticSupervisedDetector <: SupervisedDetector end
abstract type ProbabilisticUnsupervisedDetector <: UnsupervisedDetector end

abstract type DeterministicSupervisedDetector <: SupervisedDetector end
abstract type DeterministicUnsupervisedDetector <: UnsupervisedDetector end

I am increasingly hazy about some of the final details.
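
For concreteness, here is roughly what I imagine the unsupervised contract to look like (a sketch only; the fit signature, report key and score placeholders are my assumptions, not settled API):

import MLJModelInterface as MMI

mutable struct MyDetector <: MMI.ProbabilisticUnsupervisedDetector
    k::Int
end

function MMI.fit(model::MyDetector, verbosity, X)
    scores_train = rand(MMI.nrows(X))       # stand-in for real raw training scores
    fitresult = (scores = scores_train,)    # keep training scores for composition
    cache, report = nothing, (scores = scores_train,)
    return fitresult, cache, report
end

# transform returns raw test scores; predict would return UnivariateFinite probabilities
MMI.transform(model::MyDetector, fitresult, Xnew) = rand(MMI.nrows(Xnew))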

davnn commented 2 years ago

@davnn I wonder if you would be able to help me complete the above checklist, especially the last point. I think enough time has passed to consider the API as "stable for now", yes?

Sorry for being a bit hesitant to implement the changes in the docs. I'm simply not sure we can consider the API "stable for now", as it is, in my opinion, closely related to other open issues, especially the clustering API. Once we want to evaluate clustering algorithms, or implement semi-supervised clusterers, we would basically have to define

abstract type UnsupervisedClusterer <: UnsupervisedAnnotator end
abstract type SupervisedClusterer <: SupervisedAnnotator end

abstract type ProbabilisticSupervisedClusterer <: SupervisedClusterer end
abstract type ProbabilisticUnsupervisedClusterer <: UnsupervisedClusterer end

abstract type DeterministicSupervisedClusterer <: SupervisedClusterer end
abstract type DeterministicUnsupervisedClusterer <: UnsupervisedClusterer end

I think we agree that this is not the way to go. Would you say a clustering API is out of scope for 1.0?

ablaom commented 2 years ago

@davnn I think we are on the same page in terms of future directions, just not on the same page in terms of timeframe. Moving from types to a more flexible traits-only system requires a breaking change to MLJModelInterface, on which about two dozen repositories depend; many of these will be broken by the changes.

Yes, I am proposing to add documentation we will need to change later, but I expect this is normal for a project as complex as MLJ. We are building a car. But we are also driving it and writing the manual at the same time.

In any case, I would be happy to mark the Outlier Detection section of the manual as "Experimental", as we have done elsewhere.