fkiraly opened 5 years ago
I think this is a good point. There are two choices for exposing extra functionality at present:
(i) `fit` may return additional information in its `report` dictionary (this could include functions/closures, but that was not the original intention);
(ii) one implements methods beyond `transform`, dispatched on the fit-result. This presently requires adding ("registering") the method name to MLJBase.
@ablaom, I think the `report` dictionary returned by `fit` should, at most, contain diagnostic reports of the fitting itself, and not be abused for parameter inference or reporting.
I'd personally introduce a single method for all models, e.g., `fitted_params`, which could return a dictionary of model parameters and diagnostics. These would be different for each model - for example, for ordinary least squares regression, it might return coefficients, CI, R-squared, and t/F test results.
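For illustration, such an accessor might look like this in Julia; the `OLSFit` type, its fields, and the `fitted_params` function here are hypothetical stand-ins, not part of any existing interface:

```julia
# Hypothetical sketch only: a single accessor returning a dictionary of
# learned parameters plus cheap diagnostics for an OLS-style fit.
struct OLSFit
    coefs::Vector{Float64}   # fitted coefficients
    r_squared::Float64       # in-sample R-squared
end

fitted_params(ols::OLSFit) = Dict(:coefs => ols.coefs, :r_squared => ols.r_squared)

ols = OLSFit([1.5, -0.3], 0.92)
fitted_params(ols)[:r_squared]  # 0.92
```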
What we may want to be careful about is the interaction with the parameter interface. I usually like to distinguish hyper-parameters = set externally, not changed by fit, and model parameters = no external access, set by fit.
Two issues here:
1. Type of information to be accessed after a `fit` call. I suppose we can classify these into "parameter inference" and "other". It's not clear to me how "other" can be unambiguously divided further, but help me out here if you can.
2. Method of access: dictionary or method. The original idea of the dictionary was that it would be a persistent kind of thing, or even some kind of log/history. A dictionary has the added convenience that one adds keys according to circumstance (e.g., if I set a hyperparameter requesting `fit` to rank features, then `:feature_rankings` is a key of the `report` dictionary; otherwise it is not). Actually, `report` isn't currently used to maintain a running log (by the corresponding `machine`), but it could be. A method has the advantage that extra computation required to produce the information wanted can be avoided until the user calls for it. Now that I think of it, method and dictionary could be combined - the method computes a dictionary that it returns.
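A minimal sketch of that combination - a method that lazily computes and returns a dictionary, paying for expensive items only on request. All names here are illustrative:

```julia
# Illustrative only: `report` computes a dictionary on demand; the expensive
# item (:feature_rankings) is computed only when explicitly requested.
function report(coefs::Vector{Float64}; rank_features::Bool=false)
    d = Dict{Symbol,Any}(:n_coefs => length(coefs))
    if rank_features
        # pay this cost only when the user asks for it
        d[:feature_rankings] = sortperm(abs.(coefs), rev=true)
    end
    return d
end

report([0.1, -2.0, 0.5]; rank_features=true)[:feature_rankings]  # [2, 3, 1]
```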
I like the simplicity of returning a single object to report all information of possible interest, computed after every fit, whether it be fitted parameters or whatever. What is less clear to me is whether information that requires extra computation should be accessed:
(i) by requesting the computation through an "instruction" hyperparameter and returning the result in the same `report` object; or
(ii) by having a dedicated method dispatched on the fit-result, like `predict`.
Your thoughts?
> What we may want to be careful about is the interaction with the parameter interface. I usually like to distinguish hyper-parameters = set externally, not changed by fit, and model parameters = no external access, set by fit.

Agreed!
Some thoughts (after a longer time of thinking):
I think it would be a good idea to have a dedicated interface for fitted parameters, just as we have for hyperparameters, i.e., dictionary-style, and following exactly the same structure, nesting and accessor conventions for the fitting result as we have for the models.
What is automatically returned in this extension of the fitresult are "standard model parameters that are easy to compute", i.e., it can be more than what `predict` needs but shouldn't add a lot of computational overhead. It should also be data-agnostic model structure parameters (e.g., model coefficients), or easy-to-obtain intermediate results for diagnostics (e.g., R-squared?).
Separate from this should be operations on the model that require significant computational overhead over fit/predict (e.g., variable importance), or that are data-dependent (e.g., F-test in-sample).
The standard stuff - i.e., standard methodology for diagnostics and parameter inference (e.g., for OLS: t-tests, CI, F-test, R-squared, diagnostic plots) - I'd put in fixed dispatch methods `diagnose` (returns a pretty-printable dict-like of summaries) or `diagnose_visualize` (produces plots/visualizations).
Advanced and non-standard diagnostics (e.g., specialized diagnostics or non-canonical visualizations) should be external, but these will be facilitated through the standardized model parameter interface once it exists.
Thoughts?
@fkiraly I have come around to accepting your suggestion for a dedicated method to retrieve fitted parameters, separate from the `report` field of a machine. I also agree that `params` and `fitted_params` (which will have "nested" values for composite models) should return the same kind of object. I think a Julia `NamedTuple` (like a dict but with ordered keys and type parameters for each value) is the way to go. This will also be the form of the (possibly nested) `report` field, and `report` will get an accessor function, so that `params`, `fitted_params`, and `report` are all methods that can be called on a (fitted) machine to return a named tuple.
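For reference, a quick illustration of why a `NamedTuple` fits this role - ordered keys and a concrete type per value:

```julia
# A NamedTuple preserves key order and carries a concrete type per value:
fp = (coefs = [1.5, -0.3], bias = 0.7)

fp.coefs      # access by name, like a struct
keys(fp)      # (:coefs, :bias) - insertion order preserved
typeof(fp)    # NamedTuple{(:coefs, :bias), Tuple{Vector{Float64}, Float64}}
```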
I am working on implementing these various things simultaneously.
> I think a Julia NamedTuple (like a dict but with ordered keys and type parameters for each value) is the way to go
A noteworthy difference being that a NamedTuple is immutable, could that cause a problem here?
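To make the concern concrete: a `NamedTuple` cannot be mutated in place, but a fresh one with updated fields can be built with `merge`:

```julia
fp = (coefs = [1.5, -0.3], bias = 0.7)

# fp.bias = 0.9                  # ERROR: NamedTuples are immutable
fp2 = merge(fp, (bias = 0.9,))   # rebuild with the field replaced

fp2.bias   # 0.9 - and fp itself is untouched
```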
@ablaom, I'm on board with a `NamedTuple` or dictionary returned by a method. The method should be able to return abstract structs in its fields, and the result should be able to change with each run of `fit`.
Regarding user interface: I'd make it a method (by dispatch), and call it "inspect" unless you have a better idea.
On a side note, I think this would also help greatly with the issue highlighted in the visualization issue #85 , the "report" being possibly arcane and non-standardized.
Further to this, I think computationally expensive diagnostics such as "interpretable machine learning" style meta-methods should not be bundled with "inspect", but rather with external "interpretability meta-methods" (to be dealt with at a much later point). The "inspect" interface point should be reserved for parameters or properties which do not add substantial computational overhead over "fit" - this could, for example, be defined as only constant (or log(# training data pts) ) added computational effort above "fit".
Hm, maybe another two default interface points - "print" and "plot" - would be great? These are default interface points in R.
"print" gives back a written summary, for example
```
Call:
lm(formula = weight ~ group - 1)

Residuals:
    Min      1Q  Median      3Q     Max
-1.0710 -0.4938  0.0685  0.2462  1.3690

Coefficients:
         Estimate Std. Error t value Pr(>|t|)
groupCtl   5.0320     0.2202   22.85 9.55e-15 ***
groupTrt   4.6610     0.2202   21.16 3.62e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6964 on 18 degrees of freedom
Multiple R-squared:  0.9818, Adjusted R-squared:  0.9798
F-statistic: 485.1 on 2 and 18 DF,  p-value: < 2.2e-16
```
"plot" produces a series of standard diagnostic plots, which may differ by model type and/or task. I would conjecture there are some that you always want for a task (e.g., cross-plot and residual plot for deterministic supervised regression; calibration curves for probabilistic classification), and some that you only want for a specific model class (e.g., learning curves for SGD-based methods, heatmaps for tuning methods).
Interesting question: where would "cross-plots out-of-sample" sit? Probably only available in the evaluation/validation phase, i.e., with the benchmark orchestrator.
Actually, I notice you already made a suggestion for a name: `fitted_params`. Also fine with me - though I wonder: should this include easy-to-compute stuff such as the F-statistic and in-sample R-squared as well? Or should that be left to (a separate interface point!) "inspect"? Thoughts?
Also I realize, I've already said some of these things, albeit slightly differently, on Feb 4. So greetings, @fkiraly from the past, I reserve the right to not fully agree with you.
To clarify the existing design, we have these methods (dispatched on machines; `params` also on models):

- `params` to retrieve possibly nested hyperparameters
- `fitted_params` to retrieve possibly nested learned parameters
- `report` to retrieve most everything else (could be nested), including computationally expensive stuff

As laid out in the guide (see below): whether or not a computationally expensive item is actually computed is controlled by an "instruction" hyperparameter of the model. If a default value is not overridden, the item is empty (but the key is still there), a clue to the user that more is available. I prefer this to a separate method, to avoid method-name proliferation.
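A sketch of the "instruction hyperparameter" pattern, with hypothetical model and field names (not MLJ's actual implementation):

```julia
# Hypothetical model type; `rank_features` is the "instruction" hyperparameter.
Base.@kwdef mutable struct SomeRegressor
    rank_features::Bool = false
end

function fit(model::SomeRegressor, X::Matrix{Float64}, y::Vector{Float64})
    coefs = X \ y   # least-squares fitresult
    # the report key is always present; the value is empty unless requested
    rankings = model.rank_features ? sortperm(abs.(coefs), rev=true) : Int[]
    return coefs, (feature_rankings = rankings,)
end
```

An empty `feature_rankings` in the returned report then hints to the user that more is available by flipping the hyperparameter.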
I think the above covers MLR's "print" method. But we could overload `Base.show` for named tuples to make them more user-friendly. Don't like the name "print". Print what? Just about every command prints something. (Edit: but you could say the same about "report" - aarrgh! Maybe "extras"??)
Not so keen on changing name of "report" as this is breaking.
@tlienart I think every item of `report` should be regenerated at every call to `fit` (or `update`) so that the information there is synchronised with the hyperparameter values attached to the machine's current model. So immutability is not an issue. So far, the `params` method is just a convenience method for the user; tuning is carried out using other methods.
From the guide:
> `report` is a (possibly empty) `NamedTuple`, for example, `report=(deviance=..., dof_residual=..., stderror=..., vcov=...)`.
>
> Any training-related statistics, such as internal estimates of the generalization error, and feature rankings, should be returned in the `report` tuple. How, or if, these are generated should be controlled by hyperparameters (the fields of `model`). Fitted parameters, such as the coefficients of a linear model, do not go in the report as they will be extractable from `fitresult` (and accessible to MLJ through the `fitted_params` method, see below)....
> A `fitted_params` method may be optionally overloaded. Its purpose is to provide MLJ access to a user-friendly representation of the learned parameters of the model (as opposed to the hyperparameters). They must be extractable from `fitresult`.
>
> ```julia
> MLJBase.fitted_params(model::SomeSupervisedModelType, fitresult) -> friendly_fitresult::NamedTuple
> ```
>
> For a linear model, for example, one might declare something like `friendly_fitresult=(coefs=[...], bias=...)`.
>
> The fallback is to return `(fitresult=fitresult,)`.
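A self-contained sketch of such an overload, together with the fallback, using stand-in types rather than MLJBase itself:

```julia
# Stand-in model type, not MLJBase's own:
struct SomeSupervisedModelType end

# user-friendly representation of the learned parameters
function fitted_params(model::SomeSupervisedModelType, fitresult)
    coefs, bias = fitresult
    return (coefs = coefs, bias = bias)
end

# the documented fallback, for models that don't overload the method
fitted_params(model, fitresult) = (fitresult = fitresult,)
```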
Very sensible. Maybe you'd want to make `plot` a specified/uniform interface point as well, along the lines of your suggestion in #85 (and/or mine above)?
Small detail regarding your reference to "mlr's print": mlr doesn't have a particularly good interface for pretty-printing or plotting.
It is actually the R language itself (i.e., base R) which has "print" and "plot" as designated interface points. Agreed with "print" being a strange choice of name though for pretty-printed reports - when I first saw this long long ago, I thought it might mean saving to a file, or calling an actual printer.
"report" could be "inspect" the next time we write an MLJ, but let's not change a working system.
At the moment the Plots.jl package's `plot` function is just about the "standard" Julia interface point for plotting, although the future is not clear to me and others may have a better crystal ball.
Plots.jl is a front end for plotting and, at present, most of the backends are still wrapped C/Python/Java code. It is a notorious nuisance to load and execute the first time. However, there is a "PlotsBase" (called PlotRecipes) which allows you to import the `plot` function you overload in your application without loading Plots or a backend (until you need it).
... we could factor it out into an MLJplots module, thus solving the dependency issue? I am starting to appreciate how Julia's dispatch philosophy makes this easy (though its package management functionality could be improved).
No, no. This is not necessary. We only need PlotsBase (lightweight) as a dependency. The user does need to manually load Plots.jl if they want to plot, but I don't think that's a big deal. The backends get lazy-loaded (i.e., as needed).
@fkiraly and others. Returning to your original comment opening this thread, where should one-class classification fit into our scheme? Unsupervised, yes?
In terms of taxonomy, I'd consider that something completely different, i.e., neither supervised nor unsupervised.
I'd consider one-class classifiers (including one-class kernel SVM) as an instance of outlier detectors, or anomaly detectors (if also on-line).
Even in the case where labelled outliers/artefacts/anomalies are provided in the training set, it's different from the (semi-)supervised task, since there is a designated "normal" class.
It's also different from unsupervised, since unsupervised methods have no interface point to feed back "this is an anomaly".
I.e., naturally, the one-class-SVM would have a task-specific fit/detect interface (or similar, I'm not too insistent on naming here).
One could also consider it sitting in the wider class of "annotator" tasks.
Does this mean the type hierarchy is not granular enough? Maybe it should be traits?
@datnamer, that's an interesting question for @ablaom - where do we draw the distinction between type and trait?
If I recall an earlier discussion correctly, whenever we need to dispatch or inherit differently?
It's just a feeling, but I think anomaly detectors and (un)supervised learners should be different - you can use the latter to do the former, so it feels more like a wrapper/reduction rather than trait variation.
Some coarse distinctions are realised in a type hierarchy. From the docs:

> The ultimate supertype of all models is `MLJBase.Model`, which has two abstract subtypes:
>
> ```julia
> abstract type Supervised <: Model end
> abstract type Unsupervised <: Model end
> ```
>
> `Supervised` models are further divided according to whether they are able to furnish probabilistic predictions of the target (which they then do by default) or directly predict "point" estimates, for each new input pattern:
>
> ```julia
> abstract type Probabilistic <: Supervised end
> abstract type Deterministic <: Supervised end
> ```

All further distinctions are realised with traits, some of which take values in the scitype hierarchy or in types derived from them. An example of such a trait is `target_scitype_union`.
So, I suppose we create a new abstract subtype of `MLJ.Model`, called `AnomalyDetection`? With a `predict` method that only predicts `Bool`? Or only predicts objects of scitype `Finite{2}` (a `CategoricalValue{Bool}`)? With the same traits delineating input scitypes that we have for `Unsupervised` models, yes?
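A rough sketch of what that subtype might look like, using stand-alone stand-ins for the MLJBase hierarchy; the concrete model and its thresholding "fitresult" are purely illustrative:

```julia
# Stand-ins for the MLJBase hierarchy:
abstract type Model end
abstract type Supervised <: Model end
abstract type Unsupervised <: Model end
abstract type AnomalyDetection <: Model end   # the proposed new subtype

struct OneClassSVM <: AnomalyDetection end    # hypothetical concrete model

# a detect/predict returning one Bool per observation; the "fitresult" here
# is just a scalar threshold, purely for illustration
detect(model::OneClassSVM, fitresult::Float64, Xnew::Vector{Float64}) =
    [x > fitresult for x in Xnew]

detect(OneClassSVM(), 0.5, [0.1, 0.9])  # [false, true]
```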
Obviously this is not a priority right now, but it did recently come up.
@ablaom regarding `AnomalyDetection`: agreed, though I'd just call it `detect` rather than `predict`.
Regarding unsupervised learners: have we made progress on the distinction between (i) and (ii), at least, from the first post? For #161 especially, a "transformer" type (or sub-type? aspect?) as in (i) would be necessary.
Update: actually, I think we will be fine with (i), i.e., transformer style behaviour only for ManifoldLearning.jl in #161.
Regarding unsupervised models such as PCA, kmeans, etc discussed in #44.
I know these are commonly encapsulated within the transformer formalism, but that would do the methodology behind them an injustice, as feature extraction is only one of the major use cases of unsupervised models. More precisely, there are, as far as I can see, three use cases:
(i) feature extraction. For clusterers, create a column with cluster assignment. For continuous dimension reducers, create multiple continuous columns.
(ii) model structure inference - essentially, inspection of the fitted parameters, e.g., PCA components and loadings, cluster separation metrics, etc. These may be of interest in isolation, or used as a (hyper-parameter) input of other atomic models in a learning pipeline.
(iii) full probabilistic modelling aka density estimation. This behaves as a probabilistic multivariate regressor/classifier on the input variables.
To start, it makes sense to implement only "transformer" functionality, but it is maybe good to keep in mind for implementation that eventually one may like to expose the other outputs via interfaces - e.g., the estimated multivariate density in a fully probabilistic implementation of k-means.