JuliaAI / MLJModels.jl

Home of the MLJ model registry and tools for model queries and model code loading
MIT License

`asinh` transformation #536

Open ParadaCarleton opened 8 months ago

ParadaCarleton commented 8 months ago

Similar to the Box-Cox transformation, the asinh or pseudolog transformation is a common transformation for reducing skewness and stabilizing variance. It's most often used for variables that are roughly log-normal but can take on both positive and negative values; for example, net worth is often well-modeled as log-normal for the majority of the population, but can be negative if debts exceed assets. asinh(x/2) is approximately equal to sign(x) * ln(|x|) for large values of |x|, but is approximately linear (about x/2) for values close to 0.
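A quick numerical check of these two limits, as a minimal Python sketch (plain standard library, not MLJ code). The helper names `pseudolog` and `signed_log` are made up for illustration.

```python
import math

def pseudolog(x):
    """asinh(x / 2), the unscaled pseudolog transform."""
    return math.asinh(x / 2)

def signed_log(x):
    """sign(x) * ln(|x|), the large-|x| approximation."""
    return math.copysign(math.log(abs(x)), x)

# Far from zero the transform tracks the signed logarithm.
for x in (1e3, -1e3, 1e6):
    print(x, pseudolog(x), signed_log(x))

# Close to zero the transform is nearly linear, with slope 1/2.
for x in (1e-2, -1e-3):
    print(x, pseudolog(x), x / 2)
```

The sign matters for negative inputs: the transform is odd, so large negative values map to roughly -ln(|x|).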

The general form of the transformation is y = scale * asinh(x / (2 * scale)), with scale a parameter chosen to satisfy some criterion such as stable variance, minimum skewness, or maximum log-likelihood of the transformed data under a normal distribution.
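As a minimal sketch of the scaled transform and its inverse in Python (the function names are hypothetical, and the inverse is simply the formula solved for x):

```python
import math

def asinh_transform(x, scale):
    """Forward transform: scale * asinh(x / (2 * scale))."""
    return scale * math.asinh(x / (2 * scale))

def asinh_inverse(y, scale):
    """Inverse transform: 2 * scale * sinh(y / scale)."""
    return 2 * scale * math.sinh(y / scale)

scale = 2.5
for x in (-1e4, -3.2, 0.0, 0.5, 1e6):
    y = asinh_transform(x, scale)
    print(x, y, asinh_inverse(y, scale))  # the last column recovers x up to rounding
```

Larger values of `scale` widen the region around zero where the transform is close to linear; smaller values make it behave like a signed logarithm sooner.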

ablaom commented 8 months ago

Sounds like a good idea. PR welcome :wink:

ParadaCarleton commented 8 months ago

> Sounds like a good idea. PR welcome 😉

Working on this ATM. Which optimization package do you recommend for optimizing a continuous parameter? I know there are a dozen implementations of gradient descent, Newton's method, etc., but I'm not sure which of these packages are already being used by MLJ.jl (I don't want to add too much compilation time).

ablaom commented 8 months ago

The only optimization that is a core part of MLJ is MLJTuning, but that currently applies only to supervised models. Thanks for the offer of help!

I'm somewhat embarrassed to say that UnivariateBoxCoxTransformer uses a simple grid search over a fixed parameter range (the resolution is a hyperparameter). It seems to work fine, and we've not had any complaints about it. That said, I'd be happy for you to implement something better here (or for the Box-Cox transformer). If you need an extra dependency, I suggest putting the implementation in a separate package, as we certainly don't want to add any AD dependency; see MLJTSVDInterface.jl for a template. Generally, going forward, I'm reluctant to add new built-in models to MLJModels, so this might be best in any case.
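For illustration, here is what a grid search of this kind could look like for the asinh scale, as a hedged Python sketch. It is not the UnivariateBoxCoxTransformer code: the log-spaced grid, the bounds `lo` and `hi`, the resolution `n_grid`, and the choice of a normal profile log-likelihood as the criterion are all assumptions made for the example.

```python
import math
import random

def profile_loglik(xs, scale):
    """Normal log-likelihood of the transformed data, with the mean and
    variance set to their maximum-likelihood values, plus the log-Jacobian
    of y = scale * asinh(x / (2 * scale))."""
    n = len(xs)
    ys = [scale * math.asinh(x / (2 * scale)) for x in xs]
    mean = sum(ys) / n
    var = sum((y - mean) ** 2 for y in ys) / n
    # dy/dx = 1 / sqrt(4 + (x / scale)^2)
    log_jacobian = sum(-0.5 * math.log(4 + (x / scale) ** 2) for x in xs)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1) + log_jacobian

def grid_search_scale(xs, lo=1e-2, hi=1e2, n_grid=101):
    """Return the grid point in [lo, hi] (log-spaced) with the highest log-likelihood."""
    grid = [lo * (hi / lo) ** (i / (n_grid - 1)) for i in range(n_grid)]
    return max(grid, key=lambda s: profile_loglik(xs, s))

# Synthetic data: normal draws pushed through the inverse transform.
rng = random.Random(0)
xs = [2 * math.sinh(rng.gauss(0.5, 2.0)) for _ in range(500)]
best = grid_search_scale(xs)
print(best, profile_loglik(xs, best))
```

The cost is one pass over the data per grid point, so accuracy is bounded by the grid spacing rather than by any convergence criterion.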

What do you think? What optimiser do you have in mind?

ablaom commented 8 months ago

I see in other threads you have been considering using MLJTuning.jl for optimization. I think this is probably overkill, and will require you to formulate the problem as a supervised learning problem. That's not impossible, I just think that's an unnecessarily complicated route.

I see that the SciPy BoxCox implementation uses Brent's method (a derivative-free bracketing method for one-dimensional problems), which is reliable and plenty fast for 99% of the use cases I can think of. It is provided by Optim.jl. I'll have a think about whether we want to add Optim.jl as a dep, but you can always put your implementation in a standalone package. It could then implement both the MLJModelInterface.jl and TableTransforms.jl interfaces, if you want. We recently did this at Imbalance.jl. (But note that those models are Static transformers, i.e., they do not generalize to new data, so the implementation has no fit, only transform.)
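To make the bracketing idea concrete, here is a hedged Python sketch using golden-section search, a simpler relative of Brent's method (Brent's method adds parabolic interpolation steps on top of it). It is a stand-in, not the SciPy or Optim.jl implementation. The objective, the search bracket, and all names are assumptions for the example, and the search only finds the global minimum if the objective is unimodal on the bracket.

```python
import math
import random

def golden_section_minimize(f, a, b, tol=1e-8):
    """Shrink the bracket [a, b] around a minimum of f by the golden ratio."""
    invphi = (math.sqrt(5) - 1) / 2
    c = b - invphi * (b - a)
    d = a + invphi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:
            # Minimum lies in [a, d]; the old c becomes the new d.
            b, d, fd = d, c, fc
            c = b - invphi * (b - a)
            fc = f(c)
        else:
            # Minimum lies in [c, b]; the old d becomes the new c.
            a, c, fc = c, d, fd
            d = a + invphi * (b - a)
            fd = f(d)
    return (a + b) / 2

def neg_profile_loglik(xs, log_scale):
    """Negative normal profile log-likelihood of asinh-transformed data,
    parameterized by log(scale) so that the scale stays positive."""
    scale = math.exp(log_scale)
    n = len(xs)
    ys = [scale * math.asinh(x / (2 * scale)) for x in xs]
    mean = sum(ys) / n
    var = sum((y - mean) ** 2 for y in ys) / n
    log_jacobian = sum(-0.5 * math.log(4 + (x / scale) ** 2) for x in xs)
    return 0.5 * n * (math.log(2 * math.pi * var) + 1) - log_jacobian

rng = random.Random(1)
xs = [2 * math.sinh(rng.gauss(0.0, 1.5)) for _ in range(300)]
t_best = golden_section_minimize(
    lambda t: neg_profile_loglik(xs, t), math.log(1e-2), math.log(1e2), tol=1e-6
)
scale_hat = math.exp(t_best)
print(scale_hat)
```

Each iteration costs one new function evaluation and shrinks the bracket by a constant factor, so the result is not tied to a grid resolution.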

ParadaCarleton commented 8 months ago

> I see in other threads you have been considering using MLJTuning.jl for optimization. I think this is probably overkill, and will require you to formulate the problem as a supervised learning problem. That's not impossible, I just think that's an unnecessarily complicated route.

Hmm, that's surprising. I didn't know MLJTuning.jl only supported supervised models.

You're 100% right this is overkill for the use case, but I'm on a yak shaving quest and nothing can stop me (except my limited attention span and time).

What I started looking for (and really, the cleanest way to accomplish this) is some kind of generic interface that separates the optimizer or model-fitting procedure from the model itself. I basically just want to tell the model, "Use a Box-Cox transformation for the predictors, followed by a neural network/linear regression/whatever. We can optimize all the parameters with gradient descent at the same time." Or I might want to use a model, then apply a Box-Cox transformation to the output, with everything being autodiffed through.

ablaom commented 8 months ago

> Hmm, that's surprising. I didn't know MLJTuning.jl only supported supervised models.

Minor correction. MLJTuning also supports (possibly unsupervised) outlier detection models. I guess, in principle, it supports any model implementing predict (e.g., KMeans clustering), so long as you have a way of pairing the output of predict with some ground truth "target", using some measure. In practice, however, our abstract type hierarchy gets in the way, so some changes might be needed.

As a side note, the general consensus seems to be that MLJ's abstract model type hierarchy was not an optimal design decision, but it's considerably embedded in the ecosystem. For example, it means independently developed models, like those provided by TableTransforms.jl, cannot be integrated into MLJ without the use of a wrapper or a more complex "duplicate" interface. Of course, the 3rd party package could instead buy into the type hierarchy, but they may have good reasons for not doing so.

Returning to your "exploration" of an asinh transformer, using MLJTuning.jl (as is), here's one tentative suggestion. We regard the model as a distribution-fitter. The "distribution" we are fitting is a generalization of Normal but with extra parameters for the asinh-transform. A distribution-fitter can be contrived as a supervised model whose training features X are always nothing; only the target y plays a role. See, e.g., here and the linked example from tests for details, but note this is experimental and not implemented anywhere, as far as I can tell. We optimise the parameters by wrapping the supervised model in TunedModel from MLJTuning. If we specify measure=log_loss then the wrapped model predicts the distribution with maximum likelihood values of the parameters. (Other proper scoring rules compatible with our new distribution could also be used.) We extract the transform parameters and wrap the whole workflow in a learning network, exported so that transform applies the learned transform to the supplied (new) data.
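One way to read "a generalization of Normal with extra parameters for the asinh-transform" is the distribution of x when scale * asinh(x / (2 * scale)) is Normal(mu, sigma). The Python sketch below writes out that log-density and the corresponding log loss over a set of observations. This is an assumed formulation for illustration, not MLJ code, and the names `asinh_normal_logpdf` and `mean_log_loss` are hypothetical.

```python
import math

def asinh_normal_logpdf(x, mu, sigma, scale):
    """Log-density of x when y = scale * asinh(x / (2 * scale)) ~ Normal(mu, sigma).

    By change of variables: log p(x) = log N(y; mu, sigma) + log |dy/dx|,
    with dy/dx = 1 / sqrt(4 + (x / scale)^2).
    """
    y = scale * math.asinh(x / (2 * scale))
    log_normal = -0.5 * math.log(2 * math.pi * sigma**2) - (y - mu) ** 2 / (2 * sigma**2)
    log_jacobian = -0.5 * math.log(4 + (x / scale) ** 2)
    return log_normal + log_jacobian

def mean_log_loss(xs, mu, sigma, scale):
    """Average negative log-density of the observations under one fitted distribution."""
    return -sum(asinh_normal_logpdf(x, mu, sigma, scale) for x in xs) / len(xs)

observations = [-12.0, -0.4, 0.0, 3.5, 250.0]
print(mean_log_loss(observations, mu=0.5, sigma=2.0, scale=1.0))
```

Minimizing this quantity over mu, sigma, and scale is the same as maximizing the likelihood, which is the role log_loss plays in the suggestion above.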

One fly in the ointment is that yhat = predict(distribution_fitter, fitresult, nothing) is a single (probabilistic) prediction, but MLJTuning expects to pair the output with multiple ground truth observations y. One thought was to provide a "cone" wrapper for measures that allows you to pair a single (probabilistic) prediction with multiple ground truth labels, but I never got around to seeing if this would actually work, and there could be better ideas. Maybe predict(distribution_fitter, fitresult, vector_of_nothing) predicts copies of the distribution?

I understand that I am not addressing the fact that MLJTuning is missing some hooks into more general optimisation strategies. But I think we can break that off as a separate issue, right?

ParadaCarleton commented 8 months ago

I think what you're suggesting works, but for now I've implemented a hard-coded optimizer using Newton's method :sweat_smile:
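For readers unfamiliar with the approach, here is a generic Python sketch of one-dimensional Newton minimization. It is not the implementation referred to above: the finite-difference derivatives, the step size `h`, the stopping rule, and the test function are all assumptions for the example.

```python
import math

def newton_minimize(f, t0, h=1e-4, tol=1e-8, max_iter=50):
    """Minimize a smooth 1-D function with Newton steps t <- t - f'(t) / f''(t),
    using central finite differences for both derivatives."""
    t = t0
    for _ in range(max_iter):
        f_plus, f_mid, f_minus = f(t + h), f(t), f(t - h)
        grad = (f_plus - f_minus) / (2 * h)
        curv = (f_plus - 2 * f_mid + f_minus) / h**2
        if curv <= 0:
            # Without positive curvature the Newton step may head to a maximum.
            raise ValueError("non-positive curvature; Newton step is unreliable here")
        step = grad / curv
        t -= step
        if abs(step) < tol:
            break
    return t

# Convex test function with its minimum at t = 1.
t_min = newton_minimize(lambda t: math.cosh(t - 1.0), t0=0.0)
print(t_min)
```

Newton's method converges quickly near a minimum but needs a reasonable starting point and positive curvature, which is why the sketch raises an error instead of taking a step when the curvature estimate is not positive.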

Mostly just to separate this into 2 PRs, since overhauling the whole workflow for tuning unsupervised models seems like it should get a separate issue.