JuliaAI / MLJ.jl

A Julia machine learning framework
https://juliaai.github.io/MLJ.jl/

Integrate flux models #33

Closed ysimillides closed 4 years ago

ysimillides commented 5 years ago

It would be good to have some Flux integration.

ysimillides commented 5 years ago

@ayush1999 this might interest you, alongside @sjvollmer

ayush1999 commented 5 years ago

@ysimillides Definitely interested. I opened an issue regarding this: #19. Looks like things have changed since then, right?

ablaom commented 5 years ago

Things have changed but the API for external packages has stabilised. See here for the spec.

You may also want to look at KoalaFlux, which is run under Koala. A nice feature here is that categorical features are handled, through learned feature embeddings. One can then export the learned embeddings as a pre-transformation for other models that don't handle categoricals. However, this should not be something to incorporate in a first implementation, unless you decide to just port the Koala code (which I may have a go at if I get time).

The key design question is how to encode the desired neural network architecture as hyperparameters of the MLJ model. In KoalaFlux the model gets a hyperparameter network_creator, a function mapping an integer (the number of input features) to a Flux.Chain object (see the KoalaFlux test code). This requires that the user be familiar with building a Flux chain, so it may not be ideal as a final solution. While in the short term I think this is fine, I welcome suggestions for a better way.
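A minimal sketch of that pattern (the function below is illustrative, not actual KoalaFlux code):

using Flux

# the user supplies a function from the number of input features to a chain
network_creator(n_inputs) = Flux.Chain(
    Flux.Dense(n_inputs, 20, Flux.relu),
    Flux.Dense(20, 1))

chain = network_creator(4)  # an architecture for data with 4 input features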

Also, the optimiser should be a hyperparameter, which I did not get around to doing (just used momentum).

Warning: "model" in MLJ terminology != "model" in Flux terminology. In MLJ a "model" is the name for a container of hyper-parameters.

fkiraly commented 5 years ago

Regarding spec: mostly agreed.

Minor issues below.

Metadata:

Regarding interfacing neural networks/deep learning: an architecture we're following in another project is as follows. Neural networks have a high-level and a low-level interface.

High-level: summary hyper-parameters, such as number of layers, activation function, etc.
Low-level: the hyper-parameter is the entire code specifying the NN architecture in the package (e.g., a Keras set-up).

That is, the same package may have different interfaces allowing different levels of implementation detail. Obviously, the high-level interface is more off-the-shelf but allows less customisation. I wonder whether there is a mid-level that makes sense here.
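To make the two levels concrete, a hypothetical illustration (the wrapper types and their fields are invented for this sketch, not an existing API):

using Flux

# high-level: only summary hyper-parameters; the architecture is built internally
Base.@kwdef mutable struct HighLevelFluxModel
    n_hidden_layers::Int = 2
    hidden_width::Int = 50
    activation::Function = Flux.relu
end

# low-level: the hyper-parameter is the architecture-building code itself
Base.@kwdef mutable struct LowLevelFluxModel
    builder::Function = n_in -> Flux.Chain(Flux.Dense(n_in, 50, Flux.relu),
                                           Flux.Dense(50, 1))
end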

ablaom commented 5 years ago

@fkiraly

Yes, yes, in my hasty revisions some things have gotten muddled. Mea culpa.

This is a mistake: "multiclass" means more than two classes for any target (usually one); "multivariate" means multiple targets.

What do you suggest?

In my view a model might do multiple things, and I agree that I have muddled the metadata description. I suggest returning to the earlier formulation by dumping "probabilistic" as a descriptor of outputs. The "operations" key tells you if a probabilistic prediction is possible (by including "predict_proba" as a value, in addition to "predict"). See the new adding_new_models.md just pushed.

Sure. That can be achieved trivially (and simultaneously for all models) at the level of the "machine interface" and is not needed for the model interface.

Regarding @fkiraly's comments on NNs: I agree, we should allow the user to interact with Flux at different levels of abstraction. As I say, to start with, let's implement the low-level variety, i.e., the user essentially specifies the whole code (which in Flux is not that bad) that generates the architecture (given, say, some information about the data, such as the number of inputs).

Okay, I'd best be off to the airport.

fkiraly commented 5 years ago

on a similar note: you want to think very carefully about what the (back-end and front-end) return type of “probabilistic + multivariate” should be…

What do you suggest?

Answered this in #34.

In my view a model might do multiple things, and I agree that I have muddled the metadata description. I suggest returning to the earlier formulation by dumping "probabilistic" as a descriptor of outputs.

Would suggest against doing this! This further muddles the interface in my opinion, by trying to infer what a struct does from the names of the methods it implements. "Secret knowledge" is not a good API design principle...

ablaom commented 5 years ago

I don't see that the knowledge is "secret" since the information goes into the metadata. That is, if predict_proba is implemented for a model DecisionTreeClassifier, then I will have

metadata(DecisionTreeClassifier)["operations"] = ["predict", "predict_proba"]

That said, I would like to understand the alternative you are suggesting. Are you suggesting that instead of ONE model DecisionTreeClassifier with TWO methods predict and predict_proba, we instead have TWO models DecisionTreeClassifier and DecisionTreeClassifierProbabilistic, each with a UNIQUE predict method, and write

metadata(DecisionTreeClassifier)["outputs_are"] = ["nominal", "multiclass"]
metadata(DecisionTreeClassifierProbabilistic)["outputs_are"] = ["nominal", "multiclass", "probabilistic"]

?

If so, what are the advantages, apart from being a little more explicit about purpose? Some disadvantages that I see are:

If I have misunderstood your suggestion, can you please explain your alternative in more detail?

fkiraly commented 5 years ago

I think it would be cleaner if you have multiple models.

Unless the interface is able to explicitly tell you that the DecisionTreeClassifier can do both (i.e., can have multiple metadata entries), this will imply the convention that all models with the probabilistic flag have to implement both the probabilistic and the deterministic variant.

In the clean world, getting the deterministic prediction from thresholding is natural through attaching a target transformer, which introduces the "threshold" hyper-parameter.

If you want both types of prediction, the interface could recognize that you're asking the probabilistic model for a deterministic prediction, and automatically convert by applying the 1/2 thresholder, say (which is not always the best solution regarding misclassification rate! A threshold trained on the training data may be better). In such a case, only one method has to be defined.
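A minimal sketch of the thresholding idea for a binary classifier (names are illustrative):

# the threshold is exposed as a tunable hyper-parameter
struct Thresholder
    threshold::Float64  # e.g. 0.5, or a value trained on the training data
end

# convert predicted probabilities of the positive class into labels
(t::Thresholder)(probs) = [p >= t.threshold ? :positive : :negative for p in probs]

Thresholder(0.5)([0.2, 0.7, 0.55])  # -> [:negative, :positive, :positive]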

The obvious problem with your solution is that you force users to provide the deterministic "predict" functionality, and this is usually done by 1/2 thresholding. The fact that here a choice is made is swept entirely under the rug, and it may force users to make a choice that is unnatural.

The last problem, about design incongruence, I do not see: I do not think this would imply you would have to split transformers. For classifiers, we are talking about tasks, i.e., what the model does. The classical classifier design solves two tasks in one, whereas for a target transformer, transform-input and transform-output are part of solving a single task. So this perceived incongruence rather looks like a category error to me, sorry.

ablaom commented 5 years ago

Returning to the original issue, I would like to say I am rethinking the way in which deep-learning is to be integrated with MLJ.

I think we can have a more seamless integration of deep learning and the other paradigms after realising that, once we have a general gradient-descent tuning strategy (for suitable hyperparameters of pure-Julia algorithms), our learning networks (exported as models) are essentially generalisations of neural networks. Our tuning strategy will allow tuning of (possibly nested) hyperparameters of such "generalized networks". To incorporate component models that are standard neural network architectures, we simply wrap them as models in which we declare the network weights (in a Flux chain, for example) as hyperparameters, rather than learned parameters. Since these parameters completely determine the model, fit for these models essentially does nothing, and the training of weights is externalised to MLJ (which has integrated the SGD optimiser and its variants as a way of tuning the hyperparameters of any model).
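A minimal sketch of the wrapping idea (the type and the MLJ-style fit/predict signatures are illustrative, not an existing API):

using Flux

# the chain, weights included, is stored as a *hyper-parameter*
mutable struct FluxWrapper
    chain::Flux.Chain
end

# nothing to learn internally: the weights already determine the model, and
# their training would be externalised to the (speculative) SGD tuning strategy
fit(model::FluxWrapper, X, y) = (model.chain, nothing, nothing)  # (fitresult, cache, report)
predict(model::FluxWrapper, fitresult, Xnew) = fitresult(Xnew)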

I admit this is a bit confusing at first. Is any of this making sense to others?

fkiraly commented 5 years ago

Um, wouldn't this mean replicating all the features of Flux and, at the same time, generalizing them?

This sounds like a project as large as MLJ itself... I like the idea, since it would allow you to build "learning networks" for arbitrarily specified (and arbitrarily complex) input/output combinations.

However, it also seems very ambitious given the current development team. Maybe it would be helpful to get flux's opinion on this?

And in the interim, an integration which is seamless only on the interface level (rather than down to the full model specification) might be the way to go?

Also, in general, the neural-network-specific syntax of Flux is very helpful for building neural network architectures, or retrieving default architectures. I don't think one would enjoy building a deep neural network by manually stitching together layers of GLM...

datnamer commented 5 years ago

Makes sense to me theoretically. Implementation feasibility aside, this was the insight behind the original Julia ML http://www.breloff.com/transformations/ where nodes/layers would be any transformation, including traditional learning algorithms.

Edit: On the other hand, Flux isn't supposed to be just a bunch of layers... with the whole differentiable-programming paradigm, it seems like MLJ is actually a subset of what can be expressed with Flux. https://fluxml.ai/2019/02/07/what-is-differentiable-programming.html

fkiraly commented 5 years ago

@datnamer - why do you think MLJ is a subset of Flux? I don't think currently the expressibility of one, in terms of modelling, is a subset of the other.

Of course both are Turing complete since you can write Julia in them, but I assume you mean this at the level of interface or composite construction?

Regarding transformations: yes, I think this is the right idea. Though not every algorithm is an instance of "fitting the parameters by (regularised) gradient descent" - which seems to be a common misunderstanding of the deep learning age?

fkiraly commented 5 years ago

On a side note, did Breloff leave any design documents behind for transformations? Or, is there a paper? And, is he still actively developing?

ablaom commented 5 years ago

@fkiraly

I don't think one would enjoy building a deep neural network by manually stitching together layers of GLM...

No, no. We don't duplicate the Flux syntax. You define a Flux chain, using their nice syntax. Then you have a standard wrapper for such objects, allowing you to slot them in as component models in an MLJ "learning network" (which might include non-NN components). Only the training of the NN gets externalised to MLJ (by declaring the neural net weights as model hyperparameters, instead of regarding them as parameters to be learned by calling fit), not its specification.

So syntax might look something like:

transformer = FluxWrapper(chain=Dense(100, 10, Flux.σ)) # dimension reducer
regressor = LightGBM(alpha=0.1) # a non-neural network model
composite = @pipeline transformer regressor

If you wrap composite in an SGD tuning strategy, and specify transformer.chain and regressor.alpha as the hyperparameters to be tuned, then fitting the tuned-model wrap will simultaneously train the weights of the NN and tune the regularisation parameter alpha of the regressor (assuming LightGBM is written in Julia, and so tunable by SGD). For hyperparameters that cannot be tuned by SGD (e.g., regressor.max_depth), you do separate tuning wraps (with different tuning strategies, such as grid search). As we have at present, these wraps could either be local (for tuning models individually) or global (to tune parameters in multiple component models simultaneously). Also, if we want a Flux model to present more "conventionally", we could wrap it locally in an SGD tuning strategy, but there are times you might not want to do this.
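To make the proposed usage concrete, a speculative sketch (TunedModel, machine and fit! are MLJ names, but the SGDTuning strategy, tuning over a chain's weights, and the keyword shapes below are all assumptions):

tuned = TunedModel(model=composite,
                   tuning=SGDTuning(learning_rate=0.01),  # hypothetical strategy
                   ranges=[:(transformer.chain), :(regressor.alpha)])
mach = machine(tuned, X, y)
fit!(mach)  # would train the chain's weights and tune alpha simultaneously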

fkiraly commented 5 years ago

Ah, I didn't think of that! You have a tuning wrapper which can generically tune SGD-fittable parameters and would fully interface with a composite model within? That would be genius (if it would work).

Also very interesting, since it looks like an instance of separating a "fitting strategy" from the "model specification" on the level of composites, which I don't quite understand how it could/should look generically.

fkiraly commented 5 years ago

Regarding the last bit, remembering an instance of this: Bayesians occasionally do this within the "probabilistic programming" paradigm. Though, of course, that only supports Bayesian style fitting...

ablaom commented 5 years ago

You have a tuning wrapper which can generically tune SGD-fittable parameters

Well, we don't have it yet, but getting this working has been, I understand, a goal that predates my involvement in MLJ. I think @tlienart had a go at this already for some restricted class of models. Flux already provides an AD module we can use. The problem that I see is that the parameters to be tuned (i.e., those with respect to which we want to differentiate) must be wrapped as "tracked" arrays, and I don't see how we can do this from outside the model (whose hyperparameter types are fixed). However, I understand @MikeInnes has been working on changes to the AD engine that might make this easier. Perhaps he can clarify.

So, yes, this is still somewhat speculative. The main point I want to raise is that we should not be in a hurry to integrate neural networks in the naive way (fitted in isolation), if there is a more elegant solution around the corner.

fkiraly commented 5 years ago

@ablaom yes, with "you have" I meant, of course, "in the context of the plan/design which is not yet implemented".

Though I disagree with the conclusion: as you say the full "learning network" design is somewhat speculative, and would be one-of-a-kind (so who knows whether in the end it's brilliant or just a curiosity). Whereas integration-by-interface, e.g., a simple wrapper for a flux architecture specification, is not a lot of work, and the design is obvious.

Rephrasing, I wouldn't avoid integrating an important model class entirely because there's highly interesting (but risky) research to be done on it - especially since it seems it's a quick (but somewhat boring) job? Maybe there are volunteers...

datnamer commented 5 years ago

@fkiraly :

Of course both are Turing complete since you can write Julia in them, but I assume you mean this at the level of interface or composite construction?

Yes, that's what I meant. Flux models can soon be just functions, as I think even the need for layers might be going away: https://github.com/FluxML/model-zoo/blob/notebooks/other/flux-next/intro.ipynb (@ablaom, you can take a look at that notebook to see where Flux is going. The boilerplate for models is going to be reduced, and tracker types will no longer be needed when Zygote.jl is ready for prime time. More info on the AD and compiler stuff here: https://drive.google.com/file/d/1tK4n3qQ5YsJkLc-8FEw5JMa90gHHfh3i/edit )

On a side note, did Breloff leave any design documents behind for transformations? Or, is there a paper? And, is he still actively developing?

You can find some discussion here: https://github.com/JuliaML/Roadmap.jl/issues and on the blog I linked. @Evizero might know more. I don't think @tbreloff is still working on Julia open source.

fkiraly commented 5 years ago

@datnamer thanks - Roadmap.jl doesn't look like it has a full set of design or org documents; most issues read more like an eclectic feature wishlist. There are also partial designs, which seem to focus on optimization-based machine learning methods(?). A number of the thoughts might be useful.

The fate of Roadmap.jl also makes me think, @ablaom - perhaps at some point we may like to write up the key design decisions for the benefit of future generations, just in case we all get run over by the bus or something.

ablaom commented 4 years ago

Registration of the new MLJFlux models with the MLJ model registry: https://github.com/JuliaRegistries/General/pull/16728


julia> using MLJModels

julia> models("Flux")
4-element Array{NamedTuple{(:name, :package_name, :is_supervised, :docstring, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :is_pure_julia, :is_wrapper, :load_path, :package_license, :package_url, :package_uuid, :prediction_type, :supports_online, :supports_weights, :input_scitype, :target_scitype, :output_scitype),T} where T<:Tuple,1}:
 (name = ImageClassifier, package_name = MLJFlux, ... )                  
 (name = MultitargetNeuralNetworkRegressor, package_name = MLJFlux, ... )
 (name = NeuralNetworkClassifier, package_name = MLJFlux, ... )          
 (name = NeuralNetworkRegressor, package_name = MLJFlux, ... )