Ultimately breaking, but I suggest keeping the old syntax temporarily.
One thought would be that, instead of `model1`, you could, reasonably easily, have a name that depends on the object, so something like `onehotencoder` or `linearregressor`, or some version thereof (adding numbers if the name is already present, maybe adding an indicator of position). It might make inspection of a pipeline (or of a learning network) easier? See the sketch below.
PS: the suggestion for the `@from_network` syntax seems good and reasonable.
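A minimal sketch of how such name generation might work. The helper `autoname` is hypothetical (not part of MLJ), and the model structs are stand-ins to keep the sketch self-contained:

```julia
# Hypothetical helper (not MLJ API): derive a pipeline field name from a
# model's type name, appending a counter when the name is already taken.
function autoname(model, taken::Set{Symbol})
    base = Symbol(lowercase(string(nameof(typeof(model)))))
    name, i = base, 1
    while name in taken
        i += 1
        name = Symbol(base, i)   # e.g. :onehotencoder2
    end
    push!(taken, name)
    return name
end

# stand-ins for real MLJ model types:
struct OneHotEncoder end
struct LinearRegressor end

taken = Set{Symbol}()
autoname(OneHotEncoder(), taken)    # :onehotencoder
autoname(OneHotEncoder(), taken)    # :onehotencoder2
autoname(LinearRegressor(), taken)  # :linearregressor
```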
Good point, thanks! One counter-argument is that a user can change a pipeline field value to a model of a different type (so `linearregressor=LinearRegressor()` becomes `linearregressor=KNNRegressor()` after mutation) and the original name loses its applicability (unless the name is generic, like `deterministic_classifier`).
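A minimal sketch of the counter-argument, assuming the auto-naming scheme above and the status-quo `@pipeline` syntax:

```julia
# the field was named after the model supplied at construction time ...
pipe = @pipeline MyPipe(linearregressor = LinearRegressor())

# ... but a later mutation leaves the name stale:
pipe.linearregressor = KNNRegressor()
```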
Which raises another issue: what to do if a user does not specify a type for a model-valued field in the `@from_network` syntax. We could either:

1. Leave the field untyped (leading to abuse, such as replacing a `Deterministic` model with a `Probabilistic` one, or replacing a `Supervised` model with an `Unsupervised` one).
2. Use a fallback type restriction informed by the default model specified, probably one of `Deterministic`, `Probabilistic`, `Unsupervised`, `Static` (see the sketch after this list).
3. Be even more restrictive, to ensure mutations to a model-valued field are not reflected in an enlargement of the `target_scitype` (binary classifier -> general multiclass classifier) or `input_scitype` (mixed features -> continuous-only features), and so forth.
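A hypothetical sketch of what option 2 might generate, assuming the default model for the field is a `Deterministic` regressor (the struct and field names here are illustrative only):

```julia
using MLJBase  # provides the abstract model types Deterministic, Probabilistic, ...

# Sketch of a composite type generated under option 2: the field type falls
# back to the abstract supertype of the default model.
mutable struct MyComposite <: Deterministic
    linearregressor::Deterministic   # restriction inferred from LinearRegressor()
end

# composite.linearregressor = KNNRegressor()        # OK: also Deterministic
# composite.linearregressor = LogisticClassifier()  # rejected: Probabilistic
```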
My vote is for 1, as 2 is more "hidden knowledge", and I don't see users changing the model types unless they kinda know what they're doing; 3 is considerably harder to implement and would possibly rule out mutations that don't actually break anything and might be useful.
Right, that makes a lot of sense. I think this is fine; there's potentially some easy "prettying" that can be done to display a pipeline in the REPL and quickly get an idea of the composing parts.

> some easy "prettying" that can be done to display a pipeline in the REPL and quickly get an idea of the composing parts.
What specifically would you improve on the status quo for showing composites, demonstrated below?
```julia
julia> @pipeline MyComposite(model1=OneHotEncoder(), model2=(@load KNNRegressor))
MyComposite(
    model1 = OneHotEncoder(
            features = Symbol[],
            drop_last = false,
            ordered_factor = true,
            ignore = false),
    model2 = KNNRegressor(
            K = 5,
            algorithm = :kdtree,
            metric = Distances.Euclidean(0.0),
            leafsize = 10,
            reorder = true,
            weights = :uniform)) @ 4…66
```
I think in terms of information, what you have is fine. I'm just thinking that there could be a side package that displays things a bit like a nice graph with labels, using an Electron window and LightGraphs or something. This is very much not a priority, but it may help people get their head around more complex networks, and/or provide a nice way to present what they've done by getting something visual they can plug into a presentation or something :)
Another idea is to allow the user to optionally specify the names with a kwarg, as in `names=["encoder", "transformer"]`, or to pass a dictionary.
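A hedged sketch of what that might look like under a status-quo-style call (proposed only; the exact signature is an assumption):

```julia
# Hypothetical kwarg for user-specified field names:
@pipeline MyComposite(OneHotEncoder(), KNNRegressor(),
                      names = ["encoder", "regressor"])
# would generate fields `encoder` and `regressor` instead of `model1`, `model2`.
```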
Here are some tweaks to the syntax, given the introduction of learning network machines in https://github.com/alan-turing-institute/MLJBase.jl/pull/310:
```julia
# definition of learning network:
Xs = source()
...
yhat = predict(...)
W = transform(...)

# definition of learning network machine:
model = Unsupervised() # other options are `Probabilistic()` or `Deterministic()`
mach = machine(model, Xs; predict=yhat, transform=W)

# definition of new composite model type (returning a default instance):
@from_network mach begin
    mutable struct MyComposite
        preprocessor::OneHotEncoder = one_hot
        clusterer = kmeans
    end
    input_scitype = Table(Continuous) # optional trait specification
end
```
Or we could have

```julia
@from_network mach mutable struct MyComposite
    preprocessor::OneHotEncoder = one_hot
    clusterer = kmeans
end
```

but that looks crowded, in my opinion.
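For concreteness, here is a hedged end-to-end sketch under the first version of the proposed syntax. The data, network and model choices (including the availability of a `KMeans` implementation) are illustrative assumptions, not part of the proposal:

```julia
using MLJ  # assumes OneHotEncoder and a KMeans implementation are loaded

X = (x1 = rand(100), x2 = rand(100))  # illustrative data

# an unsupervised learning network:
Xs = source(X)
one_hot = OneHotEncoder()
W = transform(machine(one_hot, Xs), Xs)
kmeans = KMeans()
Xout = transform(machine(kmeans, W), W)

# the learning network machine, per the proposal:
mach = machine(Unsupervised(), Xs; transform=Xout)

# the new composite model type (returning a default instance):
@from_network mach begin
    mutable struct MyComposite
        preprocessor::OneHotEncoder = one_hot
        clusterer = kmeans
    end
end
```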
As I turn to defining more canned composite models (e.g. `@stack`, `@continuous`), and with the issues #258, https://github.com/alan-turing-institute/MLJ.jl/issues/412 and https://github.com/alan-turing-institute/MLJ.jl/issues/311 in mind, I find that I am not happy with the existing syntax for these macros.

In short, the general-purpose advanced-user macro `@from_network` is not sufficiently expressive, while the common use-case `@pipeline` syntax could be simpler (because if it really doesn't do what you want it to, you can always use the more powerful `@from_network` to fix that). In an attempt to cast both of these from the same syntactic mould, I settled on an unfortunate compromise.

The following examples show more-or-less what I have in mind as substitutions:
### A simpler `@pipeline` macro

The user doesn't get to name the field names of the composite - they are always just `model1`, `model2`, and so forth - and the name of the new composite type is autogenerated, unless overwritten with the kwarg `name=...`.

If a scitype trait for the composite cannot be reliably determined, then it falls back to `Unknown`, unless the user specifies a kwarg (as in `target_scitype = AbstractVector{<:Finite}`). If the prediction type cannot be reliably determined, then a warning is issued and `:deterministic` is used, unless the user overrides with `prediction_type=:probabilistic`.
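A hedged sketch of the proposed usage (proposal only; the exact call signature is an assumption):

```julia
# Proposed simpler @pipeline (sketch): fields are auto-named model1, model2, ...
pipe = @pipeline(OneHotEncoder(), KNNRegressor(),
                 name = :MyPipe,                    # optional; else autogenerated
                 prediction_type = :deterministic)  # optional override

pipe.model1  # the OneHotEncoder
pipe.model2  # the KNNRegressor
```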
### A more expressive (and Julia-idiomatic) `@from_network` macro

The following example shows how to handle multiple operations (see the issue cited above):
Again, scitype traits that cannot be reliably deduced fall back to `Unknown`, unless explicitly provided by kwargs.

Thoughts, anyone?