Closed ablaom closed 4 years ago
@venuur If you are still happy to help out, please assign yourself to this issue. You might want to wait for #23 to merge, as you will then already have `Dates` in the Project.toml.
Please make any PR to the dev branch, as per the contributing guidelines.
Don't hesitate to ask questions here or on slack.
Happy to help. I should get around to it later in the week.
I have two initial questions.
First, what does the argument `tight` mean?
Second, I originally viewed this as a fit-transform type transformer, like `OneHotEncoder`. The reasoning is that the linearization can then be extrapolated correctly when running prediction, continuing the example from the original issue. Suppose the training features look like the following:
| dt_feature | linearized_dt_feature |
|------------|-----------------------|
| 2018-01-01 | 1 |
| 2018-01-02 | 2 |
| 2018-01-03 | 3 |
If we then want to predict for the dates 2018-01-04, 2018-01-05, then the correct linearization is the following:
| dt_feature | linearized_dt_feature |
|------------|-----------------------|
| 2018-01-04 | 4 |
| 2018-01-05 | 5 |
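The arithmetic in the tables above can be sketched in plain Julia (a minimal sketch using only the `Dates` standard library; `linearize` is a hypothetical helper, with an offset of 1 chosen to match the tables):

```julia
using Dates

# Hypothetical helper: the training minimum is the learned parameter and is
# reused unchanged at prediction time, so new dates extrapolate correctly.
# The "+ 1" offset just matches the tables above; any fixed origin works
# for fitting a linear trend.
linearize(dts, min_dt) = Float64.(Dates.value.(dts .- min_dt)) .+ 1

train = collect(Date(2018, 1, 1):Day(1):Date(2018, 1, 3))
min_dt = minimum(train)    # learned during fit
linearize(train, min_dt)   # [1.0, 2.0, 3.0]

future = [Date(2018, 1, 4), Date(2018, 1, 5)]
linearize(future, min_dt)  # [4.0, 5.0]
```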
Here's the implementation of the fit-transform variant from my current forecasting code, for reference:

```julia
using Dates
import MLJModelInterface

# Linearize date columns so that linear trends can be fit and extrapolated
# correctly: `fit` records each feature's minimum date; `transform` replaces
# each date with the (Float64) number of periods since that minimum.
MLJModelInterface.@mlj_model mutable struct DateLinearizer <: MLJModelInterface.Unsupervised
    features::Vector{Symbol} = Symbol[]
end

function MLJModelInterface.fit(model::DateLinearizer, verbosity::Int, X)
    # learned parameters: the minimum of each specified date column
    fitresult = [minimum(getproperty(X, c)) for c in model.features]
    cache = nothing
    report = nothing
    return fitresult, cache, report
end

function MLJModelInterface.transform(model::DateLinearizer, fitresult, X)
    let X = copy(X)
        for (c, min_dt) in zip(model.features, fitresult)
            dts = getproperty(X, c)
            setproperty!(X, c, @. Float64(Dates.value(dts - min_dt)))
        end
        X
    end
end
```
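For a quick standalone check of the same fit/transform logic (a sketch using only the `Dates` standard library, with columns held in a `Dict` rather than a table; `fit_minima` and `lin_transform` are hypothetical stand-ins for the MLJ methods above):

```julia
using Dates

# Stand-ins for the fit/transform pair above, minus the MLJ plumbing:
# "fit" learns each feature's minimum date; "transform" reuses those minima,
# so dates after the training window extrapolate correctly.
fit_minima(X, features) = Dict(c => minimum(X[c]) for c in features)

lin_transform(X, minima) =
    Dict(c => Float64.(Dates.value.(X[c] .- min_dt)) for (c, min_dt) in minima)

train = Dict(:dt => collect(Date(2018, 1, 1):Day(1):Date(2018, 1, 3)))
minima = fit_minima(train, [:dt])
lin_transform(train, minima)[:dt]  # [0.0, 1.0, 2.0]

Xnew = Dict(:dt => [Date(2018, 1, 4), Date(2018, 1, 5)])
lin_transform(Xnew, minima)[:dt]   # [3.0, 4.0]
```

Note the zero-based origin here, which is what the code produces, unlike the one-based tables in the example; the difference is just a constant offset and does not affect a fitted linear trend.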
Forgot to actually ask my question in the second part. Do you think this still makes sense as a method of `coerce`, given the use case I described, or am I conflating two different problems?
No, you're right. You have learned parameters here and so a transformer makes better sense.
Further context: https://github.com/alan-turing-institute/MLJModels.jl/issues/234
This requires an implementation in analogy with the similar implementation at https://github.com/alan-turing-institute/MLJScientificTypes.jl/blob/b65844b14a0335ef1f936e39e07b0036acd75cb7/src/convention/coerce.jl#L52 (for coercing arbitrary `Real` data, e.g. `Int`, to the `Continuous` element scitype). Here `Arr` is an alias for `AbstractArray` valid in that code block, which is where the new implementations should live.

Similar for `Time` and `Date`, although it may be possible to combine the implementations into one, or at least suck out the common code in a new function dispatched on the actual type.

For convenience, the `coerce` doc-string is copied below.

```
coerce(X, col1=>scitype1, col2=>scitype2, ... ; verbosity=1)
coerce(X, d::AbstractDict; verbosity=1)
```
```julia
using CategoricalArrays, DataFrames, Tables

X = DataFrame(name=["Siri", "Robo", "Alexa", "Cortana"],
              height=[152, missing, 148, 163],
              rating=[1, 5, 2, 1])
Xc = coerce(X, :name=>Multiclass, :height=>Continuous, :rating=>OrderedFactor)
schema(Xc).scitypes # (Multiclass, Continuous, OrderedFactor)
```
```julia
X = (x = [1, 2, 3], y = rand(3), z = [10, 20, 30])
Xc = coerce(X, Count=>Continuous)
schema(Xc).scitypes # (Continuous, Continuous, Continuous)
```
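To make the analogy concrete, here is a hypothetical sketch of what a `Continuous` coercion of date-time data might look like. This is not the actual MLJScientificTypes code: `to_continuous` is an assumed name, and a real method would dispatch through `coerce` and handle `Missing`, as the linked `Real` method does.

```julia
using Dates

# Hypothetical sketch only: convert Date/DateTime/Time values to Float64 via
# their underlying integer representation (days since epoch for Date; an
# integer time count for DateTime and Time), in the style of the
# Real -> Continuous coercion linked above.
to_continuous(y::AbstractArray{<:Union{Date, DateTime, Time}}) =
    Float64.(Dates.value.(y))

to_continuous([Date(2018, 1, 1), Date(2018, 1, 2)])  # consecutive days differ by 1.0
```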