JuliaAI / MLJScientificTypes.jl

Implementation of the MLJ scientific type convention
MIT License

Add coerce methods to convert DateTime, Time and Date to Continuous #22

Closed ablaom closed 4 years ago

ablaom commented 4 years ago

Further context: https://github.com/alan-turing-institute/MLJModels.jl/issues/234

This requires implementation of

    coerce(::Arr{T}, ::Type{<:Union{Missing,Continuous}};
           verbosity::Int=1, tight::Bool=false) where T<:Union{Missing,DateTime}

in analogy with the similar implementation at https://github.com/alan-turing-institute/MLJScientificTypes.jl/blob/b65844b14a0335ef1f936e39e07b0036acd75cb7/src/convention/coerce.jl#L52 (which coerces arbitrary Real data, e.g. Int, to Continuous element scitype). Here Arr is an alias for AbstractArray that is valid in that code block, which is where the new implementations should live.

Similarly for Time and Date, although it may be possible to combine the implementations into one, or at least factor out the common code into a new function dispatched on the actual type.
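A minimal sketch of the "common code dispatched on the actual type" idea, using only the Dates standard library. The names `to_float` and `to_continuous` are hypothetical stand-ins, not part of MLJScientificTypes, and the sketch ignores the `verbosity` and `tight` keywords. It also assumes that the raw `Dates.value` of each element (days for Date, milliseconds for DateTime, nanoseconds for Time) is an acceptable Continuous representation, which the issue does not settle.

```julia
using Dates

# Hypothetical helper (not library API): map one temporal value to a Float64
# using its raw `Dates.value`; `missing` is passed through unchanged.
to_float(x::Union{Date,DateTime,Time}) = Float64(Dates.value(x))
to_float(::Missing) = missing

# Hypothetical stand-in for the requested `coerce` methods: one method covers
# Date, DateTime and Time arrays, with or without missing values.
function to_continuous(v::AbstractArray{T}) where T<:Union{Missing,Date,DateTime,Time}
    return to_float.(v)
end

dates = [Date(2018, 1, 1), missing, Date(2018, 1, 3)]
floats = to_continuous(dates)
```

Here `floats` keeps the missing entry in place, and consecutive calendar days differ by exactly 1.0 because `Dates.value` of a Date counts whole days.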

For convenience, the coerce doc-string is copied below.

    coerce(A, ...; tight=false, verbosity=1)

Given a table `A`, return a copy of `A` ensuring that the scitypes of the
columns match the new specifications.
The specifications can be given as a bunch of `colname=>Scitype` pairs or
as a dictionary whose keys are names and values are scientific types:

    coerce(X, col1=>scitype1, col2=>scitype2, ... ; verbosity=1)
    coerce(X, d::AbstractDict; verbosity=1)

One can also specify pairs of type `T1=>T2` in which case all columns with
scientific element type subtyping `Union{T1,Missing}` will be coerced to the
new specified scitype `T2`.

## Examples

Specifying (name, scitype) pairs:

    using CategoricalArrays, DataFrames, Tables
    X = DataFrame(name=["Siri", "Robo", "Alexa", "Cortana"],
                  height=[152, missing, 148, 163],
                  rating=[1, 5, 2, 1])
    Xc = coerce(X, :name=>Multiclass, :height=>Continuous, :rating=>OrderedFactor)
    schema(Xc).scitypes # (Multiclass, Continuous, OrderedFactor)

Specifying (T1, T2) pairs:

    X = (x = [1, 2, 3], y = rand(3), z = [10, 20, 30])
    Xc = coerce(X, Count=>Continuous)
    schema(Xc).scitypes # (Continuous, Continuous, Continuous)

ablaom commented 4 years ago

@venuur If you are still happy to help out, please assign yourself to this issue. You might want to wait for #23 to merge, since Dates will then already be in the Project.toml.

Please make any PR to the dev branch, as per the contributing guidelines.

Don't hesitate to ask questions here or on Slack.

venuur commented 4 years ago

Happy to help. I should get around to it later in the week.

venuur commented 4 years ago

I have two initial questions.

First, what does the argument tight mean?

Second, I originally viewed this as a fit-transform type transformer, like OneHotEncoder. The reasoning is that the linearization can then be extrapolated correctly when running prediction. Continuing the example from the original issue, suppose the training features look like the following:

dt_feature | linearized_dt_feature
2018-01-01 | 1
2018-01-02 | 2
2018-01-03 | 3

If we then want to predict for the dates 2018-01-04, 2018-01-05, then the correct linearization is the following:

dt_feature | linearized_dt_feature
2018-01-04 | 4
2018-01-05 | 5
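The two tables above can be reproduced with a small sketch that uses only the Dates standard library. The names `origin` and `linearize` are hypothetical, and placing the origin one day before the earliest training date is an assumption made here so that the first training date maps to 1, as in the tables.

```julia
using Dates

train  = [Date(2018, 1, 1), Date(2018, 1, 2), Date(2018, 1, 3)]
future = [Date(2018, 1, 4), Date(2018, 1, 5)]

# "Fit": learn the reference date from the training data only.
# Assumption: origin is one day before the earliest training date.
origin = minimum(train) - Day(1)

# "Transform": number of days elapsed since the learned origin, as Float64.
linearize(dts, origin) = [Float64(Dates.value(d - origin)) for d in dts]

train_lin  = linearize(train, origin)    # offsets for the training dates
future_lin = linearize(future, origin)   # continues the same numbering

# Re-learning the origin from the prediction dates would restart the count,
# which is why the origin has to be stored as a learned parameter.
refit_lin = linearize(future, minimum(future) - Day(1))
```

With the stored origin, the prediction dates map to 4 and 5. Refitting on the prediction dates alone maps them to 1 and 2, which breaks the trend.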

Here's the implementation of the fit-transform variant for reference from my current forecasting code:

    import Dates
    import MLJModelInterface

    # Linearize dates to fit and predict linear trends correctly.
    MLJModelInterface.@mlj_model mutable struct DateLinearizer <: MLJModelInterface.Unsupervised
        features::Vector{Symbol} = Symbol[]
    end

    # Learn the earliest date of each selected column.
    function MLJModelInterface.fit(model::DateLinearizer, verbosity::Int, X)
        fitresult = [minimum(getproperty(X, c)) for c in model.features]
        cache = nothing
        report = nothing
        return fitresult, cache, report
    end

    # Replace each selected column by its offset from the learned minimum.
    function MLJModelInterface.transform(model::DateLinearizer, fitresult, X)
        let X = copy(X)
            for (c, min_dt) in zip(model.features, fitresult)
                dts = getproperty(X, c)
                setproperty!(X, c, @. Float64(Dates.value(dts - min_dt)))
            end
            return X
        end
    end

venuur commented 4 years ago

Forgot to actually ask my question in the second part. Do you think this still makes sense as a method of coerce given the use case I described, or am I conflating two different problems?

ablaom commented 4 years ago

No, you're right. You have learned parameters here and so a transformer makes better sense.