JuliaAI / MLJModels.jl

Home of the MLJ model registry and tools for model queries and model code loading
MIT License

Date to linearized continuous variable for trend prediction #234

Closed venuur closed 3 years ago

venuur commented 4 years ago

When forecasting, I often transform a date feature into a linearized version of it to make it a continuous variable for linear regression, e.g.

dt_feature | linearized_dt_feature
2018-01-01 | 1
2018-01-02 | 2
2018-01-03 | 3

This seems like a good use case for a transformer for MLJ modeled after OneHotEncoder. I created a simple one for my forecasting project. I am interested in contributing the implementation, but I wanted to review the concept here before going through the effort to make a pull request.
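As a rough illustration of the table above, here is a minimal sketch using only the Dates standard library. The function name `linearize_dates` is hypothetical and is not the prototype mentioned in this comment; it assumes the origin is the earliest date and that numbering starts at 1.

```julia
using Dates

# Hypothetical helper: map each date to its day offset from the earliest
# date in the vector, starting at 1 to match the table above.
function linearize_dates(dates::AbstractVector{Date})
    origin = minimum(dates)
    # `d - origin` is a `Day` period; `Dates.value` extracts its integer count.
    return [Dates.value(d - origin) + 1 for d in dates]
end

dates = [Date(2018, 1, 1), Date(2018, 1, 2), Date(2018, 1, 3)]
linearized = linearize_dates(dates)
```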

ablaom commented 4 years ago

Closing in favour of https://github.com/alan-turing-institute/MLJScientificTypes.jl/issues/22

ablaom commented 4 years ago

Re-opening (see discussion in the above issue).

I expect we should first agree on the following:

  1. How do we determine the choice of scale? Would this be one day = 1 for Date? What about other subtypes of TimeType (specifically, Time and DateTime)?

  2. How do we determine the choice of origin (the time corresponding to zero) if not explicitly specified as a parameter (which would make sense as an option, no?)?

  3. If a table has multiple time columns, are the choices above independent or coupled?

  4. Is there a need for an inverse transform?

My suggestions:

  1. For Date, Time and DateTime, we universally use one Day = 1.0.

  2. For Time, zero is always 00:00 hours; for Date, it is the minimum date encountered in all Date and DateTime columns, unless overridden by a user-specified hyperparameter zero_date; for DateTime, it is 00:00 hours on whatever we use for the Date columns.

  3. Yes, they are coupled, as explained above.

  4. This could be added later, but we should include the information needed to go backwards in the learned parameters (fitresult).
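To make suggestions 1 to 3 concrete, here is a minimal sketch of the "one day = 1.0" scale with a shared origin. The function names (`time_to_days`, `date_to_days`, `datetime_to_days`) are hypothetical, and the coupled origin is computed by hand rather than learned in a `fit` step.

```julia
using Dates

const NS_PER_DAY = 86_400_000_000_000  # nanoseconds in 24 hours
const MS_PER_DAY = 86_400_000          # milliseconds in 24 hours

# Time: zero is 00:00, so the value is the fraction of the day elapsed.
# `Dates.value(::Time)` is the number of nanoseconds since midnight.
time_to_days(t::Time) = Dates.value(t) / NS_PER_DAY

# Date: whole days since zero_date.
date_to_days(d::Date, zero_date::Date) = float(Dates.value(d - zero_date))

# DateTime: days (with fractional part) since 00:00 on zero_date.
datetime_to_days(dt::DateTime, zero_date::Date) =
    Dates.value(dt - DateTime(zero_date)) / MS_PER_DAY

dates = [Date(2020, 4, 23), Date(2020, 4, 25)]
datetimes = [DateTime(2020, 4, 24, 12)]

# Coupled origin: the earliest date over all Date and DateTime columns.
zero_date = min(minimum(dates), Date(minimum(datetimes)))

date_values = [date_to_days(d, zero_date) for d in dates]
datetime_values = [datetime_to_days(dt, zero_date) for dt in datetimes]
time_value = time_to_days(Time(6))
```

Keeping `zero_date` alongside the scale is the information an inverse transform would need later.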

Any other suggestions?

@venuur Could you comment on the choice of name "Linearization"? Is there linearisation going on here? I mean, what is non-linear here? Maybe TimeTypeToContinuous{T} is a better name (with T <: TimeType)? Open to other suggestions.

I would like us to first define a Univariate version (for single vectors), as we have done for Standardizer (there is also UnivariateStandardizer). Ultimately, Standardizer could deal with both tables and vectors (I haven't got around to this for Standardizer), so the user doesn't have to remember so many models.

An advanced option we could provide is to split Time (time during the day) into two or more columns corresponding to a Fourier expansion. So in the two-column case, this would just be the sine and cosine of the continuous version above (based on a period of 1 Day). And we could do the same for the other time types, where the period is one year. This is substantially more work, however. Some of the code in OneHotEncoder could be helpful here (spawning new columns, which need new names).
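For the two-column case, a minimal sketch might look like the following. The function name `time_fourier` and the output field names are hypothetical; only the first harmonic with a period of one day is shown.

```julia
using Dates

# Hypothetical helper: encode a time of day as the sine and cosine of its
# angle on a 24-hour circle, so that 23:59 and 00:00 end up close together.
function time_fourier(t::Time)
    # Fraction of the day elapsed, in [0, 1).
    x = Dates.value(t) / 86_400_000_000_000
    # sinpi(2x) and cospi(2x) compute sin(2πx) and cos(2πx).
    return (sine = sinpi(2x), cosine = cospi(2x))
end

midnight = time_fourier(Time(0))
morning = time_fourier(Time(6))
noon = time_fourier(Time(12))
```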

venuur commented 4 years ago

First, thank you for the detailed thoughts. Definitely this model will be more robust if we settle these questions appropriately.

I'll address your four initial questions.

  1. I think a step of Day(1) makes sense in the context of Date and DateTime, but we need something else for Time, since Time cannot have a Day(1) added. For example:
julia> Date(now()) + Day(1)
2020-04-24

julia> DateTime(now()) + Day(1)
2020-04-24T11:51:06.081

julia> Time(now()) + Day(1)
ERROR: MethodError: no method matching +(::Time, ::Day)
Closest candidates are:
  +(::TimeType, ::Period, ::Period) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Dates\src\periods.jl:360
  +(::TimeType, ::Period, ::Period, ::Period...) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Dates\src\periods.jl:361
  +(::Any, ::Any, ::Any, ::Any...) at operators.jl:529
  ...
Stacktrace:
 [1] top-level scope at REPL[24]:1

I think Hour could be a reasonable default step, but I also think it would be reasonable to require the user to provide something in the Time case.

  2. I agree with your zeros.

  3. Agreed.

  4. I think with the following parameter setup, we can provide inversion (eventually anyway). Suppose we have two hyperparameters, zero_date=t::(T<:TimeType) and step=p::Period, so that the increment from 0.0 to 1.0 corresponds to going from t to t+p. This would also make inverting straightforward, either from the hyperparameter if provided or from the fitresult found by the minimum value.
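A minimal sketch of that parameter setup, restricted to DateTime for brevity. The struct `DateTimeScale` and the functions `to_continuous` and `to_datetime` are hypothetical names, and the step is assumed to be a fixed-length period (Week, Day, Hour, and so on), since Month and Year have no fixed length in milliseconds.

```julia
using Dates

# Hypothetical container for the two hyperparameters discussed above.
struct DateTimeScale
    zero_date::DateTime
    step::Period
end

# Length of one step in milliseconds (errors for Month or Year steps,
# which cannot be converted to a fixed number of milliseconds).
step_ms(s::DateTimeScale) = Dates.value(convert(Millisecond, s.step))

# Forward: zero_date maps to 0.0 and zero_date + step maps to 1.0.
to_continuous(s::DateTimeScale, t::DateTime) =
    Dates.value(t - s.zero_date) / step_ms(s)

# Inverse: recover the DateTime, rounded to the nearest millisecond.
to_datetime(s::DateTimeScale, x::Real) =
    s.zero_date + Millisecond(round(Int, x * step_ms(s)))

scale = DateTimeScale(DateTime(2020, 4, 23), Hour(24))
x = to_continuous(scale, DateTime(2020, 4, 24, 12))
t = to_datetime(scale, x)
```

If `zero_date` is not supplied by the user, the same struct could be filled in from the minimum value seen during fitting and stored in the fitresult.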

My original name choice isn't very important to me, but for the sake of full transparency, here's the explanation. I used "Linearization" as a reminder that if you use this feature in a linear regression it implies a linear trend feature, but that's a model specific interpretation. I like the name you proposed, so I will draft my first version with that name.

I agree with the univariate first approach.

My summary of the plan is

  1. Implement UnivariateTimeTypeToContinuous{T} where {T <: TimeType} based on my current prototype focusing on correct behavior for all three time types.
  2. Implement TimeTypeToContinuous{T} where {T <: TimeType}, handling multiple columns.
  3. Add inverse functionality.

I like the idea of Fourier expansion, though I think that should be a separate issue to discuss the different modeling options. I'm most familiar with expanding date times to create categorical labels such as "day of week", "month in year", "hour in day", "X-period in day". It'd be super useful, but I'm not sure whether the right output is continuous or multiclass. I've always treated them as the latter and used something like one-hot encoding to then convert them to continuous. It'd be great to get your opinion on this distinction.
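For reference, the categorical labels mentioned here can be read off with accessor functions from the Dates standard library. The function name `calendar_labels` is hypothetical; the outputs are integer codes that would still need to be treated as multiclass and, for example, one-hot encoded.

```julia
using Dates

# Hypothetical helper: extract integer calendar labels from a DateTime.
function calendar_labels(dt::DateTime)
    return (
        day_of_week = dayofweek(dt),   # 1 = Monday, ..., 7 = Sunday
        month_in_year = month(dt),     # 1 to 12
        hour_in_day = hour(dt),        # 0 to 23
    )
end

labels = calendar_labels(DateTime(2020, 4, 23, 11))
```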

venuur commented 4 years ago

I created a first pass implementation on my fork: https://github.com/venuur/MLJModels.jl/blob/venuur-add-timetype-continuous/src/builtins/Transformers.jl#L329

@ablaom There are still some tricky corner cases about how to define the type of the transformer and what type the output should be if the TimeType differs between the transformer and the feature vector. I would appreciate it if you could review it for high-level feedback while I think through those corner cases.

ablaom commented 4 years ago

Many thanks for that! Sounds like we're basically on the same page.

Re 1 and

julia> DateTime(now()) + Day(1)
2020-04-24T11:51:06.081

What I meant is let the scale be 24 hours:

julia> DateTime(now()) + Hour(24)
2020-04-30T08:52:24.281

I think for conceptual consistency it makes sense to keep all the scales the same. Do you have a use case for making them different?

Regarding a review: could you please make a PR, which would make this easier for me? You can mark it as WIP or Draft if you like, and you can make a completely new PR later if that suits.

ablaom commented 4 years ago

cc: @vollmersj (time series)

venuur commented 4 years ago

I agree the scales should be the same. I didn’t think of using Hour(24). I’ll update the code to use that and then set up a PR to review.