Closed venuur closed 3 years ago
Closing in favour of https://github.com/alan-turing-institute/MLJScientificTypes.jl/issues/22
Re-opening (see discussion in the above issue).
I expect we should first agree on the following:
1. How do we determine the choice of scale? Would this be one day = 1 for Date? What about other subtypes of TimeType (specifically, Time and DateTime)?
2. How do we determine the choice of origin (the time corresponding to zero) if it is not explicitly specified as a parameter (which would make sense as an option, no?)?
3. If a table has multiple time columns, are the choices above independent or coupled?
4. Is there a need for an inverse transform?
My suggestions:
1. For Date, Time and DateTime, we universally use one Day = 1.0.
2. For Time, zero is always 00:00 hours; for Date it is the minimum date encountered in all Date and DateTime columns, unless overridden by the user-specified hyperparameter zero_date; for DateTime it is 00:00 hours on whatever we use for the Date columns.
3. Yes, they are coupled, as explained above.
4. This could be added later, but we should include the information needed to go backwards in the learned parameters (fitresult).
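To make the suggested conventions concrete, here is a minimal sketch, assuming one day = 1.0 for all three types and a shared zero taken as the minimum date over the Date and DateTime columns. The helper name `to_continuous` and the sample columns are hypothetical and are not part of MLJ.

```julia
using Dates

# Hypothetical helpers: elapsed time in days, as Float64.
# Date - Date yields a Day period, so its value is already a day count.
to_continuous(x::Date, zero::Date) = Float64(Dates.value(x - zero))
# DateTime - DateTime yields a Millisecond period; 86_400_000 ms per day.
to_continuous(x::DateTime, zero::DateTime) = Dates.value(x - zero) / 86_400_000
# Dates.value(::Time) is nanoseconds since 00:00, so zero is always midnight.
to_continuous(x::Time) = Dates.value(x) / 86_400_000_000_000

# Coupled origin: the minimum date seen across Date and DateTime columns.
dates = [Date(2020, 4, 23), Date(2020, 4, 25)]
datetimes = [DateTime(2020, 4, 24, 12)]
zero_date = min(minimum(dates), Date(minimum(datetimes)))

date_col = [to_continuous(d, zero_date) for d in dates]
datetime_col = [to_continuous(dt, DateTime(zero_date)) for dt in datetimes]
time_col = [to_continuous(t) for t in [Time(6), Time(18)]]
```

Because the origin is shared, a Date and a DateTime falling on the same day differ only by the fraction of the day elapsed.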
Any other suggestions?
@venuur, could you comment on the choice of the name "Linearization"? Is there linearisation going on here? I mean, what is non-linear here? Maybe TimeTypeToContinuous{T} (with T <: TimeType) is a better name? Open to other suggestions.
I would like us to first define a Univariate version (for single vectors), as we have done for Standardizer (there is also UnivariateStandardizer). Ultimately, Standardizer could deal with both tables and vectors (I haven't got around to this for Standardizer), so the user doesn't have to remember so many models.
An advanced option we could provide is to split Time (time during the day) into two or more columns corresponding to a Fourier expansion. So in the two-column case, this would just be the sine and cosine of the continuous version above (based on a period of 1 Day). And we could do the same for the other time types, where the period is one year. This is substantially more work, however. Some of the code in OneHotEncoder could be helpful here (spawning new columns, which need new names).
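The two-column case for Time could look roughly like the following sketch. The names `day_fraction` and `fourier_pair` are hypothetical, chosen only for illustration.

```julia
using Dates

# Hypothetical helper: fraction of the day elapsed since 00:00, in [0, 1).
# Dates.value(::Time) is the number of nanoseconds since midnight.
day_fraction(t::Time) = Dates.value(t) / 86_400_000_000_000

# Two-column case: sine and cosine of the continuous value, period of one day.
fourier_pair(t::Time) = (sin(2π * day_fraction(t)), cos(2π * day_fraction(t)))

midnight = fourier_pair(Time(0))
morning = fourier_pair(Time(6))
evening = fourier_pair(Time(18))
```

Unlike the single continuous column, this encoding places 23:59 and 00:00 next to each other, which is the usual motivation for a periodic expansion.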
First, thank you for the detailed thoughts. Definitely this model will be more robust if we settle these questions appropriately.
I'll address your four initial questions.
1. Day(1) makes sense in the context of Date and DateTime, but we need something else for Time, since a Day(1) cannot be added to a Time. For example:
julia> Date(now()) + Day(1)
2020-04-24
julia> DateTime(now()) + Day(1)
2020-04-24T11:51:06.081
julia> Time(now()) + Day(1)
ERROR: MethodError: no method matching +(::Time, ::Day)
Closest candidates are:
+(::TimeType, ::Period, ::Period) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Dates\src\periods.jl:360
+(::TimeType, ::Period, ::Period, ::Period...) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Dates\src\periods.jl:361
+(::Any, ::Any, ::Any, ::Any...) at operators.jl:529
...
Stacktrace:
[1] top-level scope at REPL[24]:1
I think Hour could be a reasonable default step, but I also think it would be reasonable to require the user to provide something in the Time case.
2. I agree with your zeros.
3. Agreed.
4. I think with the following parameter setup, we can provide inversion (eventually anyway). Suppose we have two hyperparameters, zero_date=t::(T<:TimeType) and step=p::Period, such that the increment from 0.0 to 1.0 corresponds to going from t to t+p. Inversion is then straightforward, using either the hyperparameter, if provided, or the fitresult found from the minimum value.
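As a rough illustration of this parameterization, the sketch below handles only DateTime and assumes that step is a fixed-length period (such as Hour(24)) that converts exactly to milliseconds. The names `transform_dt` and `inverse_dt` are hypothetical and are not taken from the prototype.

```julia
using Dates

# Hypothetical sketch: number of `step`s elapsed from `zero` to `x`.
# Assumes `step` is a fixed-length period convertible to Millisecond
# (e.g. Hour(24)); Month or Year would throw here.
function transform_dt(x::DateTime, zero::DateTime, step::Period)
    return Dates.value(x - zero) / Dates.value(convert(Millisecond, step))
end

# Inverse: recover the DateTime, rounded to the nearest millisecond.
function inverse_dt(y::Real, zero::DateTime, step::Period)
    ms = round(Int64, y * Dates.value(convert(Millisecond, step)))
    return zero + Millisecond(ms)
end

zero_date = DateTime(2020, 4, 23)
x = DateTime(2020, 4, 24, 12)
y = transform_dt(x, zero_date, Hour(24))
x_back = inverse_dt(y, zero_date, Hour(24))
```

Only zero_date and step are needed to go backwards, so storing them (or the minimum found during fitting) in the fitresult would be enough for an inverse transform.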
My original name choice isn't very important to me, but for the sake of full transparency, here's the explanation. I used "Linearization" as a reminder that if you use this feature in a linear regression it implies a linear trend feature, but that's a model specific interpretation. I like the name you proposed, so I will draft my first version with that name.
I agree with the univariate first approach.
My summary of the plan is:
1. UnivariateTimeTypeToContinuous{T} where {T <: TimeType}, based on my current prototype, focusing on correct behavior for all three time types.
2. TimeTypeToContinuous{T} where {T <: TimeType}, handling multiple columns.
I like the idea of Fourier expansion, though I think that should be a separate issue to discuss the different modeling options. I'm most familiar with expanding date times to create categorical labels such as "day of week", "month in year", "hour in day", "X-period in day". It'd be super useful, but I'm not sure whether the right output is continuous or multiclass. I've always treated them as the latter and used something like one-hot encoding to then convert them to continuous. It'd be great to get your opinion on this distinction.
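For reference, the categorical expansion described above can be sketched with the accessor functions in the Dates standard library. The variable names and the hand-rolled one-hot vector are illustrative only; in MLJ one would presumably pass such columns to OneHotEncoder instead.

```julia
using Dates

dt = DateTime(2020, 4, 23, 11, 51)

# Calendar-based labels; dayofweek uses Monday = 1, ..., Sunday = 7.
labels = (day_of_week = dayofweek(dt), month_in_year = month(dt), hour_in_day = hour(dt))

# Treating "day of week" as multiclass and one-hot encoding it by hand.
one_hot_dow = [labels.day_of_week == k for k in 1:7]
```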
I created a first pass implementation on my fork: https://github.com/venuur/MLJModels.jl/blob/venuur-add-timetype-continuous/src/builtins/Transformers.jl#L329
@ablaom There are still some tricky corner cases about how to define the type of the transformer and what type the output should be if the TimeType differs between the transformer and the feature vector. I would appreciate it if you could review it for high-level feedback while I think through those corner cases.
Many thanks for that! Sounds like we're basically on the same page.
Re 1 and
julia> DateTime(now()) + Day(1)
2020-04-24T11:51:06.081
What I meant is let the scale be 24 hours:
julia> DateTime(now()) + Hour(24)
2020-04-30T08:52:24.281
I think for conceptual consistency it makes sense to keep all the scales the same. Do you have a use case for making them different?
Regarding a review: could you please make a PR? That would make this easier for me. You can mark it as WIP or Draft if you like, and you can make a completely new PR later if that suits.
cc: @vollmersj (time series)
I agree the scales should be the same. I didn’t think of using Hour(24). I’ll update the code to use that and then set up a PR to review.
When forecasting, I often transform a date feature into a linearized version of it to make it a continuous variable for linear regression, e.g.
This seems like a good use case for a transformer for MLJ modeled after OneHotEncoder. I created a simple one for my forecasting project. I am interested in contributing the implementation, but I wanted to review the concept here before going through the effort to make a pull request.