Open alaindebecker opened 3 months ago
What should this return for [Date("2024-05-22"), Date("2024-05-21")]? In general, there isn't a correct answer to this.
Workaround:
mean_dates(dates) = convert(Date, Day(round(mean(Dates.value.(dates)))))
Rounding is mandatory to avoid ERROR: InexactError: Int64(whatever), but the best strategy would be to return a DateTime and leave the question of rounding explicit to the user.
To answer your question about the average of [Date("2024-05-22"), Date("2024-05-21")], I'd say that because the two dates are only accurate to 24 hours anyway, their mean cannot be more accurate than 24 hours; therefore 2024-05-22 and 2024-05-21 would be equally valid answers.
Joking apart, following the principle of this discussion about average of integers [return a Float and let the user explicitly decide about spurious accuracy], I'd say that the mean of 2024-05-22 and 2024-05-21 is
t1 = Date("2024-05-22")
t2 = Date("2024-05-21")
arg = [t1, t2]
convert(DateTime, Millisecond(mean(Dates.value.(DateTime.(arg)))))
2024-05-21T12:00:00
which (outside of Julia) I'd round to 2024-05-21, because t2 is in fact somewhere between 2024-05-21T00:00:00 and 2024-05-21T23:59:59 and t1 between 2024-05-22T00:00:00 and 2024-05-22T23:59:59.
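To make "leave the rounding explicit to the user" concrete, here is a minimal sketch, assuming the mean is first computed as a DateTime with the expression above. The variable names are only illustrative; `floor`, `ceil` and `round` on a DateTime with a `Day` period come from the Dates standard library.

```julia
using Dates, Statistics

t1 = Date("2024-05-22")
t2 = Date("2024-05-21")

# Mean as a DateTime, using the same expression as above
m = convert(DateTime, Millisecond(mean(Dates.value.(DateTime.([t1, t2])))))

# Each rounding rule is a separate, explicit call made by the user
down    = Date(floor(m, Day))   # start of the day containing m
up      = Date(ceil(m, Day))    # start of the next day, unless m is already at midnight
nearest = Date(round(m, Day))   # Dates rounds ties up by default (RoundNearestTiesUp)

println((m, down, up, nearest))
```

For these two dates, flooring gives 2024-05-21, while ceiling and the default rounding both give 2024-05-22, so the choice of rule really does change the answer.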
Another line of thought: For two values, their mean is their "middle". For 2024-05-21 and 2024-05-22, their middle seems to be midnight, i.e. 2024-05-22T00:00:00, so their mean should be 2024-05-22.
No, it shouldn't. IMHO, we shouldn't define it, as it is unclear what it should be. Let's not define mean for Dates.
mean_dates(dates) = convert(Date, Day(round(mean(Dates.value.(dates)))))
julia> convert(Date, Day(round(mean(Dates.value.([Date("2000-01-01"), Date("2004-01-01")])))))
2001-12-31
I'm certain someone would consider this a bug and expect 2002-01-01. Did I mention I'd prefer not to define mean for Dates?
I'm a little unclear what the definition of average/mean is when you can neither add nor divide by a count. Median and extrema seem well-defined to me, but mean feels a lot iffier...
Generations of astronomers did it however. After all, for them time is just a number, the Julian day number.
Personally, I need it for a regression y=f(t) with t the time. And from time to time, I also need it when I have a bunch of events that are supposed to occur at about the same time but are known to be normally distributed.
It is just like temperature: adding or dividing by a count has no meaning, but you find average temperatures in any newspaper.
Computationally, it is also easy to define in a rigorous way, because while Date cannot be added, delta days can be. And we can conveniently pick day 0 for the arithmetic, which makes it seem like our Date kind and Days kind are almost alike in units (although in strict mathematics, they are not):
julia> x = [Date("2024-05-22"), Date("2024-05-20")];
julia> d0 = Date("0000-01-01")
0000-01-01
julia> mean(x .- d0) .+ d0
2024-05-21
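The example above works in whole days because the summed offsets happen to be divisible by the count. A minimal sketch of the same origin-based idea in milliseconds, which should give the same instant for any origin (up to millisecond rounding); `mean_from_origin` is a hypothetical helper name, not part of Dates or Statistics:

```julia
using Dates, Statistics

# Hypothetical helper: mean of Dates via millisecond offsets from an arbitrary origin.
function mean_from_origin(dates::AbstractVector{Date}, origin::Date)
    offsets = DateTime.(dates) .- DateTime(origin)    # Vector{Millisecond}
    avg_ms = mean(Dates.value.(offsets))              # Float64 number of milliseconds
    return DateTime(origin) + Millisecond(round(Int64, avg_ms))
end

x = [Date("2024-05-22"), Date("2024-05-21")]
a = mean_from_origin(x, Date("0000-01-01"))   # origin used in the thread
b = mean_from_origin(x, Date("2024-01-01"))   # a different, arbitrary origin
println(a == b)
```

Both calls return 2024-05-21T12:00:00, which is the sense in which the origin is only a computational convenience.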
I think the decision depends on the rules for dividing Day(n) by an integer in the definition of mean. Some options:
1. It fails always.
2. It works if it's divisible and errors otherwise, like the current behavior of /(::Day, ::Int):
julia> Day(2)/2
1 day
julia> Day(1)/2
ERROR: InexactError: Int64(0.5)
3. It rounds to a full Day using some rounding rule. That's consistent with a "fixed point" interpretation of date and datetime types.
4. It promotes to DateTime. That's consistent with the behavior of 1/2 promoting to float.
Imho it would be best if we could define precise semantics for the date type and for mean so that the answer to this issue would follow unambiguously. I don't like that /(::Day, ::Int) and /(::Int, ::Int) seem to follow different principles.
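A small sketch of the contrast, and of what a "promoting" division could look like if the period is first converted to a finer unit. This only illustrates the promotion option under that assumption; it is not how Dates currently defines `/`.

```julia
using Dates

int_half = 1 / 2        # Int / Int promotes to Float64
whole    = Day(2) / 2   # exact division stays a Day
# Day(1) / 2 is not evaluated here: as shown in the thread it throws an InexactError

# Converting to a finer fixed period first makes the half-day representable
half_day = convert(Millisecond, Day(1)) / 2
noon     = DateTime(Date("2024-05-21")) + half_day

println((int_half, whole, half_day, noon))
```

Here the caller performs the "promotion" by hand, by moving from Date/Day to DateTime/Millisecond before dividing.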
Totally agree with @vtjnash: a Date is a point in Time, and Time is continuous.
jariji: I suggest solution 4, although I usually use solution 3, rounding down.
The reason is that a Date like 2024-05-25 means any point in time between 2024-05-25 midnight and 2024-05-26 midnight. So a Date refers to a point in time which is (on average) 12 hours after its literal value, and the mean of a bunch of Dates will be on average 12 hours after Integer(DateTime.value).
However, solution 4 is in accordance with the Julia philosophy: let the rounding rule be explicitly stated by the user.
Solutions 1 and 2 are just painful.
mean_dates(dates) = convert(Date, Day(round(mean(Dates.value.(dates)))))
julia> convert(Date, Day(round(mean(Dates.value.([Date("2000-01-01"), Date("2004-01-01")])))))
2001-12-31
I'm certain someone would consider this a bug and expect 2002-01-01. Did I mention I'd prefer not to define mean for Dates?
Oh yes, there is a leap year at one end and not at the other, so 2001-12-31 is in fact correct, just as
convert(DateTime, Millisecond(mean(Dates.value.([DateTime("2000-01-01"), DateTime("2004-01-01")]))))
2001-12-31T12:00:00
Maybe I should rephrase the issue title as "Average of Time" (time not being a Julia type).
Workaround:
mean_dates(dates) = convert(Date, Day(round(mean(Dates.value.(dates)))))
Rounding is mandatory to avoid ERROR: InexactError: Int64(whatever), but the best strategy would be to return a DateTime and leave the question of rounding explicit to the user:
mean_dates(dates) = convert(DateTime, Millisecond(mean(Dates.value.(DateTime.(dates)))))
a Date is a point in Time, and Time is continuous
a Date like 2024-05-25 means any point in time between 2024-05-25 midnight and 2024-05-26 midnight
So... a Date is a specific but unknown (within the day) point in time?
To me, a date is rather an interval (usually of 24 hours length).
Oh yes, there is a leap year at one end and not at the other, so 2001-12-31 is in fact correct
Ok, but why then do I get 2006-01-01 for the mean of 2004-01-01 and 2008-01-01? Same situation wrt leap year, no?
If I look at this:
julia> for y in 2000:2020
           d1 = Date(y)
           d2 = Date(y+4)
           m = convert(Date, Day(round(mean(Dates.value.([d1, d2])))))
           println("Mean of $(d1) and $(d2) is $(m)")
       end
Mean of 2000-01-01 and 2004-01-01 is 2001-12-31
Mean of 2001-01-01 and 2005-01-01 is 2003-01-01
Mean of 2002-01-01 and 2006-01-01 is 2004-01-02
Mean of 2003-01-01 and 2007-01-01 is 2004-12-31
Mean of 2004-01-01 and 2008-01-01 is 2006-01-01
Mean of 2005-01-01 and 2009-01-01 is 2007-01-02
Mean of 2006-01-01 and 2010-01-01 is 2008-01-01
Mean of 2007-01-01 and 2011-01-01 is 2009-01-01
Mean of 2008-01-01 and 2012-01-01 is 2009-12-31
Mean of 2009-01-01 and 2013-01-01 is 2011-01-01
Mean of 2010-01-01 and 2014-01-01 is 2012-01-02
Mean of 2011-01-01 and 2015-01-01 is 2012-12-31
Mean of 2012-01-01 and 2016-01-01 is 2014-01-01
Mean of 2013-01-01 and 2017-01-01 is 2015-01-02
Mean of 2014-01-01 and 2018-01-01 is 2016-01-01
Mean of 2015-01-01 and 2019-01-01 is 2017-01-01
Mean of 2016-01-01 and 2020-01-01 is 2017-12-31
Mean of 2017-01-01 and 2021-01-01 is 2019-01-01
Mean of 2018-01-01 and 2022-01-01 is 2020-01-02
Mean of 2019-01-01 and 2023-01-01 is 2020-12-31
Mean of 2020-01-01 and 2024-01-01 is 2022-01-01
I do believe this makes perfect sense in some contexts - but also that it may be rather confusing in others. (And I certainly couldn't predict these results.)
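One likely reason the sequence looks irregular: each pair is 1461 days apart, an odd number, so the mean of the day counts always ends in .5, and Julia's `round` defaults to `RoundNearest`, which sends ties to the nearest even integer. Whether the tie goes up or down then depends on the parity of the day count rather than on the calendar. A minimal sketch comparing rounding modes; `midpoint` is a hypothetical helper name used only for this illustration:

```julia
using Dates, Statistics

# Midpoint of two Dates on the integer day count, under a chosen rounding mode.
midpoint(a::Date, b::Date, r::RoundingMode) =
    convert(Date, Day(round(mean(Dates.value.([a, b])), r)))

for (a, b) in ((Date(2000), Date(2004)), (Date(2004), Date(2008)))
    println(a, " .. ", b,
            "  ties-to-even: ", midpoint(a, b, RoundNearest),
            "  ties-up: ",      midpoint(a, b, RoundNearestTiesUp),
            "  floor: ",        midpoint(a, b, RoundDown))
end
```

With ties rounded up, both pairs give January 1st (2002-01-01 and 2006-01-01); with flooring, both give December 31st (2001-12-31 and 2005-12-31). Only the default ties-to-even mode mixes the two.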
I think it would be a very bad choice to make the mean of Date round to a Date as is proposed. To me it feels pretty hacky and confusing. The mean of integers is non-integral...
I think it would be a very bad choice to make the mean of Date round to a Date as is proposed. To me it feels pretty hacky and confusing. The mean of integers is non-integral...
That's why I said (but nobody seems to have read it): the best strategy would be to return a DateTime and leave the question of rounding explicit to the user.
I'd be ok with defining mean for DateTime. However, I'm not sure the default convert(DateTime, ::Date) should be automatically invoked to also define mean for Date then.
a Date is a point in Time, and Time is continuous
a Date like 2024-05-25 means any point in time between 2024-05-25 midnight and 2024-05-26 midnight
So... a Date is a specific but unknown (within the day) point in time?
To me, a date is rather an interval (usually of 24 hours length).
Oh yes, there is a leap year at one end and not at the other, so 2001-12-31 is in fact correct
Ok, but why then do I get 2006-01-01 for the mean of 2004-01-01 and 2008-01-01? Same situation wrt leap year, no?
You hit what Julia calls the calendrical vs temporal nature of time (see doc). Since the Babylonian astronomers, you record time on a calendar and compute time on $\mathbb{R}$. Too bad the year is not an exact number of days.
I'd be ok with defining mean for DateTime. However, I'm not sure the default convert(DateTime, ::Date) should be automatically invoked to also define mean for Date then.
Something like this?
using Dates
import Statistics  # import (not using) so that this local `mean` does not clash with Statistics.mean
function mean(itr::AbstractArray{Dates.DateTime})
    return convert(Dates.DateTime, Millisecond(Statistics.mean(Dates.value.(itr))))
end
Example:
mean([DateTime("2024-01-01"), DateTime("2025-01-01")])
2024-07-02T00:00:00
Unitful has the same problem with regard to °C. It's interesting to consider that a generic fallback definition along the lines of Statistics.mean(vec) = zero(eltype(vec)) + sum(vec .- zero(eltype(vec))) / length(vec) would solve both problems. I guess they're both affine spaces...?
In any case, it's probably more pragmatic to define specialized methods for mean of Date and Quantity vectors than to change the generic mean fallback.
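A minimal sketch of the affine-space idea, using an explicit origin instead of `zero(eltype(vec))`. Here `affine_mean` is a hypothetical helper, not part of Statistics, and defaulting the origin to the first element is an assumption made only for illustration.

```julia
using Dates, Statistics

# Hypothetical helper: differences from `origin` are summed and divided by the count,
# then added back to `origin`. Only `x - origin`, `+` and `/` by an integer are required.
affine_mean(v; origin = first(v)) = origin + sum(x - origin for x in v) / length(v)

temps = [20.0, 22.0, 27.0]    # plain numbers standing in for temperatures
println(affine_mean(temps) ≈ mean(temps))

println(affine_mean([DateTime("2024-01-01"), DateTime("2025-01-01")]))
```

For plain numbers this agrees with `Statistics.mean`, and for the two DateTimes it gives the same 2024-07-02T00:00:00 as the specialized method. It still relies on period division being exact, so for Date inputs the `Day / Int` step can throw the InexactError discussed earlier in the thread.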
In case the reference is useful, dropping the link to the polars meta-issue for aggregations of datetime-like types:
https://github.com/pola-rs/polars/issues/13599
Thanks @adienes, exactly what was expected.
That you cannot add Dates and cannot divide a Date by an integer seems perfectly normal. However, computing the mean of Dates is well founded and sometimes badly needed. Example: