JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

Average of Dates #54542

Open alaindebecker opened 3 months ago

alaindebecker commented 3 months ago

That you cannot add Dates and cannot divide a Date by an integer seems perfectly normal. However, computing the mean of Dates is well founded and sometimes badly needed.

Example:


```julia
using Statistics, Dates

mean([Date("2024-05-22"), Date("2024-05-20")])
### ERROR: MethodError: no method matching /(::Date, ::Int64)

sum([Date("2024-05-22"), Date("2024-05-20")]) ÷ 2
### ERROR: MethodError: no method matching +(::Date, ::Date)

mean([DateTime("2024-05-22"), DateTime("2024-05-20")])
### ERROR: MethodError: no method matching /(::DateTime, ::Int64)

versioninfo()
#==
Julia Version 1.10.2
Commit bd47eca2c8 (2024-03-01 10:14 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 12 × Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
Threads: 1 default, 0 interactive, 1 GC (on 12 virtual cores)
==#
```

oscardssmith commented 3 months ago

What should this return for [Date("2024-05-22"), Date("2024-05-21")]? In general, there isn't a correct answer to this.

alaindebecker commented 3 months ago

Workaround: mean_dates(dates) = convert(Date, Day(round(mean(Dates.value.(dates)))))

Rounding is mandatory to avoid ERROR: InexactError: Int64(whatever) but the best strategy would be to return a DateTime and leave the question of rounding explicit to the user.

alaindebecker commented 3 months ago

To answer your question about the average of [Date("2024-05-22"), Date("2024-05-21")], I'd say that because the two dates are only accurate to within 24 hours anyway, their mean cannot be more accurate than that, so 2024-05-22 and 2024-05-21 would answer the question equally well.

Joking aside, following the principle of this discussion about the average of integers [return a Float and let the user explicitly decide about spurious accuracy], I'd say that the mean of 2024-05-22 and 2024-05-21 is

t1 = Date("2024-05-22")
t2 = Date("2024-05-21")
arg = [t1, t2]
convert(DateTime, Millisecond(mean(Dates.value.(DateTime.(arg)))))
# 2024-05-21T12:00:00

which (outside Julia) I'd round to 2024-05-21, because t2 is in fact somewhere between 2024-05-21T00:00:00 and 2024-05-21T23:59:59 and t1 between 2024-05-22T00:00:00 and 2024-05-22T23:59:59.
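
A minimal sketch of the "return a DateTime, let the user round" idea, using only the Dates and Statistics standard libraries. The name `mean_datetime` is made up for illustration, each Date is assumed to stand for its midnight, and the `round(Int64, ...)` is there only so that a fractional millisecond mean does not raise an InexactError.

```julia
using Dates, Statistics

# Hypothetical helper (not part of Dates): mean of Dates as a DateTime.
# Each Date is taken at midnight, which is what DateTime(::Date) does.
function mean_datetime(dates::AbstractVector{Date})
    ms = mean(Dates.value.(DateTime.(dates)))   # Float64 milliseconds
    return convert(DateTime, Millisecond(round(Int64, ms)))
end

m = mean_datetime([Date("2024-05-22"), Date("2024-05-21")])

# The caller then picks the rounding rule explicitly.
down = Date(floor(m, Day))   # towards the earlier date
up   = Date(ceil(m, Day))    # towards the later date
```

Whether `down` or `up` is the right answer is exactly the question this thread leaves open; the sketch only makes the choice visible at the call site.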

martinholters commented 3 months ago

Another line of thought: For two values, their mean is their "middle". For 2024-05-21 and 2024-05-22, their middle seems to be midnight, i.e. 2024-05-22T00:00:00, so their mean should be 2024-05-22.

No, it shouldn't. IMHO, we shouldn't define it, as it is unclear what it should be. Let's not define mean for Dates.

martinholters commented 3 months ago

mean_dates(dates) = convert(Date, Day(round(mean(Dates.value.(dates)))))

julia> convert(Date, Day(round(mean(Dates.value.([Date("2000-01-01"), Date("2004-01-01")])))))
2001-12-31

I'm certain someone would consider this a bug and expect 2002-01-01. Did I mention I'd prefer not to define mean for Dates?

StefanKarpinski commented 3 months ago

I'm a little unclear what the definition of average/mean is when you can neither add nor divide by a count. Median and extrema seem well-defined to me, but mean feels a lot iffier...

alaindebecker commented 3 months ago

Generations of astronomers did it however. After all, for them time is just a number, the Julian day number.

Personally, I need it for a regression y=f(t) with t the time. And from time to time, I also need it when I have a bunch of events that are supposed to occur at about the same time, and whose times are known to be normally distributed.

It is just like temperature: adding temperatures or dividing one by a count has no meaning, but you find average temperatures in any newspaper.
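
For the regression use case, one way to get by today is to map the dates onto a numeric axis first. The sketch below is only an illustration with made-up data: the dates, the `y` values and all variable names are assumptions, and the fit is the textbook closed form for a straight line rather than anything provided by Dates.

```julia
using Dates, Statistics

# Hypothetical data lying exactly on y = 2t + 1,
# where t is the number of days since the first date.
dates = [Date(2024, 5, 20), Date(2024, 5, 21), Date(2024, 5, 23)]
y = [1.0, 3.0, 7.0]

# Turn the dates into a numeric axis (days since a reference date).
t = Float64.(Dates.value.(dates .- first(dates)))

# Ordinary least squares for a straight line, written with means.
t_mean, y_mean = mean(t), mean(y)
slope = sum((t .- t_mean) .* (y .- y_mean)) / sum((t .- t_mean) .^ 2)
intercept = y_mean - slope * t_mean

# The mean of the time axis maps back to a DateTime (rounded to the millisecond).
mean_time = DateTime(first(dates)) + Millisecond(round(Int64, t_mean * 86_400_000))
```

Here the mean of the time axis is a day and a third after the first date, which is why it is mapped back to a DateTime rather than a Date.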

vtjnash commented 3 months ago

Computationally, it is also easy to define in a rigorous way, because while Date cannot be added, delta days can be. And we can conveniently pick day 0 for the arithmetic, which makes it seem like our Date kind and Days kind are almost alike in units (although in strict mathematics, they are not):

julia> x = [Date("2024-05-22"), Date("2024-05-20")];

julia> d0 = Date("0000-01-01")
0000-01-01

julia> mean(x .- d0) .+ d0
2024-05-21
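
The same idea can be written so that the fractional part stays visible. This is a sketch under assumptions: `mean_offset_days` is a hypothetical name, and it returns the mean offset as a Float64 number of days so that no rounding happens implicitly. It also shows that the choice of reference date does not change the result.

```julia
using Dates, Statistics

# Hypothetical helper: mean of Dates via integer day offsets from `ref`.
# Returns a Float64 number of days, so nothing is rounded implicitly.
function mean_offset_days(dates::AbstractVector{Date}, ref::Date)
    return mean(Dates.value.(dates .- ref))
end

x = [Date("2024-05-22"), Date("2024-05-20")]

# Two different reference dates lead to the same final date.
for ref in (Date("0000-01-01"), Date("2024-05-20"))
    off = mean_offset_days(x, ref)
    println(ref + Day(round(Int64, off)))
end
```
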
jariji commented 3 months ago

I think the decision depends on the rules for dividing Day(n) by an integer in the definition of mean. Some options:

  1. It fails always.

  2. It works if it's divisible and errors otherwise like the current behavior of /(::Day, ::Int):

julia> Day(2)/2
1 day

julia> Day(1)/2
ERROR: InexactError: Int64(0.5)
  3. It rounds to a full Day using some rounding rule. That's consistent with a "fixed point" interpretation of date and datetime types.

  4. It promotes to DateTime. That's consistent with the behavior of 1/2 promoting to float.
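
To make options 2 to 4 concrete, here is a rough sketch. `divide_days` and its `policy` keyword are hypothetical names, not an existing or proposed Dates API, and the arithmetic is done on the raw integer day count so that it does not depend on how `/(::Day, ::Int)` behaves.

```julia
using Dates

# Hypothetical helper illustrating options 2-4 for dividing a Day count by an integer.
function divide_days(p::Day, k::Integer; policy::Symbol = :strict)
    n = Dates.value(p)
    if policy === :strict        # option 2: only exact division succeeds
        rem(n, k) == 0 || throw(InexactError(:divide_days, Day, n / k))
        return Day(div(n, k))
    elseif policy === :round     # option 3: round to a whole Day (ties go to even)
        return Day(round(Int64, n / k))
    elseif policy === :promote   # option 4: switch to a finer unit (truncated to whole ms)
        return Millisecond(div(n * 86_400_000, k))
    else
        throw(ArgumentError("unknown policy $policy"))
    end
end

divide_days(Day(2), 2)                      # exact, stays a Day
divide_days(Day(3), 2; policy = :round)     # rounded to a whole Day
divide_days(Day(1), 2; policy = :promote)   # half a day, in milliseconds
```

Option 3 still has to pick a tie-breaking rule, and option 4 only moves the same question down to the millisecond level.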

Imho it would be best if we could define precise semantics for the date type and for mean so that the answer to this issue would follow unambiguously. I don't like that /(::Day, ::Int) and /(::Int, ::Int) seem to follow different principles.

alaindebecker commented 3 months ago

Totally agree with @vtjnash: a Date is a point in Time, and Time is continuous.

alaindebecker commented 3 months ago

@jariji: I suggest solution 4, although I usually use solution 3, rounding down.

The reason is that a Date like 2024-05-25 means any point in time between 2024-05-25 midnight and 2024-05-26 midnight. So a Date refers to a point in time which is (on average) 12 hours after its literal value, and the mean of a bunch of Dates will be on average 12 hours after the value computed from their integer values (Dates.value).

However, solution 4 is in accordance with the Julia philosophy: let the rounding rule be explicitly stated by the user.

Solutions 1 and 2 are just painful.

alaindebecker commented 3 months ago

mean_dates(dates) = convert(Date, Day(round(mean(Dates.value.(dates)))))

julia> convert(Date, Day(round(mean(Dates.value.([Date("2000-01-01"), Date("2004-01-01")])))))
2001-12-31

I'm certain someone would consider this a bug and expect 2002-01-01. Did I mention I'd prefer not to define mean for Dates?

Oh yes, there is a leap year at one end and not at the other, so 2001-12-31 is in fact correct, just as

convert(DateTime, Millisecond(mean(Dates.value.([DateTime("2000-01-01"), DateTime("2004-01-01")]))))
2001-12-31T12:00:00

Maybe I should rephrase the issue title as Average of Time (time not being a Julia type).

alaindebecker commented 3 months ago

Workaround: mean_dates(dates) = convert(Date, Day(round(mean(Dates.value.(dates)))))

Rounding is mandatory to avoid ERROR: InexactError: Int64(whatever) but the best strategy would be to return a DateTime and leave the question of rounding explicit to the user:

Workaround: mean_dates(dates) = convert(DateTime, Millisecond(round(Int64, mean(Dates.value.(DateTime.(dates))))))

martinholters commented 3 months ago

a Date is a point in Time, and Time is continuous

a Date like 2024-05-25 means any point in time between 2024-05-25 midnight and 2024-05-26 midnight

So... a Date is a specific but unknown (within the day) point in time?

To me, a date is rather an interval (usually of 24 hours length).

Oh yes, there is a leap year at one end and not at the other, so 2001-12-31 is in fact correct

Ok, but why then do I get 2006-01-01 for the mean of 2004-01-01 and 2008-01-01? Same situation wrt leap year, no?

If I look at this:

julia> for y in 2000:2020
           d1 = Date(y)
           d2 = Date(y+4)
           m = convert(Date, Day(round(mean(Dates.value.([d1, d2])))))
           println("Mean of $(d1) and $(d2) is $(m)")
       end
Mean of 2000-01-01 and 2004-01-01 is 2001-12-31
Mean of 2001-01-01 and 2005-01-01 is 2003-01-01
Mean of 2002-01-01 and 2006-01-01 is 2004-01-02
Mean of 2003-01-01 and 2007-01-01 is 2004-12-31
Mean of 2004-01-01 and 2008-01-01 is 2006-01-01
Mean of 2005-01-01 and 2009-01-01 is 2007-01-02
Mean of 2006-01-01 and 2010-01-01 is 2008-01-01
Mean of 2007-01-01 and 2011-01-01 is 2009-01-01
Mean of 2008-01-01 and 2012-01-01 is 2009-12-31
Mean of 2009-01-01 and 2013-01-01 is 2011-01-01
Mean of 2010-01-01 and 2014-01-01 is 2012-01-02
Mean of 2011-01-01 and 2015-01-01 is 2012-12-31
Mean of 2012-01-01 and 2016-01-01 is 2014-01-01
Mean of 2013-01-01 and 2017-01-01 is 2015-01-02
Mean of 2014-01-01 and 2018-01-01 is 2016-01-01
Mean of 2015-01-01 and 2019-01-01 is 2017-01-01
Mean of 2016-01-01 and 2020-01-01 is 2017-12-31
Mean of 2017-01-01 and 2021-01-01 is 2019-01-01
Mean of 2018-01-01 and 2022-01-01 is 2020-01-02
Mean of 2019-01-01 and 2023-01-01 is 2020-12-31
Mean of 2020-01-01 and 2024-01-01 is 2022-01-01

I do believe this makes perfect sense in some contexts - but also that it may be rather confusing in others. (And I certainly couldn't predict these results.)
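
A possible explanation for the pattern, sketched below: for these years the two dates are 1461 days apart, an odd number, so the mean of the two day values always ends in .5, and `round` defaults to ties-to-even. Which way it goes therefore depends on the parity of the day value. The helper `midpoints` is a hypothetical name used only to compare the default with `RoundNearestTiesUp`.

```julia
using Dates, Statistics

# Hypothetical helper: midpoint of Jan 1 of year `y` and Jan 1 of year `y + 4`
# under two tie-breaking rules.
function midpoints(y::Integer)
    d1, d2 = Date(y), Date(y + 4)
    m = mean(Dates.value.([d1, d2]))   # ends in .5 when d2 - d1 is 1461 days
    ties_even = convert(Date, Day(Int64(round(m))))                       # default RoundNearest
    ties_up   = convert(Date, Day(Int64(round(m, RoundNearestTiesUp))))   # always the later day
    return ties_even, ties_up
end

for y in 2000:2003
    println(y, " => ", midpoints(y))
end
```

With ties always rounded up the result is consistently 731 days after the first date, although the calendar date it lands on still shifts with the position of the leap year.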

adienes commented 3 months ago

I think it would be a very bad choice to make the mean of Date to round to a Date like is proposed. To me it feels pretty hacky and confusing.

the mean of integers is non-integral...

alaindebecker commented 3 months ago

I think it would be a very bad choice to make the mean of Date to round to a Date like is proposed. To me it feels pretty hacky and confusing.

the mean of integers is non-integral...

That's why I said (but nobody seems to have read it): the best strategy would be to return a DateTime and leave the question of rounding explicit to the user.

martinholters commented 3 months ago

I'd be ok with defining mean for DateTime. However, I'm not sure the default convert(DateTime, ::Date) should be automatically invoked to also define mean for Date then.

alaindebecker commented 3 months ago

a Date is a point in Time, and Time is continuous

a Date like 2024-05-25 means any point in time between 2024-05-25 midnight and 2024-05-26 midnight

So... a Date is a specific but unknown (within the day) point in time?

To me, a date is rather an interval (usually of 24 hours length).

Oh yes, there is a leap year at one end and not at the other, so 2001-12-31 is in fact correct

Ok, but why then do I get 2006-01-01 for the mean of 2004-01-01 and 2008-01-01? Same situation wrt leap year, no?

You hit what Julia calls the calendrical vs temporal nature of time (see the docs). Since the Babylonian astronomers, you record time on a calendar and compute time on $\mathbb{R}$. Too bad the year is not an exact number of days.

alaindebecker commented 3 months ago

I'd be ok with defining mean for DateTime. However, I'm not sure the default convert(DateTime, ::Date) should be automatically invoked to also define mean for Date then.

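Something like this?

using Dates, Statistics

# Extend Statistics.mean (rather than defining a new Main.mean) for arrays of DateTime.
# Rounding to the nearest millisecond avoids an InexactError on fractional means.
function Statistics.mean(itr::AbstractArray{DateTime})
    return convert(DateTime, Millisecond(round(Int64, mean(Dates.value.(itr)))))
end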

Example:

using Dates, Statistics
mean([DateTime("2024-01-01"),DateTime("2025-01-01")])
2024-07-02T00:00:00
cstjean commented 3 months ago

Unitful has the same problem with regard to °C. It's interesting to consider that a generic fallback definition along the lines of Statistics.mean(vec) = zero(eltype(vec)) + sum(vec .- zero(eltype(vec))) / length(vec) would solve both problems. I guess they're both affine spaces...?

In any case, it's probably more pragmatic to define specialized methods for mean of Date and Quantity vectors, than change the generic mean fallback.

adienes commented 3 months ago

in case the reference is useful, dropping the link to the polars meta-issue for aggregations of datetime-like types https://github.com/pola-rs/polars/issues/13599

alaindebecker commented 3 months ago

Thanks @adienes, exactly what was expected.