JuliaStats / Statistics.jl

The Statistics stdlib that ships with Julia.
https://juliastats.org/Statistics.jl/dev/
Other
72 stars 40 forks source link

middle is undefined for Date and DateTime (so median does not work) #47

Open jameslefevre opened 4 years ago

jameslefevre commented 4 years ago

I have this working code:

using Dates
import Statistics.middle
middle(d1::Date, d2::Date) = Date(Dates.UTD(Int(round(middle(Dates.value(d1),Dates.value(d2))))))
middle(dt1::DateTime, dt2::DateTime) = DateTime(Dates.UTM(Int(round(middle(Dates.value(dt1),Dates.value(dt2))))))

But I wanted to quickly check that this is an appropriate approach before putting in a pull request. I saw open pull request #28 (relax type definition of middle), but I don't think helps - Date and DateTime should not subtype either Real or Number.

krisztianpinter commented 4 years ago

the problem runs deeper. apparently Statistics.jl assumes vector spaces, but most things work in affine spaces as well. in particular mean is the same concept as barycenter in affine spaces. DateTime forms an affine space, and thus mean should be defined for it, but it isn't as it is implemented now.

adding a specialization for DateTime to mean (and middle) seems like a patchwork. the true solution would be an implementation of mean which is affine space aware. consider:

mean t = t_base + sum (t_i - t_base) / n

it raises a number of issues, namely the choice of t_base, which is arbitrary in theory, but might introduce practical differences. we also don't want to work with numbers in this way for performance reasons, so some kind of Holy traits might come to the rescue.

nalimilan commented 4 years ago

This issue is about median so that's totally independent from mean.

@quinnj Do you think the definition proposed in the OP is correct? Or would it make sense to define middle in terms of x + (y-x)/2?

quinnj commented 4 years ago

OP definition seems.....fine. It's just treating Date/DateTime as raw integer values, then constructing directly from results. I think you can just do round(Int, ...) though instead of Int(round(...)).

But yeah, seems fine to me.

nalimilan commented 4 years ago

Cool. Feel free to make a PR @jameslefevre.

nalimilan commented 3 years ago

To clarify what this addition would entail: it would mean that middle on Dates chooses the "middle" date allowing for some rounding. This would differ from date arithmetic, where e.g. x + (y-x)/2 fails if the result isn't an integer instead of returning a rounded result.

This may be OK though and would kind of justify the existence of middle. We would also need a more general version of middle which would be used by quantile, taking an argument γ and returning a rounded x + γ*(y-x) as in https://github.com/JuliaLang/Statistics.jl/blob/74897fed33700dba92578aa0fefef5b99ba16086/src/Statistics.jl#L1006

jariji commented 1 year ago

Or would it make sense to define middle in terms of x + (y-x)/2?

That makes most sense to me. AFAICT it's equivalent to the current definition (x + y) / 2 except it generalizes to affine spaces like dates and Fahrenheits.


I'm also not really convinced rounding is the right thing. How do we know whether to round up or down? InexactError would be safer so the user can be intentional about the decision when it's unclear what they meant, and that's what will happen automatically using the x + (y-x)/2 implementation. I'd propose middle(today, tomorrow) error and middle(yesterday, tomorrow) == today.

This will also apply for middle('a', 'b'), which is ambiguous.

nalimilan commented 1 year ago

x + (y-x)/2 probably makes sense in general, but for Date and DateTime without rounding an error will be thrown most of the time so I'm not sure it would be useful for that use case. Maybe it would be useful for other cases though.