JuliaData / DataTables.jl

(DEPRECATED) A rewrite of DataFrames.jl based on Nullable

`==` does not compare columns of `ZonedDateTime`s correctly #84

Open spurll opened 7 years ago

spurll commented 7 years ago

Comparing two ZonedDateTimes that represent the same "instant" (but in different time zones) with == returns true, but comparing them with isequal returns false.

julia> using TimeZones, DataFrames, DataTables

julia> ZonedDateTime(2016, 1, 1, TimeZone("America/Winnipeg")) == ZonedDateTime(2016, 1, 1, 6, TimeZone("UTC"))
true

julia> isequal(ZonedDateTime(2016, 1, 1, TimeZone("America/Winnipeg")), ZonedDateTime(2016, 1, 1, 6, TimeZone("UTC")))
false

DataFrames.jl maintains this convention:

julia> using TimeZones, DataFrames

julia> df_1 = DataFrame(id=[1,2], date=[ZonedDateTime(2016, 1, 1, 0, TimeZone("America/Winnipeg")), ZonedDateTime(2016, 1, 1, 1, TimeZone("America/Winnipeg"))])
2×2 DataFrames.DataFrame
│ Row │ id │ date                      │
├─────┼────┼───────────────────────────┤
│ 1   │ 1  │ 2016-01-01T00:00:00-06:00 │
│ 2   │ 2  │ 2016-01-01T01:00:00-06:00 │

julia> df_2 = DataFrame(id=[1,2], date=[ZonedDateTime(2016, 1, 1, 6, TimeZone("UTC")), ZonedDateTime(2016, 1, 1, 7, TimeZone("UTC"))])
2×2 DataFrames.DataFrame
│ Row │ id │ date                      │
├─────┼────┼───────────────────────────┤
│ 1   │ 1  │ 2016-01-01T06:00:00+00:00 │
│ 2   │ 2  │ 2016-01-01T07:00:00+00:00 │

julia> df_1 == df_2
true

julia> isequal(df_1, df_2)
false

...but DataTables.jl doesn't:

julia> using TimeZones, DataTables

julia> dt_1 = DataTable(id=[1,2], date=[ZonedDateTime(2016, 1, 1, 0, TimeZone("America/Winnipeg")), ZonedDateTime(2016, 1, 1, 1, TimeZone("America/Winnipeg"))])
2×2 DataTables.DataTable
│ Row │ id │ date                      │
├─────┼────┼───────────────────────────┤
│ 1   │ 1  │ 2016-01-01T00:00:00-06:00 │
│ 2   │ 2  │ 2016-01-01T01:00:00-06:00 │

julia> dt_2 = DataTable(id=[1,2], date=[ZonedDateTime(2016, 1, 1, 6, TimeZone("UTC")), ZonedDateTime(2016, 1, 1, 7, TimeZone("UTC"))])
2×2 DataTables.DataTable
│ Row │ id │ date                      │
├─────┼────┼───────────────────────────┤
│ 1   │ 1  │ 2016-01-01T06:00:00+00:00 │
│ 2   │ 2  │ 2016-01-01T07:00:00+00:00 │

julia> dt_1 == dt_2
false

It's no real mystery why, given the fairly terse definition of ==:

@compat(Base.:(==))(dt1::AbstractDataTable, dt2::AbstractDataTable) = isequal(dt1, dt2)

I think that supporting == comparisons (rather than just calling isequal all the way down) would be preferable in this case.
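
A minimal sketch of what a separate `==` could look like, using a hypothetical `ToyTable` type rather than the real `AbstractDataTable`. The type, its field names, and the `compare_tables` helper are assumptions for illustration, not DataTables.jl code. It uses `0.0` and `-0.0`, which are `==` but not `isequal`, so it runs without TimeZones:

```julia
# Hypothetical stand-in for a table type; not the DataTables.jl implementation.
struct ToyTable
    names::Vector{Symbol}
    columns::Vector{Any}   # one vector per column
end

# Compare two tables column by column with a caller-supplied predicate.
function compare_tables(pred, a::ToyTable, b::ToyTable)
    a.names == b.names || return false
    length(a.columns) == length(b.columns) || return false
    return all(pred(ca, cb) for (ca, cb) in zip(a.columns, b.columns))
end

# `==` delegates to `==` on the columns, `isequal` to `isequal`.
Base.:(==)(a::ToyTable, b::ToyTable) = compare_tables(==, a, b)
Base.isequal(a::ToyTable, b::ToyTable) = compare_tables(isequal, a, b)

t1 = ToyTable([:id, :x], Any[[1, 2], [0.0, 1.0]])
t2 = ToyTable([:id, :x], Any[[1, 2], [-0.0, 1.0]])

println(t1 == t2)         # 0.0 == -0.0, so the tables compare equal
println(isequal(t1, t2))  # isequal(0.0, -0.0) is false
```

A real implementation that overloads `isequal` would also need a matching `hash` method, which this sketch leaves out.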

Version information:

julia> Pkg.status("DataTables")
 - DataTables                    0.0.3

julia> versioninfo()
Julia Version 0.6.0-rc3.0
Commit ad290e93e4* (2017-06-07 11:53 UTC)

ararslan commented 7 years ago

Yeah, definitely. isequal and == are separate functions in Base for a reason.

spurll commented 7 years ago

I'm sure I can get a PR in for this in fairly short order.

ararslan commented 7 years ago

That would be fantastic. Thanks!

spurll commented 7 years ago

Well, I figured out why DataTables.jl has just been using isequal.

This is actually more complex to solve than I initially anticipated, owing to the fact that == checks between NullableArrays are broken.

julia> using NullableArrays

julia> a = NullableArray(1:3)
3-element NullableArrays.NullableArray{Int64,1}:
 1
 2
 3

julia> b = NullableArray(1:3)
3-element NullableArrays.NullableArray{Int64,1}:
 1
 2
 3

julia> a == b
ERROR: TypeError: non-boolean (Nullable{Bool}) used in boolean context
Stacktrace:
 [1] ==(::NullableArrays.NullableArray{Int64,1}, ::NullableArrays.NullableArray{Int64,1}) at ./abstractarray.jl:1527

This, in turn, is because == comparisons between Nullables return Nullable{Bool}, rather than Bool.

In my opinion, the best fix for this would be to provide == for NullableArrays and work with that. There was a PR to fix this in 2015, but it was never merged: https://github.com/JuliaStats/NullableArrays.jl/pull/84

I think I'm going to go ahead and take a shot at a PR here, but I suspect it isn't going to be pretty.

nalimilan commented 7 years ago

The problem is with Nullable, and fixing it in NullableArrays would require type piracy. == with NullableArray is kinda forced to be inconsistent or not to work at all because == throws an error for Nullable in Base. The solution to this will be to move either to Union{T, Null} (in DataFrames) or to DataValue{T}.
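
For illustration only: in Julia 1.0 and later, the null value from the `Union{T, Null}` approach lives in Base as `missing`. The sketch below assumes such a version (the thread itself is on Julia 0.6) and shows `==` and `isequal` staying distinct for arrays with missing entries:

```julia
# Assumes Julia 1.0 or later, where `missing` is defined in Base.
a = [1, 2, missing]
b = [1, 2, missing]

# `==` propagates missingness: the result is `missing`, not a Bool.
println(a == b)

# A definite mismatch in a non-missing position still gives `false`.
println(a == [1, 3, missing])

# `isequal` treats `missing` as equal to itself and always returns a Bool.
println(isequal(a, b))
```

Because `a == b` can be `missing`, using it directly in an `if` raises the same kind of "non-boolean used in boolean context" error shown earlier, so code that needs a plain Bool has to choose `isequal` or handle the missing case explicitly.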

davidanthoff commented 7 years ago

I think the definition for == in DataValues.jl is ok at this point. I'm also in the process of adding a DataValueArray that also fixes this, and then I'm also going to have a DataValueTable that is based on that. Essentially that will be exactly the same design as the current DataTable approach, except it will use DataValue instead of Nullable to get around the restrictions that we have due to Nullable being in Base and Nullable not being special-cased for the data science stack. I'm optimistic that I should be able to release soon, but on the flip side, classes start tomorrow, so who knows :)

spurll commented 7 years ago

Makes sense to me. I've made changes to my code that's working with DataTables to work around this issue for the moment, and I won't spend time trying to make == work as expected (at least until Nullables behave themselves a little better). Thanks, folks!