JuliaData / DataTables.jl

(DEPRECATED) A rewrite of DataFrames.jl based on Nullable
Other
29 stars 11 forks source link

Stop auto-promoting column-types #30

Closed cjprybol closed 7 years ago

cjprybol commented 7 years ago

I was hitting issues during the rebase and thought it would be smarter to just branch from master, dump the edits in, and push to overwrite the old branch. Looks like that was a bad idea and the prior PR was automatically closed.

To summarize some of the features discussed earlier and still present in this PR:

https://github.com/JuliaData/DataTables.jl/pull/24

cjprybol commented 7 years ago

New, potentially disruptive (but probably not?) change I have added to this. stack/unstack & meltdt/unstackdt were all performing lots of Array type conversions, the first to NullableCategoricalArrays/NullableArrays and the later to the internal AbstractVector types of RepeatedArray/StackedArray. Removing the automatic array type promotion of stack/unstack wasn't too bad, but removing it from meltdt/unstackdt is essentially impossible without a rewrite of how those work (and the custom backend array types). RepeatedArrays and StackedArrays don't support all indexing types (Bool/BitArray were the two I cared about) and also, they are both another case of auto promotion of Array type that we're trying to get rid of. They don't appear to be commonly used (and are definitely not well tested), otherwise I'm sure someone would have caught these basic usability issues already.

nalimilan commented 7 years ago

I think that behavior is fine. That's the only way to avoid unnecessary copies. People can just make a copy if needed.

nalimilan commented 7 years ago

Probably not worth it in practice, but who knows...

nalimilan commented 7 years ago

Ah, too bad. I'd rather keep the old form then. Anyway iterating over keyword arguments should be quite cheap.

nalimilan commented 7 years ago

Ah, that's really annoying. This could be a sufficient motivation to add temporary hacks to support concatenation up to 3 or 4 arguments as in https://github.com/JuliaStats/NullableArrays.jl/pull/187.

cjprybol commented 7 years ago

Ok, so the backstory behind https://github.com/JuliaData/DataTables.jl/issues/34 was I had only modified the right data table of the compose_joined_table function to handle nulls in order to get all tests to pass, which seemed odd and probably incorrect. Why should we have to modify only one side to handle missing data? I was stepping through the join process to check and see if the left data table was being promoted elsewhere, and found that no, the left data table wasn't being promoted elsewhere, we just didn't have any tests that would introduce missing data to the left half of joined table. So I wrote a full combination of joins as tests, wrote out the expected answers, and caught all of these issues at the same time: The ordering was strange (but now seems correct as it matched pandas and @nalimilan mentioned there's no guarantee for ordering in SQL), nulls were not being handled correctly, and both of these issues were occurring within a few lines of each other in the compose_joined_table. Hopefully following along with these reviews hasn't been too much of a nightmare.

I've added tests to assert the joining behavior for all joins and the right join is fixed by a hacky recognition that because we order the joined table by the left table row-order and then the right table row-order, we could get the right values into the table by just writing to the last rows of the table in https://github.com/JuliaData/DataTables.jl/pull/30/files#diff-b1b6de17647445a5c08f65fd851680caR96.

@alyst are you still familiar enough with this join code from our recent PR to know why rightonly_ixs.join isn't filled with the correct values for right joins? I see here that the right join logic is the same as the left join logic but the inputs are sent to update_row_maps! in reverse order (right -> left) and then flipped before passing to compose_joined_table. I'm not sure if something is getting lost in translation during that flip.

alyst commented 7 years ago

Regarding the joined table rows ordering. The intention of the original PR was to make joins preserve the ordering of the left table as much as possible (at that time DataFrames/master ordered the result by "on" columns-induced index). If the order of the left table was not preserved, then it's a bug. Unfortunately, I don't know what is the behaviour of pandas, but neither I understand why DataTables should reproduce pandas behaviour and not, e.g., dplyr. The preservation of the left table rows order looks very natural, doesn't cost much performance and is much more tolerant to the changes the users introduce into their script. (The typical scenario. Suppose you want to add the names of the employees by joining the employees table via the "employee_id" to the payroll report that is already sorted by the department and the salary. You don't want the updated report to be sorted by employee IDs and you don't want to sort it once more. Also you don't want to change your script at multiple locations when the sorting criteria are changed.).

SQL databases are the entirely different league. The tables are stored not in the same way as SELECT query results; SQL analyzer optimizes the queries, and the intermediate "join" results might not even exist. So imposing restrictions on the rows order would make things complicated and may introduce serious performance regressions. OTOH SQL "ORDER BY" benefits from the persistent multi-column indices (and if certain index does not exist, the "ORDER BY" performance may be abysmal). I suppose it is out of DataTables scope to implement optimized multi-joins and advanced indexing.

As for the NULL values and left/right types compatibility; unfortunately, at the time of the original PR my main goal was to work around very poor PooledDataArray performance, hence these strange manipulations with left/right indices. #17 doesn't work well with null/non-null columns. In particular, some right joins that are totally valid, would generate an error, because the compose_joined_table() would first try to fill the "on" columns using the not-completely-matching left table rows. I would like to take a look at it, but ATM I don't have a second test/dev system, and it's not very easy to temporary switch the working Julia configuration from DataFrames to DataTables.

cjprybol commented 7 years ago

Thanks for the explanation @alyst! Much of this confusion comes from me not having much experience with other languages. I'm pretty terrible with R so I'm usually unable to reproduce Julia examples in R for cross-referencing, and I don't think I can recall ever writing a SQL query in my life. I can write python code, so I compared this all against pandas, and of course the prior implementation had comments to say it was written while using pandas code as an example. Your comment of

Unfortunately, I don't know what is the behaviour of pandas, but neither I understand why DataTables should reproduce pandas behaviour and not, e.g., dplyr.

is completely valid. It's more of a "I'm not competent with other data analysis languages" than a favoritism for Python/Pandas issue. Thanks for your comments here and help on #17! I guess we'll move forward with this patch for now.

cjprybol commented 7 years ago

@nalimilan if we stick with the current unstack methods they still need some work. We have issues with type stability and ordering, both dependent on the inputs. Here are some examples documented with the desired return types for columns. I believe all 4 of these examples are supposed to have the same output, although the A and B variants will have different column types. The 4 variants each return a different order and column type.

julia> using DataTables, Base.Test
WARNING: Method definition ==(Base.Nullable{S}, Base.Nullable{T}) in module Base at nullable.jl:244 overwritten in module NullableArrays at /Users/Cameron/.julia/v0.6/NullableArrays/src/operators.jl:128.

julia> dtA = DataTable(Fish = CategoricalArray(["Bob", "Bob", "Batman", "Batman"]),
                       Key = ["Mass", "Color", "Mass", "Color"],
                       Value = ["12 g", "Red", "18 g", "Grey"])
4×3 DataTables.DataTable
│ Row │ Fish     │ Key   │ Value │
├─────┼──────────┼───────┼───────┤
│ 1   │ "Bob"    │ Mass  │ 12 g  │
│ 2   │ "Bob"    │ Color │ Red   │
│ 3   │ "Batman" │ Mass  │ 18 g  │
│ 4   │ "Batman" │ Color │ Grey  │

julia> # Check that reordering levels does not confuse unstack
       levels!(dtA[1], ["XXX", "Bob", "Batman"])
4-element CategoricalArrays.NullableCategoricalArray{String,1,UInt32}:
 "Bob"
 "Bob"
 "Batman"
 "Batman"

julia> dtB = DataTable(Fish = CategoricalArray(["Bob", "Bob", "Batman", "Batman"]),
                       Key = CategoricalArray(["Mass", "Color", "Mass", "Color"]),
                       Value = CategoricalArray(["12 g", "Red", "18 g", "Grey"]))
4×3 DataTables.DataTable
│ Row │ Fish     │ Key     │ Value  │
├─────┼──────────┼─────────┼────────┤
│ 1   │ "Bob"    │ "Mass"  │ "12 g" │
│ 2   │ "Bob"    │ "Color" │ "Red"  │
│ 3   │ "Batman" │ "Mass"  │ "18 g" │
│ 4   │ "Batman" │ "Color" │ "Grey" │

julia> dt2A = unstack(dtA, :Fish, :Key, :Value)
3×3 DataTables.DataTable
│ Row │ Fish   │ Color │ Mass  │
├─────┼────────┼───────┼───────┤
│ 1   │ XXX    │ #NULL │ #NULL │
│ 2   │ Bob    │ Red   │ 12 g  │
│ 3   │ Batman │ Grey  │ 18 g  │

julia> # should be NullableCategoricalArray, Nullable, Nullable
       eltypes(dt2A)
3-element Array{Type,1}:
 Nullable{String}
 Nullable{String}
 Nullable{String}

julia> dt2B = unstack(dtB, :Fish, :Key, :Value)
2×3 DataTables.DataTable
│ Row │ Fish   │ Color  │ Mass   │
├─────┼────────┼────────┼────────┤
│ 1   │ Batman │ "Grey" │ "18 g" │
│ 2   │ Bob    │ "Red"  │ "12 g" │

julia> # should be NullableCategoricalArray, NullableCategoricalArray, NullableCategoricalArray
       eltypes(dt2B)
3-element Array{Type,1}:
 Nullable{String}
 Nullable{CategoricalArrays.CategoricalValue{String,UInt32}}
 Nullable{CategoricalArrays.CategoricalValue{String,UInt32}}

julia> dt3A = unstack(dtA, :Key, :Value)
2×3 DataTables.DataTable
│ Row │ Fish     │ Color │ Mass │
├─────┼──────────┼───────┼──────┤
│ 1   │ "Bob"    │ Red   │ 12 g │
│ 2   │ "Batman" │ Grey  │ 18 g │

julia> # should be NullableCategoricalArray, Nullable, Nullable
       eltypes(dt3A)
3-element Array{Type,1}:
 Nullable{CategoricalArrays.CategoricalValue{String,UInt32}}
 Nullable{String}
 Nullable{String}

julia> dt3B = unstack(dtB, :Key, :Value)
2×3 DataTables.DataTable
│ Row │ Fish     │ Color  │ Mass   │
├─────┼──────────┼────────┼────────┤
│ 1   │ "Batman" │ "Grey" │ "18 g" │
│ 2   │ "Bob"    │ "Red"  │ "12 g" │

julia> # should be NullableCategoricalArray, NullableCategoricalArray, NullableCategoricalArray
       eltypes(dt3B)
3-element Array{Type,1}:
 Nullable{CategoricalArrays.CategoricalValue{String,UInt32}}
 Nullable{CategoricalArrays.CategoricalValue{String,UInt32}}
 Nullable{CategoricalArrays.CategoricalValue{String,UInt32}}

julia> dt = DataTable(id=1:2, variable=["a", "b"], value=3:4)
2×3 DataTables.DataTable
│ Row │ id │ variable │ value │
├─────┼────┼──────────┼───────┤
│ 1   │ 1  │ a        │ 3     │
│ 2   │ 2  │ b        │ 4     │

julia> a = unstack(dt, :id, :variable, :value)
2×3 DataTables.DataTable
│ Row │ id │ a     │ b     │
├─────┼────┼───────┼───────┤
│ 1   │ 1  │ 3     │ #NULL │
│ 2   │ 2  │ #NULL │ 4     │

julia> eltypes(a)
3-element Array{Type,1}:
 Nullable{Int64}
 Nullable{Int64}
 Nullable{Int64}

julia> b = unstack(dt, :variable, :value)
2×3 DataTables.DataTable
│ Row │ id │ a     │ b     │
├─────┼────┼───────┼───────┤
│ 1   │ 1  │ 3     │ #NULL │
│ 2   │ 2  │ #NULL │ 4     │

julia> eltypes(b)
3-element Array{Type,1}:
 Nullable{Int64}
 Nullable{Int64}
 Nullable{Int64}

Part of the issue is with the NullableCategoricalArray constructor, it doesn't respect the ordering of levels when taking in a CategoricalArray or a normal Array. Take this example from the tests above. If we stick with the current unstack then we'll have to manually reset the levels on each constructed NullableCategoricalArray so the ordering doesn't continue change.

julia> using NullableArrays, CategoricalArrays
WARNING: Method definition ==(Base.Nullable{S}, Base.Nullable{T}) in module Base at nullable.jl:244 overwritten in module NullableArrays at /Users/Cameron/.julia/v0.6/NullableArrays/src/operators.jl:128.

julia> fish = CategoricalArray(["Bob", "Bob", "Batman", "Batman"])
4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "Bob"
 "Bob"
 "Batman"
 "Batman"

julia> levels!(fish, ["XXX", "Bob", "Batman"])
4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "Bob"
 "Bob"
 "Batman"
 "Batman"

julia> levels(fish)
3-element Array{String,1}:
 "XXX"
 "Bob"
 "Batman"

julia> # taking the first few steps through unstack
       nullfish = NullableCategoricalArray(fish)
4-element CategoricalArrays.NullableCategoricalArray{String,1,UInt32}:
 "Bob"
 "Bob"
 "Batman"
 "Batman"

julia> levels(nullfish) # why did the levels change?
3-element Array{String,1}:
 "Batman"
 "Bob"
 "XXX"

julia> nullfish = NullableCategoricalArray(fish, ordered=true)
4-element CategoricalArrays.NullableCategoricalArray{String,1,UInt32}:
 "Bob"
 "Bob"
 "Batman"
 "Batman"

julia> levels(nullfish) # still reordered even with ordered=true
3-element Array{String,1}:
 "Batman"
 "Bob"
 "XXX"

julia> droplevels!(nullfish)
4-element CategoricalArrays.NullableCategoricalArray{String,1,UInt32}:
 "Bob"
 "Bob"
 "Batman"
 "Batman"

julia> levels(nullfish)
2-element Array{String,1}:
 "Batman"
 "Bob"
cjprybol commented 7 years ago

new vcat errors

julia> using DataTables, Base.Test
INFO: Recompiling stale cache file /Users/Cameron/.julia/lib/v0.6/DataTables.ji for module DataTables.
WARNING: Method definition ==(Base.Nullable{S}, Base.Nullable{T}) in module Base at nullable.jl:244 overwritten in module NullableArrays at /Users/Cameron/.julia/v0.6/NullableArrays/src/operators.jl:128.

WARNING: deprecated syntax "inner constructor DataTableJoiner(...) around /Users/Cameron/.julia/v0.6/DataTables/src/abstractdatatable/join.jl:27".
Use "DataTableJoiner{DT1,DT2}(...) where {DT1,DT2}" instead.
WARNING: Method definition ==(Base.Nullable{S}, Base.Nullable{T}) in module Base at nullable.jl:244 overwritten in module NullableArrays at /Users/Cameron/.julia/v0.6/NullableArrays/src/operators.jl:128.

julia> dt1 = DataTable(A = 1, B = 1)
1×2 DataTables.DataTable
│ Row │ A │ B │
├─────┼───┼───┤
│ 1   │ 1 │ 1 │

julia> vcat(dt1, dt1)
2×2 DataTables.DataTable
│ Row │ A │ B │
├─────┼───┼───┤
│ 1   │ 1 │ 1 │
│ 2   │ 1 │ 1 │

julia> dt2 = DataTable(B = 1, A = 1)
1×2 DataTables.DataTable
│ Row │ B │ A │
├─────┼───┼───┤
│ 1   │ 1 │ 1 │

julia> vcat(dt1, dt2)
ERROR: ArgumentError: column order of argument(s) (1) != column order of argument(s) (2)
Stacktrace:
 [1] vcat(::DataTables.DataTable, ::DataTables.DataTable) at /Users/Cameron/.julia/v0.6/DataTables/src/abstractdatatable/abstractdatatable.jl:765

julia> vcat(dt1, dt1, dt1, dt1, dt2, dt2, dt2, dt2)
ERROR: ArgumentError: column order of argument(s) (1, 2, 3, 4) != column order of argument(s) (5, 6, 7, 8)
Stacktrace:
 [1] vcat(::DataTables.DataTable, ::DataTables.DataTable, ::DataTables.DataTable, ::Vararg{DataTables.DataTable,N} where N) at /Users/Cameron/.julia/v0.6/DataTables/src/abstractdatatable/abstractdatatable.jl:765

julia> vcat(dt2, dt2)
2×2 DataTables.DataTable
│ Row │ B │ A │
├─────┼───┼───┤
│ 1   │ 1 │ 1 │
│ 2   │ 1 │ 1 │

julia> dt3 = DataTable(A = 1, B = 1, C = 1)
1×3 DataTables.DataTable
│ Row │ A │ B │ C │
├─────┼───┼───┼───┤
│ 1   │ 1 │ 1 │ 1 │

julia> vcat(dt1, dt3)
ERROR: ArgumentError: column(s) (C) are missing from argument(s) (1)
Stacktrace:
 [1] vcat(::DataTables.DataTable, ::DataTables.DataTable) at /Users/Cameron/.julia/v0.6/DataTables/src/abstractdatatable/abstractdatatable.jl:757

julia> vcat(dt2, dt3)
ERROR: ArgumentError: column(s) (C) are missing from argument(s) (1)
Stacktrace:
 [1] vcat(::DataTables.DataTable, ::DataTables.DataTable) at /Users/Cameron/.julia/v0.6/DataTables/src/abstractdatatable/abstractdatatable.jl:757

julia> dt4 = DataTable(A = 1, B = 1, C = 1, D = 1)
1×4 DataTables.DataTable
│ Row │ A │ B │ C │ D │
├─────┼───┼───┼───┼───┤
│ 1   │ 1 │ 1 │ 1 │ 1 │

julia> vcat(dt1, dt4)
ERROR: ArgumentError: column(s) (C, D) are missing from argument(s) (1)
Stacktrace:
 [1] vcat(::DataTables.DataTable, ::DataTables.DataTable) at /Users/Cameron/.julia/v0.6/DataTables/src/abstractdatatable/abstractdatatable.jl:757

julia> vcat(dt2, dt4)
ERROR: ArgumentError: column(s) (C, D) are missing from argument(s) (1)
Stacktrace:
 [1] vcat(::DataTables.DataTable, ::DataTables.DataTable) at /Users/Cameron/.julia/v0.6/DataTables/src/abstractdatatable/abstractdatatable.jl:757

julia> vcat(dt3, dt4)
ERROR: ArgumentError: column(s) (D) are missing from argument(s) (1)
Stacktrace:
 [1] vcat(::DataTables.DataTable, ::DataTables.DataTable) at /Users/Cameron/.julia/v0.6/DataTables/src/abstractdatatable/abstractdatatable.jl:757

julia> dt5 = hcat(dt4, dt4, dt4, dt4)
1×16 DataTables.DataTable
│ Row │ A │ B │ C │ D │ A_1 │ B_1 │ C_1 │ D_1 │ A_2 │ B_2 │ C_2 │ D_2 │ A_3 │ B_3 │ C_3 │ D_3 │
├─────┼───┼───┼───┼───┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┤
│ 1   │ 1 │ 1 │ 1 │ 1 │ 1   │ 1   │ 1   │ 1   │ 1   │ 1   │ 1   │ 1   │ 1   │ 1   │ 1   │ 1   │

julia> vcat(dt3, dt5)
ERROR: ArgumentError: column(s) (D, A_1, B_1, C_1, D_1, A_2, B_2, C_2, D_2, A_3, B_3, C_3, D_3) are missing from argument(s) (1)
Stacktrace:
 [1] vcat(::DataTables.DataTable, ::DataTables.DataTable) at /Users/Cameron/.julia/v0.6/DataTables/src/abstractdatatable/abstractdatatable.jl:757

julia> dt5r = names!(copy(dt5), reverse(names(dt5)))
1×16 DataTables.DataTable
│ Row │ D_3 │ C_3 │ B_3 │ A_3 │ D_2 │ C_2 │ B_2 │ A_2 │ D_1 │ C_1 │ B_1 │ A_1 │ D │ C │ B │ A │
├─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼───┼───┼───┼───┤
│ 1   │ 1   │ 1   │ 1   │ 1   │ 1   │ 1   │ 1   │ 1   │ 1   │ 1   │ 1   │ 1   │ 1 │ 1 │ 1 │ 1 │

julia> vcat(dt5, dt5r)
ERROR: ArgumentError: column order of argument(s) (1) != column order of argument(s) (2)
Stacktrace:
 [1] vcat(::DataTables.DataTable, ::DataTables.DataTable) at /Users/Cameron/.julia/v0.6/DataTables/src/abstractdatatable/abstractdatatable.jl:765
nalimilan commented 7 years ago

Good catch about stack/unstack, but AFAICT that's unrelated to this PR. Please transform your comment into an issue so that we keep track of this.

cjprybol commented 7 years ago

Fixing the ordering isn't necessarily required by this PR, but with the PR title of "Stop auto-promoting column-types", I think the other unstack changes are justified as unstack will now preserve proper column types? This is why I had tried to re-write unstack in the first place.

edit: And if we don't fix the ordering then I'll need to duplicate the tests for unstack. I'd like to keep the ordering...

cjprybol commented 7 years ago

pending no major issues during reviews, I think this is ready to merge.

cjprybol commented 7 years ago

That sounds like a great plan, @nalimilan, I can definitely do that!

I think the logical units present in this PR that are necessary to stop autopromotion or otherwise assist with the transition away from autopromotion are:

I agree with all of your specific code comments/reviews as well, so I'll incorporate those where appropriate. Thanks for the feedback, and please don't feel the need to apologize for any delays! I'm sure you're busy enough before you start volunteering your time to maintain and develop this package, and DataTables isn't even the only package you maintain and develop. @nalimilan and @ararslan both deserve 🥇🥇s, as do the maintainers of many other packages and the language itself.

I'll leave this PR open as a reference, and now that we're going to refactor this, I'll update https://github.com/JuliaData/DataTables.jl/pull/31 to not be dependant on this PR and try and get those changes merged soon too.