Stop auto-promoting column-types

I was hitting issues during the rebase and thought it would be smarter to just branch from master, dump the edits in, and push to overwrite the old branch. Looks like that was a bad idea and the prior PR was automatically closed.

To summarize some of the features discussed earlier and still present in this PR:

Automatic promotion of columns to nullable arrays no longer occurs
all constructors will try and recycle inputs as a byproduct of consolidating the constructors
new functions nullify!, nullify, denullify!, and denullify have been added for converting DataTables back and forth from nullable and nullfree states.

https://github.com/JuliaData/DataTables.jl/pull/24

New, potentially disruptive (but probably not?) change I have added to this. stack/unstack & meltdt/unstackdt were all performing lots of Array type conversions, the first to NullableCategoricalArrays/NullableArrays and the later to the internal AbstractVector types of RepeatedArray/StackedArray. Removing the automatic array type promotion of stack/unstack wasn't too bad, but removing it from meltdt/unstackdt is essentially impossible without a rewrite of how those work (and the custom backend array types). RepeatedArrays and StackedArrays don't support all indexing types (Bool/BitArray were the two I cared about) and also, they are both another case of auto promotion of Array type that we're trying to get rid of. They don't appear to be commonly used (and are definitely not well tested), otherwise I'm sure someone would have caught these basic usability issues already.

I think that behavior is fine. That's the only way to avoid unnecessary copies. People can just make a copy if needed.

Probably not worth it in practice, but who knows...

Ah, too bad. I'd rather keep the old form then. Anyway iterating over keyword arguments should be quite cheap.

Ah, that's really annoying. This could be a sufficient motivation to add temporary hacks to support concatenation up to 3 or 4 arguments as in https://github.com/JuliaStats/NullableArrays.jl/pull/187.

Ok, so the backstory behind https://github.com/JuliaData/DataTables.jl/issues/34 was I had only modified the right data table of the compose_joined_table function to handle nulls in order to get all tests to pass, which seemed odd and probably incorrect. Why should we have to modify only one side to handle missing data? I was stepping through the join process to check and see if the left data table was being promoted elsewhere, and found that no, the left data table wasn't being promoted elsewhere, we just didn't have any tests that would introduce missing data to the left half of joined table. So I wrote a full combination of joins as tests, wrote out the expected answers, and caught all of these issues at the same time: The ordering was strange (but now seems correct as it matched pandas and @nalimilan mentioned there's no guarantee for ordering in SQL), nulls were not being handled correctly, and both of these issues were occurring within a few lines of each other in the compose_joined_table. Hopefully following along with these reviews hasn't been too much of a nightmare.

I've added tests to assert the joining behavior for all joins and the right join is fixed by a hacky recognition that because we order the joined table by the left table row-order and then the right table row-order, we could get the right values into the table by just writing to the last rows of the table in https://github.com/JuliaData/DataTables.jl/pull/30/files#diff-b1b6de17647445a5c08f65fd851680caR96.

@alyst are you still familiar enough with this join code from our recent PR to know why rightonly_ixs.join isn't filled with the correct values for right joins? I see here that the right join logic is the same as the left join logic but the inputs are sent to update_row_maps! in reverse order (right -> left) and then flipped before passing to compose_joined_table. I'm not sure if something is getting lost in translation during that flip.

Regarding the joined table rows ordering. The intention of the original PR was to make joins preserve the ordering of the left table as much as possible (at that time DataFrames/master ordered the result by "on" columns-induced index). If the order of the left table was not preserved, then it's a bug. Unfortunately, I don't know what is the behaviour of pandas, but neither I understand why DataTables should reproduce pandas behaviour and not, e.g., dplyr. The preservation of the left table rows order looks very natural, doesn't cost much performance and is much more tolerant to the changes the users introduce into their script. (The typical scenario. Suppose you want to add the names of the employees by joining the employees table via the "employee_id" to the payroll report that is already sorted by the department and the salary. You don't want the updated report to be sorted by employee IDs and you don't want to sort it once more. Also you don't want to change your script at multiple locations when the sorting criteria are changed.).

SQL databases are the entirely different league. The tables are stored not in the same way as SELECT query results; SQL analyzer optimizes the queries, and the intermediate "join" results might not even exist. So imposing restrictions on the rows order would make things complicated and may introduce serious performance regressions. OTOH SQL "ORDER BY" benefits from the persistent multi-column indices (and if certain index does not exist, the "ORDER BY" performance may be abysmal). I suppose it is out of DataTables scope to implement optimized multi-joins and advanced indexing.

As for the NULL values and left/right types compatibility; unfortunately, at the time of the original PR my main goal was to work around very poor PooledDataArray performance, hence these strange manipulations with left/right indices. #17 doesn't work well with null/non-null columns. In particular, some right joins that are totally valid, would generate an error, because the compose_joined_table() would first try to fill the "on" columns using the not-completely-matching left table rows. I would like to take a look at it, but ATM I don't have a second test/dev system, and it's not very easy to temporary switch the working Julia configuration from DataFrames to DataTables.

Thanks for the explanation @alyst! Much of this confusion comes from me not having much experience with other languages. I'm pretty terrible with R so I'm usually unable to reproduce Julia examples in R for cross-referencing, and I don't think I can recall ever writing a SQL query in my life. I can write python code, so I compared this all against pandas, and of course the prior implementation had comments to say it was written while using pandas code as an example. Your comment of

Unfortunately, I don't know what is the behaviour of pandas, but neither I understand why DataTables should reproduce pandas behaviour and not, e.g., dplyr.

is completely valid. It's more of a "I'm not competent with other data analysis languages" than a favoritism for Python/Pandas issue. Thanks for your comments here and help on #17! I guess we'll move forward with this patch for now.

@nalimilan if we stick with the current unstack methods they still need some work. We have issues with type stability and ordering, both dependent on the inputs. Here are some examples documented with the desired return types for columns. I believe all 4 of these examples are supposed to have the same output, although the A and B variants will have different column types. The 4 variants each return a different order and column type.

julia> using DataTables, Base.Test
WARNING: Method definition ==(Base.Nullable{S}, Base.Nullable{T}) in module Base at nullable.jl:244 overwritten in module NullableArrays at /Users/Cameron/.julia/v0.6/NullableArrays/src/operators.jl:128.

julia> dtA = DataTable(Fish = CategoricalArray(["Bob", "Bob", "Batman", "Batman"]),
                       Key = ["Mass", "Color", "Mass", "Color"],
                       Value = ["12 g", "Red", "18 g", "Grey"])
4×3 DataTables.DataTable
│ Row │ Fish     │ Key   │ Value │
├─────┼──────────┼───────┼───────┤
│ 1   │ "Bob"    │ Mass  │ 12 g  │
│ 2   │ "Bob"    │ Color │ Red   │
│ 3   │ "Batman" │ Mass  │ 18 g  │
│ 4   │ "Batman" │ Color │ Grey  │

julia> # Check that reordering levels does not confuse unstack
       levels!(dtA[1], ["XXX", "Bob", "Batman"])
4-element CategoricalArrays.NullableCategoricalArray{String,1,UInt32}:
 "Bob"
 "Bob"
 "Batman"
 "Batman"

julia> dtB = DataTable(Fish = CategoricalArray(["Bob", "Bob", "Batman", "Batman"]),
                       Key = CategoricalArray(["Mass", "Color", "Mass", "Color"]),
                       Value = CategoricalArray(["12 g", "Red", "18 g", "Grey"]))
4×3 DataTables.DataTable
│ Row │ Fish     │ Key     │ Value  │
├─────┼──────────┼─────────┼────────┤
│ 1   │ "Bob"    │ "Mass"  │ "12 g" │
│ 2   │ "Bob"    │ "Color" │ "Red"  │
│ 3   │ "Batman" │ "Mass"  │ "18 g" │
│ 4   │ "Batman" │ "Color" │ "Grey" │

julia> dt2A = unstack(dtA, :Fish, :Key, :Value)
3×3 DataTables.DataTable
│ Row │ Fish   │ Color │ Mass  │
├─────┼────────┼───────┼───────┤
│ 1   │ XXX    │ #NULL │ #NULL │
│ 2   │ Bob    │ Red   │ 12 g  │
│ 3   │ Batman │ Grey  │ 18 g  │

julia> # should be NullableCategoricalArray, Nullable, Nullable
       eltypes(dt2A)
3-element Array{Type,1}:
 Nullable{String}
 Nullable{String}
 Nullable{String}

julia> dt2B = unstack(dtB, :Fish, :Key, :Value)
2×3 DataTables.DataTable
│ Row │ Fish   │ Color  │ Mass   │
├─────┼────────┼────────┼────────┤
│ 1   │ Batman │ "Grey" │ "18 g" │
│ 2   │ Bob    │ "Red"  │ "12 g" │

julia> # should be NullableCategoricalArray, NullableCategoricalArray, NullableCategoricalArray
       eltypes(dt2B)
3-element Array{Type,1}:
 Nullable{String}
 Nullable{CategoricalArrays.CategoricalValue{String,UInt32}}
 Nullable{CategoricalArrays.CategoricalValue{String,UInt32}}

julia> dt3A = unstack(dtA, :Key, :Value)
2×3 DataTables.DataTable
│ Row │ Fish     │ Color │ Mass │
├─────┼──────────┼───────┼──────┤
│ 1   │ "Bob"    │ Red   │ 12 g │
│ 2   │ "Batman" │ Grey  │ 18 g │

julia> # should be NullableCategoricalArray, Nullable, Nullable
       eltypes(dt3A)
3-element Array{Type,1}:
 Nullable{CategoricalArrays.CategoricalValue{String,UInt32}}
 Nullable{String}
 Nullable{String}

julia> dt3B = unstack(dtB, :Key, :Value)
2×3 DataTables.DataTable
│ Row │ Fish     │ Color  │ Mass   │
├─────┼──────────┼────────┼────────┤
│ 1   │ "Batman" │ "Grey" │ "18 g" │
│ 2   │ "Bob"    │ "Red"  │ "12 g" │

julia> # should be NullableCategoricalArray, NullableCategoricalArray, NullableCategoricalArray
       eltypes(dt3B)
3-element Array{Type,1}:
 Nullable{CategoricalArrays.CategoricalValue{String,UInt32}}
 Nullable{CategoricalArrays.CategoricalValue{String,UInt32}}
 Nullable{CategoricalArrays.CategoricalValue{String,UInt32}}

julia> dt = DataTable(id=1:2, variable=["a", "b"], value=3:4)
2×3 DataTables.DataTable
│ Row │ id │ variable │ value │
├─────┼────┼──────────┼───────┤
│ 1   │ 1  │ a        │ 3     │
│ 2   │ 2  │ b        │ 4     │

julia> a = unstack(dt, :id, :variable, :value)
2×3 DataTables.DataTable
│ Row │ id │ a     │ b     │
├─────┼────┼───────┼───────┤
│ 1   │ 1  │ 3     │ #NULL │
│ 2   │ 2  │ #NULL │ 4     │

julia> eltypes(a)
3-element Array{Type,1}:
 Nullable{Int64}
 Nullable{Int64}
 Nullable{Int64}

julia> b = unstack(dt, :variable, :value)
2×3 DataTables.DataTable
│ Row │ id │ a     │ b     │
├─────┼────┼───────┼───────┤
│ 1   │ 1  │ 3     │ #NULL │
│ 2   │ 2  │ #NULL │ 4     │

julia> eltypes(b)
3-element Array{Type,1}:
 Nullable{Int64}
 Nullable{Int64}
 Nullable{Int64}

Part of the issue is with the NullableCategoricalArray constructor, it doesn't respect the ordering of levels when taking in a CategoricalArray or a normal Array. Take this example from the tests above. If we stick with the current unstack then we'll have to manually reset the levels on each constructed NullableCategoricalArray so the ordering doesn't continue change.

julia> using NullableArrays, CategoricalArrays
WARNING: Method definition ==(Base.Nullable{S}, Base.Nullable{T}) in module Base at nullable.jl:244 overwritten in module NullableArrays at /Users/Cameron/.julia/v0.6/NullableArrays/src/operators.jl:128.

julia> fish = CategoricalArray(["Bob", "Bob", "Batman", "Batman"])
4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "Bob"
 "Bob"
 "Batman"
 "Batman"

julia> levels!(fish, ["XXX", "Bob", "Batman"])
4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "Bob"
 "Bob"
 "Batman"
 "Batman"

julia> levels(fish)
3-element Array{String,1}:
 "XXX"
 "Bob"
 "Batman"

julia> # taking the first few steps through unstack
       nullfish = NullableCategoricalArray(fish)
4-element CategoricalArrays.NullableCategoricalArray{String,1,UInt32}:
 "Bob"
 "Bob"
 "Batman"
 "Batman"

julia> levels(nullfish) # why did the levels change?
3-element Array{String,1}:
 "Batman"
 "Bob"
 "XXX"

julia> nullfish = NullableCategoricalArray(fish, ordered=true)
4-element CategoricalArrays.NullableCategoricalArray{String,1,UInt32}:
 "Bob"
 "Bob"
 "Batman"
 "Batman"

julia> levels(nullfish) # still reordered even with ordered=true
3-element Array{String,1}:
 "Batman"
 "Bob"
 "XXX"

julia> droplevels!(nullfish)
4-element CategoricalArrays.NullableCategoricalArray{String,1,UInt32}:
 "Bob"
 "Bob"
 "Batman"
 "Batman"

julia> levels(nullfish)
2-element Array{String,1}:
 "Batman"
 "Bob"

new vcat errors

julia> using DataTables, Base.Test
INFO: Recompiling stale cache file /Users/Cameron/.julia/lib/v0.6/DataTables.ji for module DataTables.
WARNING: Method definition ==(Base.Nullable{S}, Base.Nullable{T}) in module Base at nullable.jl:244 overwritten in module NullableArrays at /Users/Cameron/.julia/v0.6/NullableArrays/src/operators.jl:128.

WARNING: deprecated syntax "inner constructor DataTableJoiner(...) around /Users/Cameron/.julia/v0.6/DataTables/src/abstractdatatable/join.jl:27".
Use "DataTableJoiner{DT1,DT2}(...) where {DT1,DT2}" instead.
WARNING: Method definition ==(Base.Nullable{S}, Base.Nullable{T}) in module Base at nullable.jl:244 overwritten in module NullableArrays at /Users/Cameron/.julia/v0.6/NullableArrays/src/operators.jl:128.

julia> dt1 = DataTable(A = 1, B = 1)
1×2 DataTables.DataTable
│ Row │ A │ B │
├─────┼───┼───┤
│ 1   │ 1 │ 1 │

julia> vcat(dt1, dt1)
2×2 DataTables.DataTable
│ Row │ A │ B │
├─────┼───┼───┤
│ 1   │ 1 │ 1 │
│ 2   │ 1 │ 1 │

julia> dt2 = DataTable(B = 1, A = 1)
1×2 DataTables.DataTable
│ Row │ B │ A │
├─────┼───┼───┤
│ 1   │ 1 │ 1 │

julia> vcat(dt1, dt2)
ERROR: ArgumentError: column order of argument(s) (1) != column order of argument(s) (2)
Stacktrace:
 [1] vcat(::DataTables.DataTable, ::DataTables.DataTable) at /Users/Cameron/.julia/v0.6/DataTables/src/abstractdatatable/abstractdatatable.jl:765

julia> vcat(dt1, dt1, dt1, dt1, dt2, dt2, dt2, dt2)
ERROR: ArgumentError: column order of argument(s) (1, 2, 3, 4) != column order of argument(s) (5, 6, 7, 8)
Stacktrace:
 [1] vcat(::DataTables.DataTable, ::DataTables.DataTable, ::DataTables.DataTable, ::Vararg{DataTables.DataTable,N} where N) at /Users/Cameron/.julia/v0.6/DataTables/src/abstractdatatable/abstractdatatable.jl:765

julia> vcat(dt2, dt2)
2×2 DataTables.DataTable
│ Row │ B │ A │
├─────┼───┼───┤
│ 1   │ 1 │ 1 │
│ 2   │ 1 │ 1 │

julia> dt3 = DataTable(A = 1, B = 1, C = 1)
1×3 DataTables.DataTable
│ Row │ A │ B │ C │
├─────┼───┼───┼───┤
│ 1   │ 1 │ 1 │ 1 │

julia> vcat(dt1, dt3)
ERROR: ArgumentError: column(s) (C) are missing from argument(s) (1)
Stacktrace:
 [1] vcat(::DataTables.DataTable, ::DataTables.DataTable) at /Users/Cameron/.julia/v0.6/DataTables/src/abstractdatatable/abstractdatatable.jl:757

julia> vcat(dt2, dt3)
ERROR: ArgumentError: column(s) (C) are missing from argument(s) (1)
Stacktrace:
 [1] vcat(::DataTables.DataTable, ::DataTables.DataTable) at /Users/Cameron/.julia/v0.6/DataTables/src/abstractdatatable/abstractdatatable.jl:757

julia> dt4 = DataTable(A = 1, B = 1, C = 1, D = 1)
1×4 DataTables.DataTable
│ Row │ A │ B │ C │ D │
├─────┼───┼───┼───┼───┤
│ 1   │ 1 │ 1 │ 1 │ 1 │

julia> vcat(dt1, dt4)
ERROR: ArgumentError: column(s) (C, D) are missing from argument(s) (1)
Stacktrace:
 [1] vcat(::DataTables.DataTable, ::DataTables.DataTable) at /Users/Cameron/.julia/v0.6/DataTables/src/abstractdatatable/abstractdatatable.jl:757

julia> vcat(dt2, dt4)
ERROR: ArgumentError: column(s) (C, D) are missing from argument(s) (1)
Stacktrace:
 [1] vcat(::DataTables.DataTable, ::DataTables.DataTable) at /Users/Cameron/.julia/v0.6/DataTables/src/abstractdatatable/abstractdatatable.jl:757

julia> vcat(dt3, dt4)
ERROR: ArgumentError: column(s) (D) are missing from argument(s) (1)
Stacktrace:
 [1] vcat(::DataTables.DataTable, ::DataTables.DataTable) at /Users/Cameron/.julia/v0.6/DataTables/src/abstractdatatable/abstractdatatable.jl:757

julia> dt5 = hcat(dt4, dt4, dt4, dt4)
1×16 DataTables.DataTable
│ Row │ A │ B │ C │ D │ A_1 │ B_1 │ C_1 │ D_1 │ A_2 │ B_2 │ C_2 │ D_2 │ A_3 │ B_3 │ C_3 │ D_3 │
├─────┼───┼───┼───┼───┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┤
│ 1   │ 1 │ 1 │ 1 │ 1 │ 1   │ 1   │ 1   │ 1   │ 1   │ 1   │ 1   │ 1   │ 1   │ 1   │ 1   │ 1   │

julia> vcat(dt3, dt5)
ERROR: ArgumentError: column(s) (D, A_1, B_1, C_1, D_1, A_2, B_2, C_2, D_2, A_3, B_3, C_3, D_3) are missing from argument(s) (1)
Stacktrace:
 [1] vcat(::DataTables.DataTable, ::DataTables.DataTable) at /Users/Cameron/.julia/v0.6/DataTables/src/abstractdatatable/abstractdatatable.jl:757

julia> dt5r = names!(copy(dt5), reverse(names(dt5)))
1×16 DataTables.DataTable
│ Row │ D_3 │ C_3 │ B_3 │ A_3 │ D_2 │ C_2 │ B_2 │ A_2 │ D_1 │ C_1 │ B_1 │ A_1 │ D │ C │ B │ A │
├─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼───┼───┼───┼───┤
│ 1   │ 1   │ 1   │ 1   │ 1   │ 1   │ 1   │ 1   │ 1   │ 1   │ 1   │ 1   │ 1   │ 1 │ 1 │ 1 │ 1 │

julia> vcat(dt5, dt5r)
ERROR: ArgumentError: column order of argument(s) (1) != column order of argument(s) (2)
Stacktrace:
 [1] vcat(::DataTables.DataTable, ::DataTables.DataTable) at /Users/Cameron/.julia/v0.6/DataTables/src/abstractdatatable/abstractdatatable.jl:765

Good catch about stack/unstack, but AFAICT that's unrelated to this PR. Please transform your comment into an issue so that we keep track of this.

Fixing the ordering isn't necessarily required by this PR, but with the PR title of "Stop auto-promoting column-types", I think the other unstack changes are justified as unstack will now preserve proper column types? This is why I had tried to re-write unstack in the first place.

edit: And if we don't fix the ordering then I'll need to duplicate the tests for unstack. I'd like to keep the ordering...

~CategoricalArrays issue has been filed https://github.com/JuliaData/CategoricalArrays.jl/issues/62~ fixed
vcat errors are extensively tested, both for error type and error message
joins are fully type stabilized
unstack is type stabilized ~and order stabilized except for when columns are NullableCategoricalArrays, but that will fix itself when https://github.com/JuliaData/CategoricalArrays.jl/issues/62 is fixed (and we'll know it's fixed because some of the unstack tests here will fail, and it will be a rare moment when we are excited for failures!)~
everything else unrelated has been reverted (except for incorrect spacing and compat removals, which I see no reason to revert).
I don't see any more issues when reviewing the diff by eye.
Let's ignore the coverage drop? I've added many tests here and caught dozens of behavioral bugs that are now corrected. We're also still up 2% on DataFrames and I'll increase coverage a bit more with https://github.com/JuliaData/DataTables.jl/pull/31.
constructors are all consolidated in behavior and return the correct column type, and recycle properly.
right join errors are patched

pending no major issues during reviews, I think this is ready to merge.

That sounds like a great plan, @nalimilan, I can definitely do that!

I think the logical units present in this PR that are necessary to stop autopromotion or otherwise assist with the transition away from autopromotion are:

[ ] converters - nullify, nullify!, denullify, denullify!
[x] join - add similar_nullable function for NullableCategoricalArrays, type-stabilize joins, always return fully nullable datatables for outer joins, and patch right outer join bug https://github.com/JuliaData/DataTables.jl/pull/44
[x] vcat - update vcat promotion rules, remove assumptions on how to concatenate columns that are out of order or only present in some of the tables in favor of stricter behavior that mirrors Base.vcat with detailed error messages. (use the proposed @generated function to handle the cases where Base.vcat fails when mixing categorical arrays and non categorical arrays) https://github.com/JuliaData/DataTables.jl/pull/45
[x] io - minor updates to abstractdatatable/io.jl (indentation and nullable handling) https://github.com/JuliaData/DataTables.jl/pull/42
[ ] hcat - condense that group of hcat functions to the single function needed and fix that one untested hcat that has paramters dt, dt2 but calls dt1, dt2 https://github.com/JuliaData/DataTables.jl/pull/43 & https://github.com/JuliaData/DataTables.jl/pull/31
[ ] reshape w/ copies - type stabilize stack/unstack https://github.com/JuliaData/DataTables.jl/pull/54
[ ] reshape w/ views - update indexing in stackdt/meltdt to handle all expected indexing strategies (and explore whether reusing existing options like subarrays/arrayviews would be helpful for this or not. It might be easier to just add boolean and bitarray indexing than to swap out the backend)
[ ] reshape bonus? - choose either melt or unstack as the name we want and deprecate the other?
[x] DataTable constructor - and last but not least, consolidate data tables constructors, remove autopromotion of values, perform recycling, and extend tests to cover nullable arrays, null free arrays, and mixtures of the two https://github.com/JuliaData/DataTables.jl/pull/53

I agree with all of your specific code comments/reviews as well, so I'll incorporate those where appropriate. Thanks for the feedback, and please don't feel the need to apologize for any delays! I'm sure you're busy enough before you start volunteering your time to maintain and develop this package, and DataTables isn't even the only package you maintain and develop. @nalimilan and @ararslan both deserve 🥇🥇s, as do the maintainers of many other packages and the language itself.

I'll leave this PR open as a reference, and now that we're going to refactor this, I'll update https://github.com/JuliaData/DataTables.jl/pull/31 to not be dependant on this PR and try and get those changes merged soon too.

JuliaData / DataTables.jl

Stop auto-promoting column-types #30