JuliaData / TypedTables.jl

Simple, fast, column-based storage for data analysis in Julia
Other
147 stars 25 forks source link

`Table()` throws - implement zero column tables #55

Open tkf opened 4 years ago

tkf commented 4 years ago

I tried to create an empty table by Table() but it failed at

https://github.com/JuliaData/TypedTables.jl/blob/fc1247a590be301341d172d447f706347bed727e/src/TypedTables.jl#L23

because n is undefined.

My motivation is to use this with push!! so that

push!!(Table(), (a=1, b=2)) == Table(a=[1], b=[2])

This, in turn, can be used to define "collect" with the mutate-or-widen idiom. This came up in https://github.com/tkf/Transducers.jl/issues/73#issuecomment-548925178

FYI, it would have caught by Test.detect_unbound_args:

julia> using Test

julia> detect_unbound_args(TypedTables)
[1] _ndims(::Type{#s14} where #s14<:Tuple{Vararg{AbstractArray{#s13,n} where #s13,N} where N}) where n in TypedTables at /home/takafumi/.julia/packages/TypedTables/ZF2b3/src/TypedTables.jl:23

Why not run @test detect_unbound_args(TypedTables) == [] in the test? See also https://github.com/tkf/Aqua.jl

andyferris commented 4 years ago

Thanks @tkf! Not a bad idea.

Yes, this is a known issue. It's actually slightly more complicated than you'd think, since a zero-column table should be able to iterate an arbitray number of empty named-tuples. Just setting it to "empty" isn't a good option (it would anger the "Third Manifesto" gods, for one).

I'm thinking the constructor should check the arrays are the same length and we should store the keys directly in the Table. But then we'd have to do a bunch of work to implement push! etc correctly, and we'd have to use mutable struct (currently a Table full of SVectors can reside on the stack).

tkf commented 4 years ago

Naively thinking, I thought Table{NamedTuple{(), Tuple{}}} would be equivalent eto Vector{Union{}}. But maybe it should just be Table{Union{}}? I figured out that

Table{Union{}, 1, NamedTuple{(), Tuple{}}}(NamedTuple())

works (although show on it throws). So, how about defining

Table{Union{}, N}() where N = Table{Union{}, N, NamedTuple{(), Tuple{}}}(NamedTuple())

? I think this matches with the idea that "the constructor should check the arrays are the same length" since AbstractArray{Union{}} can only have length/size 0.

It's actually slightly more complicated than you'd think, since a zero-column table should be able to iterate an arbitray number of empty named-tuples.

Is it a relational algebra thing? It somewhat resembles the empty intersection/vacuous truth but I can't see the connection.

andyferris commented 4 years ago

I thought Table{NamedTuple{(), Tuple{}}} would be equivalent to Vector{Union{}}

It is, exactly right, Sorry I misread this. They are similar. Both can have a length greater than zero. I think Julia is clever enough not to attach a memory buffer to a Union{} array.

julia> Vector{Union{}}(undef, 3)
3-element Array{Union{},1}:
 #undef
 #undef
 #undef

You can't do much more with it except resize! it... Internally Set{T} uses Dict{T, Union{}} and there a Vector{Union{}} pops up; not sure who/what else uses them in practice.

Maybe we want a Table{Union{}, 1, NamedTuple{(), Tuple{<:AbstracArray{Union{}}}} Table{NamedTuple{(),Tuple{}}, N, <:AbstracArray{Union{}, N}}? Widen the type restrictions in the struct and specialize on this as necessary?

Is it a relational algebra thing? It somewhat resembles the empty intersection/vacuous truth but I can't see the connection.

Kind of, I guess... there's an isomorphism between the number of things in a bag and natural numbers. For set theory, Set{Union{}} could potentially have zero or one element (not sure if this one actually works in Julia?), which you might think of as isomorphic to false and true. The Third Manisto crowd believe this is important, and it's kind of useful to "join" such a relation on a second relation (returning either an empty relation or the second relation - so it functions a bit like an if statement kind of thing which is why they believe it's useful to create a useful SQL replacement where you can do proper logic).

andyferris commented 4 years ago

Maybe we should avoid Union{}.

julia> struct B
           a::Union{}
           b::Union{}
           B() = new()
       end

julia> B()
B(#undef, #undef)

julia> sizeof(B())
16

julia> struct C
           a::Nothing
           b::Nothing
           C() = new()
       end

julia> C()
C(nothing, nothing)

julia> sizeof(C())
0

This could pessimize a zero-column typed table of static arrays. A corner case, to be sure, but...

I'm now thinking the zero column form could be Table{NamedTuple{(),Tuple{}}, N, <:AbstracArray{NamedTuple{(),Tuple{}}, N}}.

The method specializations of that make a fascinating form, where a Table might either be row or column storage oriented. Interesting... wonder if we could go all the way and allow the same for one-or-more columns? (Would be a good way of getting Table printing for Vector{<:NamedTuple} without making a copy).

tkf commented 4 years ago

Oh, I forgot about Vector{Union{}}(undef, 3). Thanks for pointing it out.

Internally Set{T} uses Dict{T, Union{}} and there a Vector{Union{}} pops up; not sure who/what else uses them in practice.

Actually, I do. In Transducers.jl, collect(xf, xs) is defined as foldl(push!!, xf, xs, init=Union{}[]). The transducer xf part is not actually important here and what I'm doing is essentially

foldl(push!!, xs, init=Union{}[])

This implements mutate-or-widen strategy because push!!(ys, x) first tries to mutate if promote_type(x, eltype(ys)) <: eltype(ys) and falls back to vcat(ys, [x]) (roughly speaking).

Writing this, I just realized that how vcat behaves with empty Table is also important. That is to say, I also need

vcat(empty_table, Table(a=[1])) == Table(a=[1])

(or something similar as a wrapper function of vcat, if you don't want to have it in TypedTables.jl). I find special-casing eltype(empty_table) == Union{} is much nicer than special-casing eltype(empty_table) == NamedTuple{(), Tuple{}} as the latter is incompatible with throwing in vcat(a, b) if fieldnames(eltype(a)) != fieldnames(eltype(b)).

Table{NamedTuple{(),Tuple{}}, N, <:AbstracArray{Union{}, N}}

I think I want Table{Union{}, ...}. This is because

julia> promote_type(Union{}, typeof((a=1,)))
NamedTuple{(:a,),Tuple{Int64}}

julia> promote_type(typeof(NamedTuple()), typeof((a=1,)))
NamedTuple

so, the logic inside push!! function and friends become cleaner if there is a way to create an "empty table" with Union{} element type.

I'm now thinking the zero column form could be Table{NamedTuple{(),Tuple{}}, N, <:AbstracArray{NamedTuple{(),Tuple{}}, N}}.

Is <:AbstracArray for specifying the storage type? Does it mean empty table can only be row-oriented?

As for the analysis of Union{} field values, isn't it another reason for using Table{Union{}, 1, NamedTuple{(), Tuple{}}}(NamedTuple())? It's a singleton because .data is an empty NamedTuple

tkf commented 4 years ago

Reading a bit about relational algebra (BTW, thanks for the remark on joining with the empty relation), I'm more confident that I need Table{Union{}}, not Table{NamedTuple{()}} (at least within the framework I'm using ATM; but see the next paragraph). What I need is a table with undefined set of attributes, not a table with empty set of attributes. This is because I'd like to use it as the "initial value" of vcat. When treating vcat on tables as an implementation of the set union of the relations that are union-compatible, vcat(::Table{NamedTuple{()}}, ::Table{NamedTuple{(:a, :b)}}) should throw since they have different set of attributes.

Having said that, I think a bigger question is: what should be in the minimal sink-oriented tables interface? I've been thinking it would be:

  1. Empty container constructor: Vector{Union{}}, Table{Union{}}.
  2. Concatenation: vcat

Instead, maybe we should have the singleton container constructor like x -> [x] and x -> Table([x]), to work nicely with relational algebra. This would avoid us to introduce an uneasy concept like a table with undefined set of attributes.

tkf commented 4 years ago

I think I explained my implementation details too much. A simple question is: how do you write a function for materializing a Table from an arbitrary container (with possibly unknown element type)? You need to handle empty containers. What would you return in this case?

andyferris commented 4 years ago

Sorry, @tkf, I've been following along but haven't been able to reply yet.

A simple question is: how do you write a function for materializing a Table from an arbitrary container (with possibly unknown element type)?

That is a great question. At the moment here we are relying on the element type being known, and similar working out (through map, for example). That's because for simplicity I often treat Julia like a statically-typed language ("typed tables", after all) - because it's easy to get up and running and performance is pretty deterministic. But in general, we all need tools like push!! to get things working better in the dynamic world.

It's worth keeping in mind that there's no part of the AbstractArray iterface that requires every container "class" to support every element type. A concrete container only has to support a single eltype. Something similar to that should support setindex! but we haven't really codified what other traits it might have in common, or what traits vcat should preserve (also keep in mind Tables might be matrices, etc, so perhaps setindex_widen!! type approach could be more appropriate in some circumstances). In this particular case, what should happen when you push!! an Int onto a Table{Union{}}?

But yes, in general I can see we should support both zero-element Table{Union{}}s as well as n-element Table{NamedTuple{(), Tuple{}}}, and I can see a path towards that now (the latter by allowing a single AbstractVector "row table" in the data slot, possibly the former the same way or simply by providing enough overloads for that case that data is never actually accessed).

I think a bigger question is: what should be in the minimal sink-oriented tables interface

Let me ask a counter-question: what is the miminal sink-oriented array interface?

tkf commented 4 years ago

what is the miminal sink-oriented array interface?

Let me first answer a related simpler question: what is the miminal sink-oriented collection interface. So I'll forget about arrays with ndims != 1 and consider sets and dictionaries as well.

Approach 1: (singletonof, append!!)

I think one possible minimal set of API is:

  1. Singleton constructor singletonof ((type, element) -> collection): given an element, create a collection with it. For example, singletonof(::Type{Vector}, element) = [element].
  2. Mutate-or-widen version of append!; i.e., append!! as implemented in BangBang.jl. Another possible name is grow_to!, based on Base's internal function.

Once a collection type T implements those API, it can be constructed from arbitrary iterators (possibility of unknown/unstable element type):

mapfoldl(x -> singletonof(T, x), append!!, xs)

This code snippet itself is not really useful, since T(xs) would work most of the time. However, this is useful for constructing a collection in a hand-written loop.

Approach 2: (empty, push!!)

An alternative API is:

  1. Empty table constructor empty(::Type{T}) -> collection s.t. eltype(collection) == Union{}.
  2. Mutate-or-widen version of push!: push!!.

Note that using push!! instead of push! is crucial as otherwise you can't do anything with the collection returned by empty(T). Likewise, the choice of eltype(collection) == Union{} (instead of, say, Any) is important because this is the only type that can be widen to arbitrary types.

Comparison

Two APIs are isomorphic in the sense that the first API can be implemented in terms of the second API:

singletonof(T, row) = push!!(empty(T), row)
append!!(a, b) = foldl(push!!, b; init=a)

and vice versa:

struct Empty{T} end

empty(T) = Empty{T}()
push!!(::Empty{T}, x) = singletonof(T, x)
push!!(xs::T, x) where T = append!!(xs, singletonof(T, x))

(ATM, I chose to implement singletonof and then define Empty in terms of it.)

I think the second approach (empty, push!!) is better because you know what to return when the result is empty.

One argument for approach 1 (singletonof, append!!) may be that it is better to define append! for performance reason, especially for column-oriented tables. But it is easy to define push!! in terms of append!! if you already know a type that implements the singleton constructor; e.g., StaticArray:

push!!(xs::T, x) = append!!(xs, SA[x])

Also, it's not crazy to implement performance specialization for append!! even for approach 2 anyway.

Miminal sink-oriented array interface

For multi-dimensional arrays, as you mentioned, I think setindex_widen!! approach is the way to go. I'll just call it setindex!! as !! in the convention I use in BangBang already means "fallback to widening". So I'd suggest

  1. similar(T, eltype::Type{Union{}}, dims)
  2. setindex!!

I think it's OK to treat multi-dimensional arrays separately because you don't typically expect a multi-dimensional arrays to be re-resizable.

quinnj commented 4 years ago

Not sure if it's of interest, but there is code in Tables.jl that essentially builds up a "Table" (more specifically, a NamedTuple of Vectors) from an arbitrary property-accessible iterator, with or without known length (https://github.com/JuliaData/Tables.jl/blob/master/src/fallbacks.jl#L146). We essentially start each column out as Union{}[] (or Vector{Union{}}(undef, len)), and widen the column Vector as values are iterated. It's been pretty heavily optimized at this point and is just a hair slower than when you have a fully unrolled, known length set of values to put into a table.

tkf commented 4 years ago

@quinnj Hi, thanks for the comment. I know people implement widening-based collect-like functions in various places (Tables.jl, StructArrays.jl, ...) but it is rather sad that there is no canonical API for this. It would be great if there is a base package (CollectionBase?) that can formalize the functions like this. I was initially thinking to send an RFC PR/issue to Tables.jl or DataAPI.jl for above approaches but then realized that it's more general than tables; e.g., it makes sense to use the above API with dictionaries and sets.

andyferris commented 4 years ago

Regarding your two options - we need to support the case where the output container is empty with full type predictability. Are you falling back on inference whenever xs is empty?

but then realized that it's more general than tables; e.g., it makes sense to use the above API with dictionaries and sets.

Yes, which is precisely why I was asking my question :) In this case, for Dictionaries.jl, where I am trying to have proper interfaces for insertability, mutability, immutability, etc, and the next problem to solve is "factories" - help welcome on interface ideas.. (After that, Jacob, I want to make a dictionary-like table, as opposed to these array-like tables, but they need to iterate rows not key-value pairs for it to be a Tables.jl table, hence the need for a new dictionaries interface).

I've found there's a class of containers where similar and (key-preserving) setindex_widen!! are best, and there are situations where it might be better to insert elements.

tkf commented 4 years ago

I've found there's a class of containers where similar and (key-preserving) setindex_widen!! are best, and there are situations where it might be better to insert elements.

I agree. We can probably even say that the class of container is defined by the minimal requirements (as a sink). Something like:

tkf commented 4 years ago

By the way, I decided to go with "fake/lazy empty table" (Empty{T}) route in Transducers.jl + BangBang.jl: https://tkf.github.io/Transducers.jl/dev/manual/#Base.copy. So, this is not strictly a bottleneck for me, at least for now. The reasoning was that it is not practical for me to open an issue like this for each table/collection package and try to convince package authors to support Union{} eltype (which is normally useless).

andyferris commented 4 years ago

Ok. So what do you do when the output really is empty?

tkf commented 4 years ago

copy(xf::Transducer, T::Type, foldable) :: Union{T, Empty{T}} returns Empty(T) if the transducer does not produce anything. This is because there is no consistent interface to create an empty container given its type and not all containers support creating an empty container.

Empty(T) works like a T as long as you do duck typing; e.g., using generic !! functions and iterating over it. However, you still can't use function specific to T. This is exactly the reason why I suggested to implement empty(::Type{ContainerType}).

This is related to your comment in https://github.com/JuliaData/TypedTables.jl/pull/56#discussion_r344440013:

Oh you want to use types not instances? I’ve been heavily trending the other direction lately.

I understand that that not computing type by hand is the best practice (the worst example being return_type). (This is the reason why I'm trying to make mutate-or-widen approach work.)

But, I believe a container with Union{} eltype is special. This is the only instance that can be widen to anything else. To motivate this more, notice that empty(T) is an identity element of append! (when forget about mutation). So, we need empty(T) to treat append! as a monoid and this is a crucial thing to have for reduce.

(The reason why I could still implement something like copy(xf, T, foldable) is that you can create a monoid out of a semigroup by adjoining an identity element --- Empty(T). But, as I mentioned above, returning Empty(T) instead of a T has a usability issue when you are doing something specific to T.)

Moelf commented 3 years ago

on a related note, ndims for DataFrames are 2 while for TypedTables it's 1. Should they be the same?

andyferris commented 3 years ago

on a related note, ndims for DataFrames are 2 while for TypedTables it's 1.

Kind of. In TypedTables a Table is an AbstractArray of rows. Usually it is a 1D vector with a single index, but you can e.g. take the Cartesian join of two tables and end up with a 2D matrix (of rows) and thus create a Table with ndims equal 2.

The difference with dataframes is that a Table is always a nested container (like a vector of vectors) while for indexing a DataFrame behaves a little bit like a matrix, and AFAICT ndims is roughly meant to be the number of indices provided to getindex.

Should they be the same?

I'm not sure - here (and in Dictionaries.jl) I've tried to keep indexing and iteration semantics in line with each other. The DataFrames approach is more pragmatic and practical and I wouldn't argue that ndims(::DataFrame) = 2 is incorrect, since that matches how indexing works. (Though I am not sure if any generic code could actually make use of that fact - perhaps the REPL autocompletion?).