Open tkf opened 4 years ago
Thanks @tkf! Not a bad idea.
Yes, this is a known issue. It's actually slightly more complicated than you'd think, since a zero-column table should be able to iterate an arbitray number of empty named-tuples. Just setting it to "empty" isn't a good option (it would anger the "Third Manifesto" gods, for one).
I'm thinking the constructor should check the arrays are the same length and we should store the keys
directly in the Table
. But then we'd have to do a bunch of work to implement push!
etc correctly, and we'd have to use mutable struct
(currently a Table
full of SVector
s can reside on the stack).
Naively thinking, I thought Table{NamedTuple{(), Tuple{}}}
would be equivalent eto Vector{Union{}}
. But maybe it should just be Table{Union{}}
? I figured out that
Table{Union{}, 1, NamedTuple{(), Tuple{}}}(NamedTuple())
works (although show
on it throws). So, how about defining
Table{Union{}, N}() where N = Table{Union{}, N, NamedTuple{(), Tuple{}}}(NamedTuple())
? I think this matches with the idea that "the constructor should check the arrays are the same length" since AbstractArray{Union{}}
can only have length/size 0.
It's actually slightly more complicated than you'd think, since a zero-column table should be able to iterate an arbitray number of empty named-tuples.
Is it a relational algebra thing? It somewhat resembles the empty intersection/vacuous truth but I can't see the connection.
I thought
Table{NamedTuple{(), Tuple{}}}
would be equivalent toVector{Union{}}
It is, exactly right, Sorry I misread this. They are similar. Both can have a length greater than zero. I think Julia is clever enough not to attach a memory buffer to a Union{}
array.
julia> Vector{Union{}}(undef, 3)
3-element Array{Union{},1}:
#undef
#undef
#undef
You can't do much more with it except resize!
it... Internally Set{T}
uses Dict{T, Union{}}
and there a Vector{Union{}}
pops up; not sure who/what else uses them in practice.
Maybe we want a Table{Union{}, 1, NamedTuple{(), Tuple{<:AbstracArray{Union{}}}}
Table{NamedTuple{(),Tuple{}}, N, <:AbstracArray{Union{}, N}}
? Widen the type restrictions in the struct
and specialize on this as necessary?
Is it a relational algebra thing? It somewhat resembles the empty intersection/vacuous truth but I can't see the connection.
Kind of, I guess... there's an isomorphism between the number of things in a bag and natural numbers. For set theory, Set{Union{}}
could potentially have zero or one element (not sure if this one actually works in Julia?), which you might think of as isomorphic to false
and true
. The Third Manisto crowd believe this is important, and it's kind of useful to "join" such a relation on a second relation (returning either an empty relation or the second relation - so it functions a bit like an if
statement kind of thing which is why they believe it's useful to create a useful SQL replacement where you can do proper logic).
Maybe we should avoid Union{}
.
julia> struct B
a::Union{}
b::Union{}
B() = new()
end
julia> B()
B(#undef, #undef)
julia> sizeof(B())
16
julia> struct C
a::Nothing
b::Nothing
C() = new()
end
julia> C()
C(nothing, nothing)
julia> sizeof(C())
0
This could pessimize a zero-column typed table of static arrays. A corner case, to be sure, but...
I'm now thinking the zero column form could be Table{NamedTuple{(),Tuple{}}, N, <:AbstracArray{NamedTuple{(),Tuple{}}, N}}
.
The method specializations of that make a fascinating form, where a Table
might either be row or column storage oriented. Interesting... wonder if we could go all the way and allow the same for one-or-more columns? (Would be a good way of getting Table
printing for Vector{<:NamedTuple}
without making a copy).
Oh, I forgot about Vector{Union{}}(undef, 3)
. Thanks for pointing it out.
Internally
Set{T}
usesDict{T, Union{}}
and there aVector{Union{}}
pops up; not sure who/what else uses them in practice.
Actually, I do. In Transducers.jl, collect(xf, xs)
is defined as foldl(push!!, xf, xs, init=Union{}[])
. The transducer xf
part is not actually important here and what I'm doing is essentially
foldl(push!!, xs, init=Union{}[])
This implements mutate-or-widen strategy because push!!(ys, x)
first tries to mutate if promote_type(x, eltype(ys)) <: eltype(ys)
and falls back to vcat(ys, [x])
(roughly speaking).
Writing this, I just realized that how vcat
behaves with empty Table
is also important. That is to say, I also need
vcat(empty_table, Table(a=[1])) == Table(a=[1])
(or something similar as a wrapper function of vcat
, if you don't want to have it in TypedTables.jl). I find special-casing eltype(empty_table) == Union{}
is much nicer than special-casing eltype(empty_table) == NamedTuple{(), Tuple{}}
as the latter is incompatible with throwing in vcat(a, b)
if fieldnames(eltype(a)) != fieldnames(eltype(b))
.
Table{NamedTuple{(),Tuple{}}, N, <:AbstracArray{Union{}, N}}
I think I want Table{Union{}, ...}
. This is because
julia> promote_type(Union{}, typeof((a=1,)))
NamedTuple{(:a,),Tuple{Int64}}
julia> promote_type(typeof(NamedTuple()), typeof((a=1,)))
NamedTuple
so, the logic inside push!!
function and friends become cleaner if there is a way to create an "empty table" with Union{}
element type.
I'm now thinking the zero column form could be
Table{NamedTuple{(),Tuple{}}, N, <:AbstracArray{NamedTuple{(),Tuple{}}, N}}
.
Is <:AbstracArray
for specifying the storage type? Does it mean empty table can only be row-oriented?
As for the analysis of Union{}
field values, isn't it another reason for using Table{Union{}, 1, NamedTuple{(), Tuple{}}}(NamedTuple())
? It's a singleton because .data
is an empty NamedTuple
Reading a bit about relational algebra (BTW, thanks for the remark on joining with the empty relation), I'm more confident that I need Table{Union{}}
, not Table{NamedTuple{()}}
(at least within the framework I'm using ATM; but see the next paragraph). What I need is a table with undefined set of attributes, not a table with empty set of attributes. This is because I'd like to use it as the "initial value" of vcat
. When treating vcat
on tables as an implementation of the set union of the relations that are union-compatible, vcat(::Table{NamedTuple{()}}, ::Table{NamedTuple{(:a, :b)}})
should throw since they have different set of attributes.
Having said that, I think a bigger question is: what should be in the minimal sink-oriented tables interface? I've been thinking it would be:
Vector{Union{}}
, Table{Union{}}
.vcat
Instead, maybe we should have the singleton container constructor like x -> [x]
and x -> Table([x])
, to work nicely with relational algebra. This would avoid us to introduce an uneasy concept like a table with undefined set of attributes.
I think I explained my implementation details too much. A simple question is: how do you write a function for materializing a Table
from an arbitrary container (with possibly unknown element type)? You need to handle empty containers. What would you return in this case?
Sorry, @tkf, I've been following along but haven't been able to reply yet.
A simple question is: how do you write a function for materializing a Table from an arbitrary container (with possibly unknown element type)?
That is a great question. At the moment here we are relying on the element type being known, and similar
working out (through map
, for example). That's because for simplicity I often treat Julia like a statically-typed language ("typed tables", after all) - because it's easy to get up and running and performance is pretty deterministic. But in general, we all need tools like push!!
to get things working better in the dynamic world.
It's worth keeping in mind that there's no part of the AbstractArray
iterface that requires every container "class" to support every element type. A concrete container only has to support a single eltype
. Something similar
to that should support setindex!
but we haven't really codified what other traits it might have in common, or what traits vcat
should preserve (also keep in mind Table
s might be matrices, etc, so perhaps setindex_widen!!
type approach could be more appropriate in some circumstances). In this particular case, what should happen when you push!!
an Int
onto a Table{Union{}}
?
But yes, in general I can see we should support both zero-element Table{Union{}}
s as well as n
-element Table{NamedTuple{(), Tuple{}}}
, and I can see a path towards that now (the latter by allowing a single AbstractVector
"row table" in the data
slot, possibly the former the same way or simply by providing enough overloads for that case that data
is never actually accessed).
I think a bigger question is: what should be in the minimal sink-oriented tables interface
Let me ask a counter-question: what is the miminal sink-oriented array interface?
what is the miminal sink-oriented array interface?
Let me first answer a related simpler question: what is the miminal sink-oriented collection interface. So I'll forget about arrays with ndims != 1
and consider sets and dictionaries as well.
(singletonof, append!!)
I think one possible minimal set of API is:
singletonof
((type, element) -> collection
): given an element, create a collection with it. For example, singletonof(::Type{Vector}, element) = [element]
.append!
; i.e., append!!
as implemented in BangBang.jl. Another possible name is grow_to!
, based on Base
's internal function.Once a collection type T
implements those API, it can be constructed from arbitrary iterators (possibility of unknown/unstable element type):
mapfoldl(x -> singletonof(T, x), append!!, xs)
This code snippet itself is not really useful, since T(xs)
would work most of the time. However, this is useful for constructing a collection in a hand-written loop.
(empty, push!!)
An alternative API is:
empty(::Type{T}) -> collection
s.t. eltype(collection) == Union{}
.push!
: push!!
.Note that using push!!
instead of push!
is crucial as otherwise you can't do anything with the collection returned by empty(T)
. Likewise, the choice of eltype(collection) == Union{}
(instead of, say, Any
) is important because this is the only type that can be widen to arbitrary types.
Two APIs are isomorphic in the sense that the first API can be implemented in terms of the second API:
singletonof(T, row) = push!!(empty(T), row)
append!!(a, b) = foldl(push!!, b; init=a)
and vice versa:
struct Empty{T} end
empty(T) = Empty{T}()
push!!(::Empty{T}, x) = singletonof(T, x)
push!!(xs::T, x) where T = append!!(xs, singletonof(T, x))
(ATM, I chose to implement singletonof
and then define Empty
in terms of it.)
I think the second approach (empty, push!!)
is better because you know what to return when the result is empty.
One argument for approach 1 (singletonof, append!!)
may be that it is better to define append!
for performance reason, especially for column-oriented tables. But it is easy to define push!!
in terms of append!!
if you already know a type that implements the singleton constructor; e.g., StaticArray:
push!!(xs::T, x) = append!!(xs, SA[x])
Also, it's not crazy to implement performance specialization for append!!
even for approach 2 anyway.
For multi-dimensional arrays, as you mentioned, I think setindex_widen!!
approach is the way to go. I'll just call it setindex!!
as !!
in the convention I use in BangBang
already means "fallback to widening". So I'd suggest
similar(T, eltype::Type{Union{}}, dims)
setindex!!
I think it's OK to treat multi-dimensional arrays separately because you don't typically expect a multi-dimensional arrays to be re-resizable.
Not sure if it's of interest, but there is code in Tables.jl that essentially builds up a "Table" (more specifically, a NamedTuple of Vectors) from an arbitrary property-accessible iterator, with or without known length (https://github.com/JuliaData/Tables.jl/blob/master/src/fallbacks.jl#L146). We essentially start each column out as Union{}[]
(or Vector{Union{}}(undef, len)
), and widen the column Vector
as values are iterated. It's been pretty heavily optimized at this point and is just a hair slower than when you have a fully unrolled, known length set of values to put into a table.
@quinnj Hi, thanks for the comment. I know people implement widening-based collect-like functions in various places (Tables.jl, StructArrays.jl, ...) but it is rather sad that there is no canonical API for this. It would be great if there is a base package (CollectionBase
?) that can formalize the functions like this. I was initially thinking to send an RFC PR/issue to Tables.jl or DataAPI.jl for above approaches but then realized that it's more general than tables; e.g., it makes sense to use the above API with dictionaries and sets.
Regarding your two options - we need to support the case where the output container is empty with full type predictability. Are you falling back on inference whenever xs
is empty?
but then realized that it's more general than tables; e.g., it makes sense to use the above API with dictionaries and sets.
Yes, which is precisely why I was asking my question :) In this case, for Dictionaries.jl, where I am trying to have proper interfaces for insertability, mutability, immutability, etc, and the next problem to solve is "factories" - help welcome on interface ideas.. (After that, Jacob, I want to make a dictionary-like table, as opposed to these array-like tables, but they need to iterate rows not key-value pairs for it to be a Tables.jl table, hence the need for a new dictionaries interface).
I've found there's a class of containers where similar
and (key-preserving) setindex_widen!!
are best, and there are situations where it might be better to insert elements.
I've found there's a class of containers where
similar
and (key-preserving)setindex_widen!!
are best, and there are situations where it might be better to insert elements.
I agree. We can probably even say that the class of container is defined by the minimal requirements (as a sink). Something like:
(empty, push!!)
: set(similar, setindex!!)
: multi-dimensional array(empty, push!!, setindex!!)
: dict(empty, push!!, similar, setindex!!)
: vectorBy the way, I decided to go with "fake/lazy empty table" (Empty{T}
) route in Transducers.jl + BangBang.jl: https://tkf.github.io/Transducers.jl/dev/manual/#Base.copy. So, this is not strictly a bottleneck for me, at least for now. The reasoning was that it is not practical for me to open an issue like this for each table/collection package and try to convince package authors to support Union{}
eltype (which is normally useless).
Ok. So what do you do when the output really is empty?
copy(xf::Transducer, T::Type, foldable) :: Union{T, Empty{T}}
returns Empty(T)
if the transducer does not produce anything. This is because there is no consistent interface to create an empty container given its type and not all containers support creating an empty container.
Empty(T)
works like a T
as long as you do duck typing; e.g., using generic !!
functions and iterating over it. However, you still can't use function specific to T
. This is exactly the reason why I suggested to implement empty(::Type{ContainerType})
.
This is related to your comment in https://github.com/JuliaData/TypedTables.jl/pull/56#discussion_r344440013:
Oh you want to use types not instances? I’ve been heavily trending the other direction lately.
I understand that that not computing type by hand is the best practice (the worst example being return_type
). (This is the reason why I'm trying to make mutate-or-widen approach work.)
But, I believe a container with Union{}
eltype is special. This is the only instance that can be widen to anything else. To motivate this more, notice that empty(T)
is an identity element of append!
(when forget about mutation). So, we need empty(T)
to treat append!
as a monoid and this is a crucial thing to have for reduce
.
(The reason why I could still implement something like copy(xf, T, foldable)
is that you can create a monoid out of a semigroup by adjoining an identity element --- Empty(T)
. But, as I mentioned above, returning Empty(T)
instead of a T
has a usability issue when you are doing something specific to T
.)
on a related note, ndims
for DataFrames are 2 while for TypedTables it's 1. Should they be the same?
on a related note, ndims for DataFrames are 2 while for TypedTables it's 1.
Kind of. In TypedTables a Table
is an AbstractArray
of rows. Usually it is a 1D vector with a single index, but you can e.g. take the Cartesian join of two tables and end up with a 2D matrix (of rows) and thus create a Table
with ndims
equal 2
.
The difference with dataframes is that a Table
is always a nested container (like a vector of vectors) while for indexing a DataFrame
behaves a little bit like a matrix, and AFAICT ndims
is roughly meant to be the number of indices provided to getindex
.
Should they be the same?
I'm not sure - here (and in Dictionaries.jl) I've tried to keep indexing and iteration semantics in line with each other. The DataFrames approach is more pragmatic and practical and I wouldn't argue that ndims(::DataFrame) = 2
is incorrect, since that matches how indexing works. (Though I am not sure if any generic code could actually make use of that fact - perhaps the REPL autocompletion?).
I tried to create an empty table by
Table()
but it failed athttps://github.com/JuliaData/TypedTables.jl/blob/fc1247a590be301341d172d447f706347bed727e/src/TypedTables.jl#L23
because
n
is undefined.My motivation is to use this with
push!!
so thatThis, in turn, can be used to define "
collect
" with the mutate-or-widen idiom. This came up in https://github.com/tkf/Transducers.jl/issues/73#issuecomment-548925178FYI, it would have caught by
Test.detect_unbound_args
:Why not run
@test detect_unbound_args(TypedTables) == []
in the test? See also https://github.com/tkf/Aqua.jl