JuliaData / TypedTables.jl

Simple, fast, column-based storage for data analysis in Julia

just for my learning... (please!) #89

Open lewisl opened 2 years ago

lewisl commented 2 years ago

struct Table{T <: NamedTuple, N, Data <: NamedTuple{<:Any, <:Tuple{Vararg{AbstractArray{<:Any,N}}}}} <: AbstractArray{T, N}

What is the type qualifier for the struct saying?

adigitoleo commented 2 years ago

This is a (nested) parametric type definition. Parametric types are IMHO the most complicated part of Julia's type system. If you're not familiar with them, I have opened a PR (https://github.com/JuliaLang/julia/pull/43891) to try and clarify parametric types in the manual; it might also be worth checking the linked issue.

Let's go through it from left to right (maintainers, please correct me if I'm wrong).

The Table type declares three things: the row types, a dummy "dimension", and the column types. It might first seem redundant to declare both row and column types, but this is necessary because the row type doesn't cover the types of the column names, nor the column container type.

T <: NamedTuple, N, Data (...)

T reifies to a NamedTuple that "maps" column names to a type, thus defining the type of any single row. Let's take an example table:

julia> t = Table(a = [1, 2, 3], b = [2.0, 4.0, 6.0])
Table with 2 columns and 3 rows:
     a  b
   ┌───────
 1 │ 1  2.0
 2 │ 2  4.0
 3 │ 3  6.0

julia> typeof(t[1])
NamedTuple{(:a, :b), Tuple{Int64, Float64}}

In this case T became NamedTuple{(:a, :b), Tuple{Int64, Float64}}. The <: is necessary in the definition, because type parameters are invariant.
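To make the invariance point concrete, here is a small sketch that uses only Base Julia (no TypedTables needed). The name Row is just a label I'm introducing for the row type from the example above.

```julia
# Row type taken from the example table above (Row is an illustrative name).
Row = NamedTuple{(:a, :b), Tuple{Int64, Float64}}

# A fully parameterised NamedTuple type belongs to the NamedTuple family.
@show Row <: NamedTuple                 # true

# Type parameters are invariant: Int64 <: Real, yet the container types are unrelated.
@show Vector{Int64} <: Vector{Real}     # false

# A bounded parameter (<:) accepts every element type below the bound.
@show Vector{Int64} <: Vector{<:Real}   # true
```

The same reasoning is why the definition has to say T <: NamedTuple instead of naming a single fixed row type.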

The N always resolves to 1 (see next snippet), and is necessary only so that we can have Table <: AbstractArray, which means that tables inherit a bunch of nice methods. Basically, the Table is like a Vector of rows (recall that Vector{T} is an alias for Array{T,1}).
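As a rough illustration of what "inheriting a bunch of nice methods" means, here is a minimal sketch of a vector-of-rows type. RowList is a hypothetical name and is not how TypedTables is implemented; it only shows that subtyping AbstractVector and defining size and getindex is enough to get the rest for free.

```julia
# Hypothetical wrapper (not part of TypedTables): a 1-D array of rows backed by a plain Vector.
struct RowList{T} <: AbstractVector{T}
    rows::Vector{T}
end

# The two methods required by the read-only AbstractArray interface.
Base.size(r::RowList) = size(r.rows)
Base.getindex(r::RowList, i::Int) = r.rows[i]

r = RowList([(a = 1, b = 2.0), (a = 2, b = 4.0), (a = 3, b = 6.0)])

# Everything below is inherited from AbstractArray; none of it was defined above.
@show length(r)                   # 3
@show last(r)                     # (a = 3, b = 6.0)
@show [row.a for row in r]        # [1, 2, 3]
@show count(row -> row.b > 3, r)  # 2
```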

Now the fun part, the data itself:

julia> typeof(t)
Table{NamedTuple{(:a, :b), Tuple{Int64, Float64}}, 1, NamedTuple{(:a, :b), Tuple{Vector{Int64}, Vector{Float64}}}}

julia> typeof(t.a)
Vector{Int64} (alias for Array{Int64, 1})

julia> typeof(t.b)
Vector{Float64} (alias for Array{Float64, 1})

The column-based data are stored in one big NamedTuple. The column names themselves are not constrained (<:Any). Next, we have the type of the data columns, which is again parametric. In this case, Tuple{Vararg{AbstractArray{<:Any,N}}} resolved to Tuple{Vector{Int64}, Vector{Float64}}. We must use Vararg because the number of columns (i.e. Vectors) is not known until the table is constructed. The same dummy "dimension" parameter can be re-used, because it will also always be 1 (no such thing as a 2D column).

I hope this clarifies things. If you have any suggestions on how to improve the documentation for parametric types, let me know and I can maybe include them in my PR. In fact, this type definition could serve nicely as a showcase example...
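The Vararg bound can be checked directly with subtype tests in plain Julia. This is only a sketch: Cols and data are illustrative names, and N is fixed to 1 by hand here.

```julia
# Column container types from the example table (Cols is an illustrative name).
Cols = Tuple{Vector{Int64}, Vector{Float64}}

# Vararg matches any number of elements, each of which must be a 1-D array here.
@show Cols <: Tuple{Vararg{AbstractArray{<:Any, 1}}}                 # true
@show Tuple{Vector{Int64}} <: Tuple{Vararg{AbstractArray{<:Any, 1}}} # true

# A tuple mixing a vector and a matrix does not fit a single N.
@show Tuple{Vector{Int64}, Matrix{Float64}} <: Tuple{Vararg{AbstractArray{<:Any, 1}}}  # false

# The full Data bound, with N fixed to 1, accepts the columns of the example table.
data = (a = [1, 2, 3], b = [2.0, 4.0, 6.0])
@show typeof(data) <: NamedTuple{<:Any, <:Tuple{Vararg{AbstractArray{<:Any, 1}}}}  # true
```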

adigitoleo commented 2 years ago

I've just read in #55 that it's actually a bit more complicated in practice: you can end up with N = 2 tables in some cases. The discussion in that issue should be consulted for more details; what I have provided here is an incomplete overview.

adigitoleo commented 2 years ago

May I also suggest changing the title to something more descriptive (like "Understanding the Table type qualifier").

lewisl commented 2 years ago

Wow. That is really complicated.

What does the syntax of T <: NamedTuple, N, Data (…) say?

Does it mean that T is a subtype of each of the right hand side items?

Or does it imply nesting, as your explanation suggests:

NamedTuple has to include an N type and a Data type, which are themselves defined above the appearance of the T <: ...?

It’s so much folded into one statement that there is surely “magic” in how it works.

And what does the (…) signify? Does this refer to the column types within Data (the namedtuple of vectors), which we don’t want to be Any (well, I suppose that is allowed) but which can be each of the instance types in any example of an actual TypedTable—so that they all will match this pattern and inherit the methods of the supertypes.

You’ve given a great explanation of how the mechanics of TypedTables fits into the type system to benefit from method dispatch for the various super types.

But, the syntax of the T <: assertion remains a bit baffling. It is certainly compact but I don’t think that saving a handful of folks some typing (of the keyboard variety!—not the object variety) is a sound reason for pretty severe obscurity. Maybe your parametric types PR could address this. I’d say more typing (as long as it’s not a bunch of boilerplate to simplify parsing) to achieve clarity is probably a worthwhile trade-off.

adigitoleo commented 2 years ago

> What does the syntax of T <: NamedTuple, N, Data (…) say? And what does the (…) signify?

This is not Julia syntax, I just wrote it like that for brevity. The component T <: NamedTuple says that the type parameter T must be a NamedTuple*.

> NamedTuple has to include an N type and a Data type, which are themselves defined above the appearance of the T <: ...?

No, the Table type itself contains three constituent types, which are declared as type parameters: T, N and Data. Each of these reifies to some concrete type when an instance of Table is created.
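Here is a small sketch of three independent parameters in a single struct. MiniTable is a hypothetical stand-in for Table that I'm making up for illustration; it keeps the same parameter list but leaves out the fuller bound on Data and the AbstractArray supertype.

```julia
# Hypothetical stand-in for Table with the same three parameters (simplified bounds).
struct MiniTable{T <: NamedTuple, N, Data <: NamedTuple}
    data::Data
end

cols = (a = [1, 2, 3], b = [2.0, 4.0, 6.0])
Row = NamedTuple{(:a, :b), Tuple{Int64, Float64}}

# All three parameters are supplied explicitly, separated by commas.
m = MiniTable{Row, 1, typeof(cols)}(cols)

# Each parameter can be read back independently of the others.
T, N, Data = typeof(m).parameters
@show T == Row             # true
@show N                    # 1
@show Data == typeof(cols) # true
```

Note that N is a plain integer value and not a type, which is another hint that the comma-separated entries are separate slots and not nested inside T.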

> But, the syntax of the T <: assertion remains a bit baffling.

I agree that this is the most confusing part. Hopefully it makes more sense now? T <: NamedTuple just declares that T can be any type, so long as it is from the NamedTuple parametric family*. The commas are not a type union syntax, so the next parts, i.e. N and Data, are independent, constituent types.

> Maybe your parametric types PR could address this.

My PR is only about changing documentation, and I doubt that changes to fundamental Julia syntax would be accepted at v1.7 of the language.

*It could seem confusing that T <: NamedTuple seems to assert a subtype relation between T and NamedTuple, despite the latter being a parametric composite type (composite types cannot have declared subtypes in Julia; only abstract types can):

julia> isconcretetype(NamedTuple)
false

julia> isabstracttype(NamedTuple)
false

What's going on here? It's neither abstract nor concrete? As I have highlighted in my changes, if both of these return false, then we are dealing with a parametric composite type. Parametric types aren't concrete, because they represent a family of types, but they need not represent a family of abstract types. In this case, the <: syntax is asserting that T is one of the types that is defined by the NamedTuple parametric type. I think this "overloading" of the <: syntax is what you find confusing, and I would tend to agree, but I'm not sure it's bad enough to justify changing the language.
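A quick way to see the "family of types" idea in the REPL is sketched below, using only Base Julia. Row is again just an illustrative name for the row type from the earlier example.

```julia
# NamedTuple without parameters is a UnionAll: a family of types, not a single type.
@show NamedTuple isa UnionAll          # true
@show isconcretetype(NamedTuple)       # false
@show isabstracttype(NamedTuple)       # false

# Fixing both parameters picks one concrete member of the family.
Row = NamedTuple{(:a, :b), Tuple{Int64, Float64}}
@show isconcretetype(Row)              # true
@show Row <: NamedTuple                # true

# Compare with a parametric type that was declared abstract.
@show isabstracttype(AbstractArray)    # true
```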