JuliaData / IndexedTables.jl

Flexible tables with ordered indices
https://juliadb.org
MIT License
120 stars 38 forks source link

RFC: Iteration API #63

Closed shashi closed 6 years ago

shashi commented 6 years ago

My current thinking for a coherent API:

  1. columns -- returns (named) tuples of column vectors themselves.

    columns(c::Columns) = c.columns
    columns(c::Columns, which...) = c.columns[[which...]]
    columns(t::IndexedTable) = (t.index.columns..., t.data.columns...)
    # ^ named tuple if everything is named
    columns(t::IndexedTable, which...) = all columns with those names
    # ^ this one assumes uniqueness of names / continuity of numbering
    # among index and data cols

    Not sure how useful this is for users, but it's useful to implement row iterators below.

  2. rows -- iterator over rows.

rows(c::Columns) = c
rows(c::Columns, which...) = Columns(columns(c, which...))
rows(t::IndexedTable) = Columns(columns(t))
rows(c::IndexedTable, which...) = Columns(columns(t, which...))

It's a little funny that columns returns a tuple, and rows returns Columns. :) I find this quite apt, but will admit it might be confusing because of the naming.

  1. keys, values and pairs -- iterators over rows that acknowledge the key-value pair view of IndexedTable
    
    keys(t::IndexedTable, dims...) = Columns(columns(keys(t), dims...))
    values(t::IndexedTable, dims...) = Columns(columns(values(t), dims...))

pairs(t::IndexedTable) = Columns(keys(t), values(t)) # nested Columns


Bonus feature:

`which` arguments to `columns` (and hence `dims` args in other operations) can be `Symbol`, `Int` or *`As`* --[`As(f, src_name, dest_name)`](https://github.com/JuliaComputing/IndexedTables.jl/blob/37bb28e8f7e57fd874982af5454fd4cc11ad4ff2/src/query.jl#L33-L39) will cause a field to be renamed in the output named tuple and optionally apply a function `f` on it before jamming it into the output.

The exciting part is, all these can be implemented on `DTable`. If we had a common name for an IndexedTable constructor, (say `table(keys, vals)`), then we can dispatch to create either a `DTable` or an `IndexedTable` based on whether the inputs are distributed or not. I can imagine some operations (e.g. `select`, `permutedims`) could have a common implementation.

I will rebase #55 against this.

cc @JeffBezanson @StefanKarpinski @davidanthoff 
JeffBezanson commented 6 years ago

Looks good to me!

Only potential minor issue is the continuous numbering of index and data columns. Maybe it's clear from context whether column numbers refer to the whole table or just the index/data part.

shashi commented 6 years ago

We can have a convention that rows(t::IndexedTable, which...) and columns(t::IndexedTable, which...) number the columns of the keys(t) 1:m, and columns of values(t) (m+1):n. Also, if the data is a vector m + 1 to refers to it.

andyferris commented 6 years ago

So tables won't be naturally iterable?

JeffBezanson commented 6 years ago

No, tables can still be iterable. Columns will still iterate rows, and IndexedTable will still iterate values like an array.

Maybe rows(t::IndexedTable, which...) should be rows(columns(t, which...))?

andyferris commented 6 years ago

and IndexedTable will still iterate values like an array.

Out of curiosity, what is the function of this? As in, what have you used it for?

shashi commented 6 years ago

rows(columns(t, which...))

And something like rows(cols::Union{Tuple, NamedTuple}) = Columns(cols)? That could work...

shashi commented 6 years ago

@andyferris map(f, t::IndexedTable)::IndexedTable is similar to map on sparse matrices... It also allows you to create new data columns by returning (named) tuples from f -- that's pretty useful.

andyferris commented 6 years ago

Where did sparse matrices come from? You mean because elements might be Null?

OK, I guess I was expecting to map over rows.

map(row -> (a = row.a, c = f(row.b)), indextable)

Here, I can change the column names as I wish. row would include the key columns and all the other columns.

and IndexedTable will still iterate values like an array.

I'm trying to understand this - is a value a single field of a single row? (Or is it the half of the row which doesn't include the key/index columns?)

(Sorry for my ignorance, but I'm very curious/interested in how this will work).

shashi commented 6 years ago

Where did sparse matrices come from?

An IndexedTable is an N-dimensional sparse cube... It's indexed by arbitrary orderable values instead of 1:n, 1:m etc. keys will give you those indices. We wrap n-d array operations like mapslices, permutedims, broadcast and reducedim etc, also 2-arg map is defined using innerjoin.

One difference from N-D array is its dimensions are not bounded... We haven't seen many weird implications of that in practice, moreover we know the bounding box so we can treat it as a cube if we need to. Another difference is dimensions can be continuous. Again, no issues have come up because of this.

I'm trying to understand this - is a value a single field of a single row? (Or is the half of the row which isn't in the key/index columns?)

Conceptually, the values are stored as a single vector. Columns is just columnar storage for tuples. Columns is a subtype of AbstractVector.

So another way to look at an IndexedTable is a mapping between a keys vector and a values vector. The keys vector is always a vector of tuples (aka Columns).

julia> t = IndexedTable(Columns(x=[1,1], y=[1,3]), [3,4])
x  y │
─────┼──        
1  1 │ 3                        
1  3 │ 4

julia> ndims(t)
2

julia>dimlabels(t)
[:x, :y]

julia> eltype(t)
Int64

julia> t[1,3]
4

julia> t = IndexedTable(Columns(x=[1,1], y=[1,3]), Columns(([3,4], [5,6])))
x  y │
─────┼─────
1  1 │ 3  5
1  3 │ 4  6

julia> eltype(t)
Tuple{Int64,Int64}

julia> t[1,3]
(4, 6)

julia> t = IndexedTable(Columns(x=[1,1], y=[1,3]), Columns(a=[3,4], b=[5,6]))
x  y │ a  b
─────┼─────
1  1 │ 3  5
1  3 │ 4  6

julia> eltype(t)
NamedTuples._NT_a_b{Int64,Int64}

julia> t[1,3]
(a = 4, b = 6)

map(row -> (a = row.a, c = f(row.b)), indextable) row would include the key columns and all the other columns.

The question is what's the output indexed by? (considering map(f, ::IndexedTable) should give an IndexedTable)

What this proposal allows is slicing and dicing the values or keys in terms of their columnar representation and then iterating over the resultant vectors. With this proposal, you could write map(row -> (a = row.a, c = f(row.b)), rows(indextable)) to have your way. It would return a Columns, an AbstractVector. You can then use the resultant Columns as either the keys or values to construct a new table / subset it and create a new one etc.

Great questions, I suppose we have some work to do in terms of documenting this. Cheers!

JeffBezanson commented 6 years ago

An IndexedTable is an "opinionated" view of a table where some column(s) is/are designated as the "data" you're more interested in, so for example functions like maximum(itable) work automatically and give the maximum data value. Or you can have tables called speed and time, and write speed .* time and it will join on the index parts and call * on the data.

andyferris commented 6 years ago

@sashi Thank you for the detailed answer. :)

An IndexedTable is an N-dimensional sparse cube...

I have had a few conversations with Jeff about this. It's my opinion that the analogy with a "sparse subset of a N-D space of indices" will prove to be counter-productive.

If something has N dimenensions (or lives in a N dimensional space) then what we mean is that the space is the Cartesian product of N spaces. It has been pointed out to me that "dimensions" can mean slightly different things (you can say a matrix is a 2D container, or you can say you have a 2D vector) but in all cases you are taking the Cartesian products (a matrix may live in ℝⁿ ⊗ ℝᵐ, a 2-vector may live in ℝ² which is a Cartesian product of ℝ with ℝ). A sparse matrix is just a dense matrix where you are using compression techniques to use less memory, taking advantage of the fact that the dense matrix has many elements equal to zero. The interface and our discussion of a sparse matrix is the same as a dense matrix but it may do some things faster.

On the other hand, IndexedTable is definitely making use of the same kind of sparsity-pattern we use to implement SparseMatrixCSC. However, the interpretation is quite different... If you have a table like

x  y │ a
────────
1  1 │ 3
1  2 │ 1
1  4 │ 5
2  3 │ 4
2  4 │ 0

then the indices do not live in (1,2) ⊗ (1,2,3,4) because there is no element at (x=1, y=3). In a sparse 2D structure all such elements exists, and they may be e.g. Null, but as I understand IndexedTable this is not at all the interpretation.

I would rather say that the example table is better viewed as a mapping between the 1D list of 2-tuple indices ((1,1), (1,2), (1,4), (2,3), (2,4)) and the values (3,1,5,3,0). You are taking advantage of sort-ordering to enable fast lookup, and you might say you conceptually store the indices as (1 ⊗ (1,2,3) ⊕ 2 ⊗ (3,4)). This kind of "nested" (lexicographically sorted) indexing layout is pretty powerful but it's not quite right to say its N-dimensional. (I'm sorry - this got really mathy really quickly!! But I hope you understand? I feel poor language and subtle misunderstandings can lead to poor design decisions in the future)

So another way to look at an IndexedTable is a mapping between a keys vector and a values vector. The keys vector is always a vector of tuples (aka Columns).

Yep, I was beginning to see this.

So I have a proposal in this space, to try and unify how we deal with tables and relations in Julia. I spoke with Jeff at JuliaCon hackathon and have since spent a couple days playing around with relations, syntax, iteration, and so-on.

I have heard the arguments surrounding, e.g., having Dict iterate values rather than pairs, on the basis that the keys are metadata. This seems fine to me - no objections. For an example in the table space, the row index in a DataFrame is called Row and basically is metadata. On the other hand, when I look at IndexedTable, I see data to the left of the vertical line and data to the right. If you want to do arbitrary relational operations on IndexedTable, to me it makes sense if we say it is a container of rows of all the columns (e.g. you can make a new table containing only x). In a relational database, columns which are keys (or which are indexed by sorting, etc, but not necessarily unique) as treated equally as data and equally by syntax in SQL. However, the RDMS knows which columns are keys, which are sorted, and so on, and can use this information to execute the query (and each operation in the query like join) quicker.

I would like to propose we try mimic the same thing in Julia. That is, we promote a "relational interface" where tables are seen as collections of rows (of all the data fields, i.e. those that participate in the relation). For some tables like DataFrame where the index used in getindex is just the row number, and this index might be metadata rather than a field. It is not a part of the relation we are trying to represent. For tables with key columns, the key column can simultaneously be data and also what is used in getindex to fetch rows quickly. There is no paradox there. The "relational algebra" interface could be richer than just bare iteration of rows - we should be able to interrogate a table and ask it if it has any uniqueness constraints and any acceleration indexes.

An IndexedTable is an "opinionated" view of a table where some column(s) is/are designated as the "data" you're more interested in, so for example functions like maximum(itable) work automatically and give the maximum data value. Or you can have tables called speed and time, and write speed .* time and it will join on the index parts and call * on the data.

OK, that seems like cool use of broadcast. I was hoping at some point we would have sensible (perhaps relational-algebra motivated, or perhaps indexing motivated) broadcasting over tables.

(OTOH I would feel slightly sad if I had to type rows(table) whenever I wanted to see the relational structure, because I'd like to treat IndexTable and tables from other packages uniformly as relations)

JeffBezanson commented 6 years ago

Ok, I can understand that you don't want to write rows(table), but other people want to be able to write a .* b and the kind of getindex we have here (of course there might not be a conflict there --- I don't know if the relational interface will contain getindex). A zero-tolerance policy on interface variations is too limiting.

shashi commented 6 years ago

A zero-tolerance policy on interface variations is too limiting.

I agree with Jeff here, Andy. Consumers of IndexedTables could use rows(t) when appropriate...

That said, I kinda wish rows(t)[5:10] (or filter on rows(t)) would give the subset of rows but as an IndexedTable with the same structure. But there might be an elegant solution in:

select(t) do table
   rows(table)[5:10] # or filter
end

This was brought up by @VenkatSubramaniam's experiments with the package.