Closed shashi closed 6 years ago
Looks good to me!
Only potential minor issue is the continuous numbering of index and data columns. Maybe it's clear from context whether column numbers refer to the whole table or just the index/data part.
We can have a convention that rows(t::IndexedTable, which...)
and columns(t::IndexedTable, which...)
number the columns of the keys(t)
1:m
, and columns of values(t)
(m+1):n
. Also, if the data is a vector m + 1
to refers to it.
So tables won't be naturally iterable?
No, tables can still be iterable. Columns
will still iterate rows, and IndexedTable
will still iterate values like an array.
Maybe rows(t::IndexedTable, which...)
should be rows(columns(t, which...))
?
and
IndexedTable
will still iterate values like an array.
Out of curiosity, what is the function of this? As in, what have you used it for?
rows(columns(t, which...))
And something like rows(cols::Union{Tuple, NamedTuple}) = Columns(cols)
? That could work...
@andyferris map(f, t::IndexedTable)::IndexedTable
is similar to map
on sparse matrices... It also allows you to create new data columns by returning (named) tuples from f
-- that's pretty useful.
Where did sparse matrices come from? You mean because elements might be Null
?
OK, I guess I was expecting to map
over rows.
map(row -> (a = row.a, c = f(row.b)), indextable)
Here, I can change the column names as I wish. row
would include the key columns and all the other columns.
and
IndexedTable
will still iterate values like an array.
I'm trying to understand this - is a value a single field of a single row? (Or is it the half of the row which doesn't include the key/index columns?)
(Sorry for my ignorance, but I'm very curious/interested in how this will work).
Where did sparse matrices come from?
An IndexedTable
is an N-dimensional sparse cube... It's indexed by arbitrary orderable values instead of 1:n, 1:m
etc. keys
will give you those indices. We wrap n-d array operations like mapslices
, permutedims
, broadcast
and reducedim
etc, also 2-arg map is defined using innerjoin
.
One difference from N-D array is its dimensions are not bounded... We haven't seen many weird implications of that in practice, moreover we know the bounding box so we can treat it as a cube if we need to. Another difference is dimensions can be continuous. Again, no issues have come up because of this.
I'm trying to understand this - is a value a single field of a single row? (Or is the half of the row which isn't in the key/index columns?)
Conceptually, the values are stored as a single vector. Columns
is just columnar storage for tuples. Columns
is a subtype of AbstractVector.
So another way to look at an IndexedTable
is a mapping between a keys vector and a values vector. The keys vector is always a vector of tuples (aka Columns
).
julia> t = IndexedTable(Columns(x=[1,1], y=[1,3]), [3,4])
x y │
─────┼──
1 1 │ 3
1 3 │ 4
julia> ndims(t)
2
julia>dimlabels(t)
[:x, :y]
julia> eltype(t)
Int64
julia> t[1,3]
4
julia> t = IndexedTable(Columns(x=[1,1], y=[1,3]), Columns(([3,4], [5,6])))
x y │
─────┼─────
1 1 │ 3 5
1 3 │ 4 6
julia> eltype(t)
Tuple{Int64,Int64}
julia> t[1,3]
(4, 6)
julia> t = IndexedTable(Columns(x=[1,1], y=[1,3]), Columns(a=[3,4], b=[5,6]))
x y │ a b
─────┼─────
1 1 │ 3 5
1 3 │ 4 6
julia> eltype(t)
NamedTuples._NT_a_b{Int64,Int64}
julia> t[1,3]
(a = 4, b = 6)
map(row -> (a = row.a, c = f(row.b)), indextable) row would include the key columns and all the other columns.
The question is what's the output indexed by? (considering map(f, ::IndexedTable)
should give an IndexedTable
)
What this proposal allows is slicing and dicing the values or keys in terms of their columnar representation and then iterating over the resultant vectors. With this proposal, you could write map(row -> (a = row.a, c = f(row.b)), rows(indextable))
to have your way. It would return a Columns
, an AbstractVector
. You can then use the resultant Columns
as either the keys or values to construct a new table / subset it and create a new one etc.
Great questions, I suppose we have some work to do in terms of documenting this. Cheers!
An IndexedTable
is an "opinionated" view of a table where some column(s) is/are designated as the "data" you're more interested in, so for example functions like maximum(itable)
work automatically and give the maximum data value. Or you can have tables called speed
and time
, and write speed .* time
and it will join on the index parts and call *
on the data.
@sashi Thank you for the detailed answer. :)
An
IndexedTable
is an N-dimensional sparse cube...
I have had a few conversations with Jeff about this. It's my opinion that the analogy with a "sparse subset of a N-D space of indices" will prove to be counter-productive.
If something has N dimenensions (or lives in a N dimensional space) then what we mean is that the space is the Cartesian product of N spaces. It has been pointed out to me that "dimensions" can mean slightly different things (you can say a matrix is a 2D container, or you can say you have a 2D vector) but in all cases you are taking the Cartesian products (a matrix may live in ℝⁿ ⊗ ℝᵐ, a 2-vector may live in ℝ² which is a Cartesian product of ℝ with ℝ). A sparse matrix is just a dense matrix where you are using compression techniques to use less memory, taking advantage of the fact that the dense matrix has many elements equal to zero. The interface and our discussion of a sparse matrix is the same as a dense matrix but it may do some things faster.
On the other hand, IndexedTable
is definitely making use of the same kind of sparsity-pattern we use to implement SparseMatrixCSC
. However, the interpretation is quite different... If you have a table like
x y │ a
────────
1 1 │ 3
1 2 │ 1
1 4 │ 5
2 3 │ 4
2 4 │ 0
then the indices do not live in (1,2) ⊗ (1,2,3,4)
because there is no element at (x=1, y=3)
. In a sparse 2D structure all such elements exists, and they may be e.g. Null
, but as I understand IndexedTable
this is not at all the interpretation.
I would rather say that the example table is better viewed as a mapping between the 1D list of 2-tuple indices ((1,1), (1,2), (1,4), (2,3), (2,4))
and the values (3,1,5,3,0)
. You are taking advantage of sort-ordering to enable fast lookup, and you might say you conceptually store the indices as (1 ⊗ (1,2,3) ⊕ 2 ⊗ (3,4))
. This kind of "nested" (lexicographically sorted) indexing layout is pretty powerful but it's not quite right to say its N-dimensional. (I'm sorry - this got really mathy really quickly!! But I hope you understand? I feel poor language and subtle misunderstandings can lead to poor design decisions in the future)
So another way to look at an
IndexedTable
is a mapping between a keys vector and a values vector. The keys vector is always a vector of tuples (akaColumns
).
Yep, I was beginning to see this.
So I have a proposal in this space, to try and unify how we deal with tables and relations in Julia. I spoke with Jeff at JuliaCon hackathon and have since spent a couple days playing around with relations, syntax, iteration, and so-on.
I have heard the arguments surrounding, e.g., having Dict
iterate values rather than pairs, on the basis that the keys are metadata. This seems fine to me - no objections. For an example in the table space, the row index in a DataFrame
is called Row
and basically is metadata. On the other hand, when I look at IndexedTable
, I see data to the left of the vertical line and data to the right. If you want to do arbitrary relational operations on IndexedTable
, to me it makes sense if we say it is a container of rows of all the columns (e.g. you can make a new table containing only x
). In a relational database, columns which are keys (or which are indexed by sorting, etc, but not necessarily unique) as treated equally as data and equally by syntax in SQL. However, the RDMS knows which columns are keys, which are sorted, and so on, and can use this information to execute the query (and each operation in the query like join
) quicker.
I would like to propose we try mimic the same thing in Julia. That is, we promote a "relational interface" where tables are seen as collections of rows (of all the data fields, i.e. those that participate in the relation). For some tables like DataFrame
where the index used in getindex
is just the row number, and this index might be metadata rather than a field. It is not a part of the relation we are trying to represent. For tables with key columns, the key column can simultaneously be data and also what is used in getindex
to fetch rows quickly. There is no paradox there. The "relational algebra" interface could be richer than just bare iteration of rows - we should be able to interrogate a table and ask it if it has any uniqueness constraints and any acceleration indexes.
An IndexedTable is an "opinionated" view of a table where some column(s) is/are designated as the "data" you're more interested in, so for example functions like maximum(itable) work automatically and give the maximum data value. Or you can have tables called
speed
andtime
, and writespeed .* time
and it will join on the index parts and call*
on the data.
OK, that seems like cool use of broadcast
. I was hoping at some point we would have sensible (perhaps relational-algebra motivated, or perhaps indexing motivated) broadcasting over tables.
(OTOH I would feel slightly sad if I had to type rows(table)
whenever I wanted to see the relational structure, because I'd like to treat IndexTable
and tables from other packages uniformly as relations)
Ok, I can understand that you don't want to write rows(table)
, but other people want to be able to write a .* b
and the kind of getindex
we have here (of course there might not be a conflict there --- I don't know if the relational interface will contain getindex
). A zero-tolerance policy on interface variations is too limiting.
A zero-tolerance policy on interface variations is too limiting.
I agree with Jeff here, Andy. Consumers of IndexedTables could use rows(t)
when appropriate...
That said, I kinda wish rows(t)[5:10]
(or filter on rows(t)
) would give the subset of rows but as an IndexedTable with the same structure. But there might be an elegant solution in:
select(t) do table
rows(table)[5:10] # or filter
end
This was brought up by @VenkatSubramaniam's experiments with the package.
My current thinking for a coherent API:
columns
-- returns (named) tuples of column vectors themselves.Not sure how useful this is for users, but it's useful to implement row iterators below.
rows
-- iterator over rows.It's a little funny that
columns
returns a tuple, androws
returnsColumns
. :) I find this quite apt, but will admit it might be confusing because of the naming.keys
,values
andpairs
-- iterators over rows that acknowledge the key-value pair view of IndexedTablepairs(t::IndexedTable) = Columns(keys(t), values(t)) # nested Columns