JuliaData / IndexedTables.jl

Flexible tables with ordered indices
https://juliadb.org
MIT License

reshaping? #70

Open tcovert opened 7 years ago

tcovert commented 7 years ago

What is a way of reshaping an IndexedTable? The common use case I have in mind is where a data Column contains a vector of strings, where the typical entry is of the form "a,b" and the goal is to separate the two string values into a longer column of individual string values. Here is a MWE:

julia> t0 = IndexedTable(Columns(id = [1, 2]), Columns(val = ["a,b", "c,d"]))
id │ val
───┼──────
1  │ "a,b"
2  │ "c,d"

julia> t1 = IndexedTable(Columns(id = [1, 1, 2, 2]), Columns(newval = ["a", "b", "c", "d"]))
id │ newval
───┼───────
1  │ "a"
1  │ "b"
2  │ "c"
2  │ "d"

Is there a way to get t1 from t0 using the function split? It seems this ought to be somehow possible with map, but I believe the f in map(f, t::IndexedTable) is expected to return a scalar or a NamedTuple, not a Columns object. Does mapslices do this?
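
For reference, the result being asked for can be sketched on plain vectors, without IndexedTables at all. This only illustrates the intended t0 -> t1 reshaping; the names `ids`, `vals`, and `rows` are made up for the example, and the named-tuple syntax assumes Julia 1.x rather than the NamedTuples package used elsewhere in this thread:

```julia
# Plain-Julia sketch of the t0 -> t1 reshaping (no IndexedTables involved).
ids  = [1, 2]
vals = ["a,b", "c,d"]

# For each (id, val) pair, emit one (id, newval) row per comma-separated piece.
# The second `for` clause is what flattens the per-row pieces into one vector.
rows = [(id = id, newval = String(part))
        for (id, val) in zip(ids, vals)
        for part in split(val, ",")]

new_ids  = [r.id for r in rows]      # the longer key column
new_vals = [r.newval for r in rows]  # the individual string values
```

The two resulting vectors have the same contents as the id and newval columns of t1 above.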

The reverse of this operation seems to be doable using aggregate.

Thanks in advance for any suggestions!

shashi commented 7 years ago

julia> t0 = IndexedTable(Columns(id = [1, 2]), Columns(val = ["a,b", "c,d"]))

julia> t1 = mapslices(t0, ()) do slice
           parts = split(first(slice).val, ",")
           IndexedTable(fill(1, length(parts)), parts)
       end
─────┬────
1  1 │ "a"
1  1 │ "b"
2  1 │ "c"
2  1 │ "d"

julia> select(t1, 1)
──┬────
1 │ "a"
1 │ "b"
2 │ "c"
2 │ "d"

It's required that an IndexedTable have at least 1 dimension, hence the extra dimension returned in mapslices.

tcovert commented 7 years ago

Thanks - more complicated than I would have thought.

Would it make sense to add another clause in the mapslices code that just checks if the function returns Columns instead of an IndexedTable, similar to how it differentiates between IndexedTable and scalar return values? That would seem to be easier on the user and would not necessitate dropping an extra key column after the fact.

It would be nice to be able to write something like this:

t1 = mapslices(x->Columns(newval = split(first(x).val, ",")), t0, ())  

but that currently triggers the "calling mapslices with no dimensions and scalar return value -- use map instead" error, and the equivalent map statement is only slightly better:

julia> map(x->Columns(newval = split(x.val, ",")), t0)
id │ 
───┼──────────────────────────────────────────────────────────────────────────
1  │ NamedTuples._NT_newval{SubString{String}}[(newval = "a"), (newval = "b")]
2  │ NamedTuples._NT_newval{SubString{String}}[(newval = "c"), (newval = "d")]

davidanthoff commented 7 years ago

Here is the Query.jl way to do this:

@from i in t0 begin
    @select {i.id, names=split(i.val,",")} into i
    @from j in i.names
    @select {i.id,newval=j}
    @collect IndexedTable
end

shashi commented 7 years ago

@tcovert I agree it's harder than it should be... It's easy to add a function that does this: for every row, it would fill a new dimension with 1:length of that row's vector. What would one call it? This is something like reduce(hcat, vector of vectors) rather than flatten.
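
A rough sketch of what such a function might compute, written against plain vectors rather than an IndexedTable. `expand_with_position` is a hypothetical name and not an IndexedTables function; the code assumes Julia 1.x:

```julia
# Hypothetical helper: expand each row into one row per element, adding the
# element's position (1:length of that row's vector) as an extra key.
function expand_with_position(ids, vectors)
    out = Tuple{eltype(ids), Int, eltype(eltype(vectors))}[]
    for (id, xs) in zip(ids, vectors)
        for (pos, x) in enumerate(xs)
            push!(out, (id, pos, x))
        end
    end
    return out
end

ids   = [1, 2]
parts = [split(v, ",") for v in ["a,b", "c,d"]]
rows  = expand_with_position(ids, parts)
# Every (id, pos) pair is distinct, so each key still maps to a single row.
```

Because the position becomes part of the key, no two output rows share a key, which would avoid having to drop a placeholder dimension afterwards.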

@davidanthoff that looks pretty neat! What does the second @select do?

davidanthoff commented 7 years ago

This query is actually two queries chained, i.e. a @select ... into i concludes the first query, and then right away starts a new query on those results with i as the range variable. That second query again needs to terminate with a @select statement, so that is what the second @select does.

The really neat thing is the @from in the middle. It is actually flattening the list of lists that the first query creates.

tcovert commented 7 years ago

In my mind, this could just be a part of map. In the event that f applied to an element of an IndexedTable evaluates to a Columns(), map would construct a new IndexedTable with those columns as the data and the original index columns as the indices, though I agree that this violates the typical definition of map.

What if map returned an array of IndexedTables in this case and the user could just use reduce(cat, map(...))?

tcovert commented 7 years ago

Another approach here that respects the IndexedTable goal of having a single row per key would be to add the newly created column to the set of key columns.

davidanthoff commented 7 years ago

I think this is fundamentally the transformation that LINQ calls SelectMany, which is what the second @from clause in the Query.jl example performs. I don't think it belongs in map, whose well-defined semantics don't really cover this case.
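
To make the SelectMany comparison concrete, here is a minimal plain-Julia sketch. `selectmany` is a hypothetical helper written only for illustration; it is not part of Query.jl or IndexedTables, and the named tuples assume Julia 1.x:

```julia
# Hypothetical SelectMany: `collection(x)` yields the inner items for each
# outer element `x`, and `result(x, y)` builds one output row per inner item.
selectmany(collection, result, xs) =
    [result(x, y) for x in xs for y in collection(x)]

t0_rows = [(id = 1, val = "a,b"), (id = 2, val = "c,d")]

# One output row per (row, piece) pair: four rows from two inputs.
flat = selectmany(r -> split(r.val, ","),
                  (r, part) -> (id = r.id, newval = String(part)),
                  t0_rows)

# By contrast, map yields exactly one output per input row, so the pieces
# stay nested inside a vector for each id.
nested = map(r -> split(r.val, ","), t0_rows)
```

The difference in output length (four rows versus two) is the reason the operation does not fit map: map preserves the number of elements, while SelectMany may change it.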