MechanicalRabbit / DataKnots.jl

an extensible, practical and coherent algebra of query combinators
https://mechanicalrabbit.github.io/DataKnots.jl

Properly convert a Dict into a DataKnot #11

Open haberdashPI opened 5 years ago

haberdashPI commented 5 years ago

Hi there! I'm excited by the potential of this library for my day-to-day work: I appreciate the elegant, coherent and highly composable nature of this approach.

I would expect the two following commands to result in the same DataKnot

convert(DataKnot,(joe = (bob = [1,2,3], bill=[3,4,5]),))
convert(DataKnot,Dict("joe" => Dict("bob" => [1,2,3], "bill"=>[3,4,5])))

I ask because, when using JSON, the latter format is how data is parsed.

Doesn't seem like it would be hard to implement. I'm happy to submit a pull request if there's interest.

xitology commented 5 years ago

Hi David! We are definitely interested in smooth support for JSON data. Currently, it is difficult to convert a Dict to a DataKnot because a DataKnot needs to know the shape of the input data. For a NamedTuple-based structure, it's easy to get the full list of attributes and their types from the type alone.

julia> typeof((joe = (bob = [1,2,3], bill=[3,4,5]),))
NamedTuple{(:joe,),Tuple{NamedTuple{(:bob, :bill),Tuple{Array{Int64,1},Array{Int64,1}}}}}

But that's not the case for a Dict object.
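
To make the contrast concrete, here is a small sketch using only Base Julia (the variable names `nt` and `d` are chosen for illustration). The NamedTuple type carries the attribute names, while the Dict type records only the key and value types:

```julia
nt = (joe = (bob = [1,2,3], bill = [3,4,5]),)
d  = Dict("joe" => Dict("bob" => [1,2,3], "bill" => [3,4,5]))

# The attribute names are part of the NamedTuple type itself.
fieldnames(typeof(nt))       # (:joe,)
fieldnames(typeof(nt.joe))   # (:bob, :bill)

# The Dict type only records key and value types; the keys "joe", "bob"
# and "bill" can be discovered only by inspecting the value at run time.
typeof(d) == Dict{String, Dict{String, Vector{Int}}}   # true
```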

> Doesn't seem like it would be hard to implement. I'm happy to submit a pull request if there's interest.

I'm not sure generic JSON support is low-hanging fruit. Perhaps a limited approach tailored to a specific data format would work well. That said, we'd love to have more contributors. Please let me know if you have any questions on the codebase. We usually linger on https://gitter.im/rbt-lang/rbt-proto.

haberdashPI commented 5 years ago

Perhaps I am missing something, but a Dict can be readily converted to a NamedTuple.

For example:

as_namedtuple(xs) = xs
function as_namedtuple(xs::Dict{<:AbstractString})
  kt = Tuple((Symbol(x) for x in keys(xs)))
  vt = Tuple(values(xs))
  NamedTuple{kt}(as_namedtuple.(vt))
end

There are obviously cases where this will fail, for particular values of vt, but it shouldn't be hard to handle all of the input you would expect to get from JSON.parse, for example.
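
As a rough sketch of what handling more of that input might look like, the variant below also recurses into arrays, so that objects nested inside JSON arrays are converted too. The name `to_nt` is hypothetical, and the `parsed` value is built by hand to mimic the `Dict{String,Any}` / `Vector{Any}` structure a JSON parser typically returns, so no JSON package is needed to run it:

```julia
# Hypothetical helper `to_nt`; not part of DataKnots or JSON.jl.
to_nt(x) = x                                  # scalars pass through unchanged
to_nt(xs::AbstractVector) = map(to_nt, xs)    # convert each element of an array
function to_nt(d::AbstractDict{<:AbstractString})
    ks = Tuple(Symbol(k) for k in keys(d))    # keys become field names
    vs = Tuple(to_nt(v) for v in values(d))   # values are converted recursively
    NamedTuple{ks}(vs)
end

# Hand-built stand-in for parsing
# {"joe": {"bob": [1,2,3], "kids": [{"name": "al"}]}}
parsed = Dict{String,Any}(
    "joe" => Dict{String,Any}(
        "bob"  => Any[1, 2, 3],
        "kids" => Any[Dict{String,Any}("name" => "al")]))

nt = to_nt(parsed)
nt.joe.kids[1].name   # "al"
```

Note that the field order follows the Dict's iteration order, which is not the order in the JSON text.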

haberdashPI commented 5 years ago

It occurred to me that what you might be saying is that even if a JSON file was represented as a NamedTuple structure, not all such structures could be handled by DataKnots: that seems okay. Fundamentally the reason would not be related to the Dict type in that case, but rather to the particular structure of the data.

If a JSON file has a format which could be handled by the appropriate NamedTuple representation, it seems nice to make that possible.

xitology commented 5 years ago

This transformation is indeed fragile, but, as you said, it's not really an issue.

However, we'd like to avoid any data transformations in the DataKnot constructor and just let it wrap the input value, whatever it is. Instead, any transformations should preferably be offloaded to the internal query engine. This ensures that values of the same type are treated consistently, regardless of whether they are top-level, nested, or obtained as an intermediate query result.

The problem with Dict values is that the DataKnots query engine, in its current state, doesn't cope well with transformations where it cannot statically determine the shape of the output.
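
One way to read "cannot statically determine the shape" is sketched below in plain Julia; this is an illustration, not a description of the DataKnots internals. Two Dicts of the same type convert to NamedTuples of different types, so the output type cannot be predicted from the input type. The helper `shallow_nt` is hypothetical.

```julia
# Hypothetical helper for illustration; converts only the top level.
shallow_nt(d::AbstractDict{<:AbstractString}) =
    NamedTuple{Tuple(Symbol(k) for k in keys(d))}(Tuple(values(d)))

a = Dict{String,Any}("name" => "joe", "age" => 30)
b = Dict{String,Any}("city" => "Oslo")

typeof(a) == typeof(b)                           # true: identical input types
typeof(shallow_nt(a)) == typeof(shallow_nt(b))   # false: output types differ
```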

haberdashPI commented 5 years ago

I can't say I understand why, fully, for this particular case, it's important to maintain the original format, since the semantics of JSON are the same irrespective of whether they are stored as a Dict or a NamedTuple.

That said, I don't need to get everything about your design choices to appreciate their consequences, which have, so far, resulted in a system I'm excited to try out. And I can easily do what I want, just by using the following.

convert(DataKnot,as_namedtuple(JSON.parse(file)))

So it is not a big deal either way.

clarkevans commented 5 years ago

David, thanks for this ticket. I'm glad you found a simple work-around. Let's keep this ticket open till we've provided more direct support for JSON values, offloaded to the internal query engine.

We need to support JSON no matter how it is provided or mixed into the query. For example, we could have combinators that fetch external resources via socket requests, or we may want our queries to work with JSON-valued columns stored in a PostgreSQL database. Further, there are other data sources that are not JSON but have the same set of challenges.

haberdashPI commented 5 years ago

Ah hah! That makes sense now. Thank you for explaining it to me. This is really neat!

dom-esotec commented 4 years ago

Hi, thanks for this package, it's super elegant. I think one of the great things about this approach is the ability to handle non-tabular data. So my first thought was to try it with some TopoJSON; however, I ran into this issue of converting a Dict to a DataKnot. I found a solution; hopefully it can be useful to you.

using DataKnots, JSON
file = download("https://raw.githubusercontent.com/deldersveld/topojson/master/countries/china/china-provinces.json","china-province.json")
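# dictkeys and dictvalues are not defined in this comment; the two definitions below are assumed
dictkeys(d::Dict) = Tuple(Symbol(k) for k in keys(d))
dictvalues(d::Dict) = Tuple(values(d))
# Constructs a named tuple from the dict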
function namedtuple(d::Dict{A,B}) where {A, B}
    keys = dictkeys(d)
    values = map(dictvalues(d)) do x
        namedtuple(x)
    end
    NamedTuple{keys}(values)
end

namedtuple(a::Array) = map(namedtuple, a)
namedtuple(x::Number) = x
namedtuple(x::AbstractString) = x
namedtuple(x::Any) = error("please define namedtuple for this type")

I then tried

nt = read(file, String) |> JSON.parse |> namedtuple;
DataKnot(:china => nt)
# Error
# type UnionAll has no field parameters

Interestingly, however, it works if you construct the DataKnot recursively, the same way as the named tuple:

function dataknot(nt::NamedTuple)
    nt_keys = keys(nt)
    nt_values = map(values(nt)) do v
        dataknot(v)
    end
    DataKnot([k => v for (k,v) in zip(nt_keys, nt_values)]...)
end

dataknot(a::Array) = map(dataknot,a)
dataknot(x::Number) = x
dataknot(x::AbstractString) = x
dataknot(x::Any) = error("please define dataknot for this type")

Then dataknot(nt) works as expected.

A side note that might be useful for tracking down the UnionAll error

# this works
b = DataKnot(:a => nt.objects.CHN_adm1.geometries)
c = DataKnot(:a => nt.objects.CHN_adm1.type)
d = DataKnot(:b => b, :c => c)
# but this errors
DataKnot(:chn => nt.objects.CHN_adm1)

onetonfoot commented 4 years ago

Oops, the above comment is mine; I used the wrong GitHub account.