HCADatalab / powderkeg

Live-coding the cluster!
Eclipse Public License 1.0

Support DataFrame/DataSet #2

Open cgrand opened 7 years ago

kovasb commented 7 years ago

I love this project.

Any updates on DataFrame support?

I imagine there's a lot of fun to be had translating clojure.spec specs <=> dataframe schemas...

cgrand commented 7 years ago

Hi @kovasb,

I would welcome your input on what value could be offered on data(frame|set).

Datasets have a mapPartitions method, so a transducer-based approach is possible, but that only scratches the surface.
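As a rough illustration of the transducer idea: mapPartitions works on one iterator per partition, so the core of a transducer-based approach is a function that turns an input iterator into an output iterator by running a transducer over it. The sketch below is plain Clojure with no Spark dependency; `transduce-partition` is a hypothetical name, and wiring it into a Dataset (function wrapper, encoder) is deliberately left out.

```clojure
;; Minimal sketch (assumption: not powderkeg's actual implementation).
;; Applies a transducer lazily to the elements of a java.util.Iterator
;; and returns a new java.util.Iterator over the results.
(defn transduce-partition
  [xform ^java.util.Iterator it]
  ;; eduction returns an Iterable that applies xform as it is consumed
  (.iterator ^Iterable (eduction xform (iterator-seq it))))

;; Usage: a plain vector iterator stands in for a partition.
(def xf (comp (filter even?) (map inc)))

(def result
  (vec (iterator-seq (transduce-partition xf (.iterator [1 2 3 4])))))
```

Because the result is itself an iterator, elements are pulled through the transducer on demand rather than realized up front.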

Eliminating references to Row (as we did with Tuple2) would be cool.

Records and spec are ways to get schemas, but I'm not sure I see how to put everything together.

We would really appreciate any help with the design.

Thanks.

kovasb commented 7 years ago

so i have a few ideas

  1. Make it easy to create a dataframe from Clojure data

    • There is a subset of spec that maps to dataframe schemas
    • Given a spec, we need a function (relax-spec the-spec) that relaxes the-spec until it contains only the mappable subset. Now we have a spec that matches the original data and can be mapped to dataframes.
      -- Can use 'describe' to walk specs.
      -- Can this be automated by providing a spec of the subset, and then using conform on the user spec?
    • Implement the automatic transformation of the data into the dataframe schema, so the user doesn't have to do it themselves (this is the payoff)
  2. Use specs within dataframe operations
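The relaxation step in idea 1 could look roughly like the following. This is only a sketch under stated assumptions: it uses `clojure.spec.alpha` (the namespace spec ships under in Clojure 1.9+), it walks `s/form` output rather than `describe`, and the `mappable-preds` whitelist, the `relaxations` table and the `relax-form` name are all hypothetical. Only `s/and`, `s/tuple` and `s/*` are handled.

```clojure
(require '[clojure.spec.alpha :as s])

;; Hypothetical whitelist of predicates assumed to have a dataframe type.
(def mappable-preds #{`string? `int? `boolean? `double?})

;; Hypothetical table: stricter predicate -> mappable predicate it implies.
(def relaxations {`pos-int? `int?, `nat-int? `int?, `neg-int? `int?})

(defn relax-form
  "Returns a relaxed version of a spec form that uses only the mappable
  subset, or nil when nothing mappable remains."
  [form]
  (cond
    (symbol? form) (or (mappable-preds form) (relaxations form))
    (seq? form)
    (let [[op & args] form]
      (condp = op
        ;; keep the first conjunct that has a mappable relaxation
        `s/and   (some relax-form args)
        ;; every field of a tuple must be mappable
        `s/tuple (let [rs (map relax-form args)]
                   (when (every? some? rs) (cons op rs)))
        `s/*     (when-let [r (relax-form (first args))] (list op r))
        nil))
    :else nil))

(def relaxed
  (relax-form (s/form (s/* (s/tuple string? (s/and pos-int? even?))))))
```

Here `relaxed` is the form of `(s/* (s/tuple string? int?))`. Since it only drops constraints, `(s/valid? (eval relaxed) [["a" 2]])` still holds for data that satisfied the original spec.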

cgrand commented 7 years ago

Slowly sinking in. spec wasn't a thing last time I thought about DF. It totally makes sense.

cgrand commented 7 years ago

Walking through s/every is buggy (cf. http://dev.clojure.org/jira/browse/CLJ-2035), but otherwise a PoC mapping works well:

=> (s/conform ::datatype (s/form (s/* (s/tuple string? int?))))
#object[org.apache.spark.sql.types.ArrayType 0x78cc0c02 "ArrayType(StructType(StructField(0,StringType,true), StructField(1,LongType,true)),true)"]

See https://gist.github.com/cgrand/dd1c71feb6c4a05194f9bae8ed8b1998 for impl
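For readers who want the general shape of "conforming a spec form with a spec" without opening the gist, here is an independent sketch. It is not the gist's implementation: it assumes `clojure.spec.alpha`, supports only a few predicates plus `s/tuple` and `s/*`, and produces plain keyword data instead of Spark types. The spec names `::primitive`, `::tuple` and `::array` and the output shape are hypothetical.

```clojure
(require '[clojure.spec.alpha :as s])

;; A predicate symbol conforms to a type keyword (assumed mapping).
(s/def ::primitive
  (s/conformer
    #(get {`string? :string, `int? :long, `boolean? :boolean, `double? :double}
          % ::s/invalid)))

;; (s/tuple a b ...) conforms to a struct with positional field names.
(s/def ::tuple
  (s/and (s/cat :op #{`s/tuple} :fields (s/* ::datatype))
         (s/conformer (fn [{:keys [fields]}]
                        [:struct (vec (map-indexed vector fields))]))))

;; (s/* x) conforms to an array of whatever x conforms to.
(s/def ::array
  (s/and (s/cat :op #{`s/*} :item ::datatype)
         (s/conformer (fn [{:keys [item]}] [:array item]))))

;; s/or tags its result as [tag value]; the conformer drops the tag.
(s/def ::datatype
  (s/and (s/or :prim ::primitive :struct ::tuple :array ::array)
         (s/conformer second)))

(def schema
  (s/conform ::datatype (s/form (s/* (s/tuple string? int?)))))
```

With these definitions `schema` is `[:array [:struct [[0 :string] [1 :long]]]]`, which has the same structure as the ArrayType/StructType result shown above.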