drone29a / data-frame

A data frame protocol and implementation for Clojure.
Eclipse Public License 1.0

core.matrix links #1

Open mikera opened 9 years ago

mikera commented 9 years ago

Hi there @mattrepl

Can I suggest that if you do a data frame implementation, you try to roll it into core.matrix? I think this would help for several reasons:

There is in fact already a rudimentary dataset implementation, and it would be great to upgrade this to something better. Here's the link to the current code:

https://github.com/mikera/core.matrix/blob/develop/src/main/clojure/clojure/core/matrix/impl/dataset.clj

drone29a commented 9 years ago

Hi, @mikera.

Ah, I didn't know there was already something in core.matrix. I'll take a closer look and see if we can merge the API I have in mind with the work already done in core.matrix.

My plan is to define a data frame protocol and write implementations for Clojure persistent vectors and core.matrix. The lib would also provide a handful of useful functions for manipulating tables.

Data frames will be used by a plotting library I'm writing and that library will support ClojureScript. Would it be easy (or desirable) for core.matrix to eventually support ClojureScript?

mikera commented 9 years ago

Yes, a shift to ClojureScript is definitely possible and is a target. From an API perspective it should be pretty simple, though it will need different implementations.

I think we need to ensure that array / dataset serialisation formats are nicely portable, but that should be easy.

@lbradstreet copying you in - you guys should probably be collaborating on this!

mikera commented 9 years ago

For those who haven't seen it, this discussion is worth checking out:

https://groups.google.com/d/topic/numerical-clojure/JHmhuK_vba0/discussion

lbradstreet commented 9 years ago

Hi @mattrepl,

I'd definitely be interested in what you had in mind implementation-wise. The current core.matrix dataset format is essentially just a record with a vector of column names and a vector of vectors containing column values (in the same column order as the column names). However, this will be a bit limiting for what we have in mind with respect to visualisation / plotting.

Whatever we decide, it would be great to settle on a standard data frame format.
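
To make the layout described above concrete, here is a minimal sketch of a record holding a vector of column names plus a vector of column vectors in the same order. This is an illustration only, not the actual core.matrix source: the names `Dataset`, `column`, `row` and `example` are hypothetical.

```clojure
;; Hypothetical sketch of the "record with column names + column vectors"
;; layout. Not the actual core.matrix implementation.
(defrecord Dataset [column-names columns])

(defn column
  "Returns the vector of values for the column named col-name,
  or nil if there is no such column."
  [ds col-name]
  (when-let [idx (first (keep-indexed (fn [i n] (when (= n col-name) i))
                                      (:column-names ds)))]
    (nth (:columns ds) idx)))

(defn row
  "Returns row i as a map from column name to value."
  [ds i]
  (zipmap (:column-names ds)
          (map #(nth % i) (:columns ds))))

;; Two columns, three rows; columns are stored in column-name order.
(def example
  (->Dataset [:x :y] [[1 2 3] [10 20 30]]))
```

With this layout, `(column example :y)` reads a whole column directly, while `(row example 1)` has to pick one element out of every column. That column-oriented bias is one reason the format may feel limiting for plotting work.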

drone29a commented 9 years ago

Hi @lbradstreet:

What sort of plots do you want to be able to create? I'm aiming to replicate most of ggplot2's features but with a different API.

I pushed a rough draft of a data frame protocol and simple implementation. It lacks update/modification functions and labeled row/column vectors (when retrieved through get-row or get-col) but is enough for me to resume work on the plotting library. The end goal for the data frame library would be to add core.matrix implementations (dense and sparse?) and various transformation and utility functions akin to those found in dplyr and reshape.
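
As a rough illustration of what such a protocol might look like (this is not the code in the repository), here is a minimal sketch with the two accessors named above, `get-row` and `get-col`, extended to Clojure persistent vectors. The protocol name `DataFrame`, the row-major vector-of-rows layout, and the `table` example are assumptions made for the example. Like the draft described above, it returns unlabeled vectors and has no update functions.

```clojure
;; Hypothetical sketch: a tiny data frame protocol with the two accessors
;; mentioned in the discussion. The name DataFrame is an assumption.
(defprotocol DataFrame
  (get-row [df i] "Returns row i as a vector of values.")
  (get-col [df j] "Returns column j as a vector of values."))

;; Treat a persistent vector of row vectors as an unlabeled data frame.
;; (JVM Clojure; a ClojureScript version would extend PersistentVector.)
(extend-protocol DataFrame
  clojure.lang.IPersistentVector
  (get-row [df i] (nth df i))
  (get-col [df j] (mapv #(nth % j) df)))

;; Three rows, two columns.
(def table [[1 :a] [2 :b] [3 :c]])
```

Here `(get-row table 1)` and `(get-col table 1)` both go through the protocol, so another backing store (for example a core.matrix array) could be swapped in by adding a further `extend-protocol` clause without changing callers.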

lbradstreet commented 9 years ago

That sounds very close to our goals too. Ideally it would support higher levels of dimensionality in the data (e.g. higher-dimensional matrices) so that facets can be supported without necessarily splitting datasets into subsets. Of course, one option is to nest the data frame; however, I'm not sure yet whether this is a good idea.

Are you planning to use Vega as the intermediate rendering language, as in ggvis?

mikera commented 9 years ago

FWIW I think that getting the abstraction right is more important than the precise concrete implementation. Multiple concrete classes could potentially implement the dataframe protocols - indeed I'd hope that any core.matrix array should work as a dataframe (it might be unlabelled, of course).

This aspect gets especially important with very large / sparse datasets where you really want a specialised underlying data representation.

Honestly, I think that core.matrix already has most of the protocols needed. I added some extra labelling protocols in this commit: https://github.com/mikera/core.matrix/commit/9757e738e18894182d5624e7739c29d0c6560bb5

drone29a commented 9 years ago

Completely agree, the abstraction is where it's at.

I didn't know about Vega until the recent mailing list discussion; I will check it out. A simple API for high-quality figures is more important to me than interactive graphics, but I think it's possible to do both with a single library. Along with Vega, I'll take a closer look at ggvis.

my-R-help commented 9 years ago

Just a minor comment, @mattrepl, since you mentioned the reshape package: you should take a look at the reshape2 package instead (which does less than the reshape package, but is more focused), or even the tidyr package, which again does less than reshape2, but is more focused. The reshape package is legacy.