chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

DataFrames in Chapel #8638

Open benjamin-robbins opened 6 years ago

benjamin-robbins commented 6 years ago

As a Chapel programmer, I want to be able to use a DataFrame (similar in basic functionality to Pandas for Python) in my Chapel application, so that I can manipulate my data easily with Chapel.

Acceptance Criteria: User loads data and is able to manipulate it through the DataFrame
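
For reference, a minimal sketch of that acceptance criterion in Pandas terms (the file name and column names here are made up):

```python
import pandas as pd

# Load a (hypothetical) CSV file into a DataFrame.
df = pd.read_csv("measurements.csv")  # columns: height, weight, group

# Manipulate: filter rows, derive a column, summarize by group.
tall = df[df["height"] > 1.8]
df["bmi"] = df["weight"] / df["height"] ** 2
print(df.groupby("group")["bmi"].mean())
```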

psahabu commented 6 years ago

DataFrames cookbook for Pandas: https://pandas.pydata.org/pandas-docs/stable/cookbook.html
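
A couple of the idioms that cookbook walks through, to give a sense of the surface area a Chapel DataFrame would be measured against (the small frame below is in the cookbook's own style of example data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"AAA": [4, 5, 6, 7],
                   "BBB": [10, 20, 30, 40],
                   "CCC": [100, 50, -30, -50]})

# if-then assignment on a boolean mask
df.loc[df["AAA"] >= 5, "BBB"] = -1

# derive a column by splitting on a condition
df["logic"] = np.where(df["AAA"] > 5, "high", "low")
print(df)
```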

buddha314 commented 6 years ago

There are a number of limitations to the DataFrame class (though in general it is awesome). Here are a few pieces of functionality we should call out.

Things It Does, Please Keep

Things They Don't Do But Really Should

We have done much of this in NumSuch's NamedMatrix class. We're happy to provide use cases. We are constantly developing new requirements, so expect this issue to get kinda flooded. The point is that we are active users and can keep you from being 3 years behind (like Pandas is getting).

buddha314 commented 6 years ago

Quick historical perspective. The Pandas Dataframe was modeled largely after the R dataframe, which was a bundle of pure awesomeness around the turn of the century. This was mostly useful in a REPL environment so statisticians could more easily manipulate their data and make real-time decisions.

My feeling is that Pandas dataframes are used largely within the context of Jupyter Notebooks. I do not see them used for data management, but I may be missing those use cases.

Spark RDDs took a different approach and went "distributed first". They cleverly integrated REPL and persistence workflows. However, Spark has since moved on to its own Spark DataFrames, which are geared more directly towards persistence and act more like a mini SQL database. Since Spark has always thought massive-data-first, this made sense for them.

So I would encourage the team to decide if they are satisfying a REPL need (like Pandas) or a more general data management need (like Spark Dataframes). I love the promise of both, personally. However, Chapel as a language seems more biased towards the latter.

ljdursi commented 6 years ago

Just to agree with @buddha314's comments and emphasize the difference between the R/Pandas approaches and Spark's that is most important to me: Spark went distributed-first and, vitally, included a number of powerful primitives to make a wide range of distributed computations on them feasible. These primitives are very different from what you'd have on a distributed mutable array, even an array of records, because the use case is different.

R dataframes (and, mystifyingly, even later iterations like the tibble) don't have those affordances, and as a result one struggles to do anything across nodes in R without manually sharding one's data structures and more or less postponing any combine operation until the end of the computation. Pandas mainly outsources such considerations to dask or ray, with mixed success.

Those primitives are vital. Spark’s are appropriate for immutable dataframes; they may or may not be the appropriate choice for a Chapeltastic design (for instance, I don’t think Chapel has aspirations for fault-tolerance). But at the least one must be able to use whatever primitives are implemented to do the same range of computations that Spark can.
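
To make the contrast concrete, here is a rough Python sketch (standard library only) of the manual-shard-and-postpone-the-combine pattern described above; the shard count and workload are arbitrary:

```python
from concurrent.futures import ProcessPoolExecutor

def shard_sum_of_squares(shard):
    # "Narrow" work: each worker touches only its own shard.
    return sum(x * x for x in shard)

if __name__ == "__main__":
    data = list(range(1_000_000))
    shards = [data[i::4] for i in range(4)]  # manual sharding
    with ProcessPoolExecutor(max_workers=4) as pool:
        partials = list(pool.map(shard_sum_of_squares, shards))
    # The combine is postponed to the very end; anything that needs data
    # to cross shards mid-computation has no primitive to lean on.
    print(sum(partials))
```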

benharsh commented 6 years ago

Thanks for all the information!

@buddha314 : In the case of a sparse DataFrame, are all of the elements of the same type? Also, it seems like a common DataFrame operation is to dynamically add and remove columns. If columns could not be added or removed then we could possibly store data in some optimized compressed format. Otherwise I'm not sure what the internal representation would look like (for performance).

@ljdursi : Can you elaborate on the Spark primitives? Are they arithmetic operators, reductions, rebalancing methods, or ...?
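
One illustration of what the internal representation could look like, sketched in Python with NumPy: a column store where each column is a homogeneous typed array and string columns are dictionary-encoded. This is just a sketch of the idea, not a proposed Chapel design:

```python
import numpy as np

# Each column is homogeneous even when the frame as a whole is mixed-type.
columns = {
    "height": np.array([1.7, 1.6, 1.9]),            # float64
    "count":  np.array([3, 1, 4], dtype=np.int64),  # int64
}

# A string column stored as dictionary-encoded codes: one small table of
# unique values plus an integer code per row -- one example of the kind of
# optimized compressed format mentioned above.
uniques, codes = np.unique(["a", "b", "a"], return_inverse=True)
columns["label_codes"] = codes
print(uniques[columns["label_codes"]])  # decode back to strings
```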

Tshimanga commented 6 years ago

I also have never seen Pandas in a module. To my knowledge, Pandas exists purely as a Jupyter (REPL) tool. If Chapel really does want to recreate Pandas, developing a productive REPL and/or Jupyter kernel would be a strict prerequisite to giving Chandas (hah) any purpose at all. That said, as @buddha314 and @ljdursi mentioned, this would all seem a digression from Chapel's main goals as a language.

buddha314 commented 6 years ago

@benharsh We have certainly come across multiple use cases where the data is (or is not) homogeneous. In R, you can have any type you want (also in Pandas). This is good when you're dealing with "is height correlated to ethnicity?" kinds of questions, so a row might be 1.7m, caucasian. In the REPL, you can create factors from the strings and do your modeling.
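
In Pandas terms, that string -> factor step looks roughly like this (made-up data):

```python
import pandas as pd

df = pd.DataFrame({"height": [1.7, 1.6, 1.8, 1.7],
                   "ethnicity": ["caucasian", "asian", "asian", "caucasian"]})

# Create a factor (categorical) from the strings, then use its integer
# codes for modeling.
df["ethnicity"] = df["ethnicity"].astype("category")
print(df["ethnicity"].cat.codes)
print(df.groupby("ethnicity", observed=True)["height"].mean())
```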

I don't recall specifically if Spark makes it quite that easy to go string -> factor -> model.

Importantly, when you get to large tables of mixed type, you're probably best off putting the data in a database. Of course, you can do any blend you like in SQL.

Then of course there is a simpler view where you have some sugar on basic real matrices. In that case, you want to use the dataframe like a basic matrix, but address it via labeled rows and columns. Between a named matrix and a full dataframe, Python and Spark both have a notion of labeled points: a matrix where each row's y value is kept separate from its x values. In this way, you can sample from your labeled data and keep it all together. Kinda like in here: https://spark.apache.org/examples.html
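
Roughly, in Pandas terms (the labels and data below are hypothetical):

```python
import pandas as pd

# A "named matrix": a plain real matrix addressed by row/column labels.
m = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]],
                 index=["row_a", "row_b"],
                 columns=["col_x", "col_y"])
print(m.loc["row_a", "col_y"])  # address by label, not position

# A labeled-point view: keep the y value with, but separate from, the xs,
# so that sampling keeps them together.
data = pd.DataFrame({"y": [0, 1, 1],
                     "x1": [0.5, 1.5, 2.5],
                     "x2": [3.0, 2.0, 1.0]})
sample = data.sample(n=2, random_state=0)
y, X = sample["y"], sample[["x1", "x2"]]
```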

For the record, the REPL is extremely important to analysis.

ljdursi commented 6 years ago

Hi @benharsh - yes, for the underlying RDDs they are generalizations of map/reduce plus some other things:

[Figure: overview of Spark RDD operations (transformations and actions), from https://training.databricks.com/visualapi.pdf]

(The distinction between transformations and actions is that transformations are queued up lazily, while actions trigger computation.) "Narrow" operations like map are trivial; it's the "wide" operations that cross shards, like the reductions, that need some thought.
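
A small PySpark sketch of that distinction, assuming a local Spark installation:

```python
from pyspark import SparkContext

sc = SparkContext("local[4]", "primitives-demo")
rdd = sc.parallelize(range(10))

# Transformations: queued up lazily, nothing runs yet.
pairs = rdd.map(lambda x: (x % 2, x))         # narrow: stays within a shard
sums = pairs.reduceByKey(lambda a, b: a + b)  # wide: data crosses shards

# Action: triggers the actual computation.
print(sums.collect())  # [(0, 20), (1, 25)] (order may vary)
sc.stop()
```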

For dataframes/datasets, as @buddha314 has suggested, there are additional things like SQL-like access, imports/exports, and schema operations.
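
And a corresponding sketch of that dataframe/dataset layer:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("df-demo").getOrCreate()
df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])

# Schema operations and SQL-like access over the same engine.
df.printSchema()
df.createOrReplaceTempView("t")
spark.sql("SELECT key, SUM(value) AS total FROM t GROUP BY key").show()
spark.stop()
```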

ct-clmsn commented 6 years ago

Is there a set intersection of features between RDDs, Dataframes, and HDF5 for storage? Could HDF5 serve as the disk storage for something like an RDD or Dataframe? If HDF5 + FastQuery is insufficient, could an existing columnar storage technology like MonetDB be used as a backend to an RDD or Dataframe?

Someone at LBL figured out how to map SQL-style operations over HDF5 tables; they call it FastQuery. The result is something with features similar in form to the query operations provided by RDDs and DataFrames. Below are some papers that describe the work - if you google around, there are more out there (it's a whole series of studies out of LBL):

http://vis.lbl.gov/Events/SC05/HDF5FastQuery/index.html

http://vis.lbl.gov/~kurts/research/ssdbm2006-hdf5-fastquery.pdf

https://sdm.lbl.gov/~kewu/ps/LBNL-5315E.html

https://ieeexplore.ieee.org/document/1644309/

FastQuery is opensource and available here:

https://github.com/albert7997/FastQuery

FastQuery uses the FastBit software for managing queries. FastBit is available here:

https://github.com/gingi/fastbit

As for HDF5, a really nice C++ interface has been developed, called HighFive. It might be worth considering for building out an HDF5 library that could then be used to implement a dataframe or RDD using the FastQuery concept.

https://github.com/BlueBrain/HighFive

More details on MonetDB can be found here:

https://www.monetdb.org/Home
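
For a feel of what FastQuery layers over HDF5, here is the naive form of such a query in Python with h5py (FastQuery's contribution is answering this kind of selection via bitmap indexes rather than the full scan shown here; the file and dataset names are made up):

```python
import h5py
import numpy as np

# Build a small HDF5 "table": one dataset per column.
with h5py.File("demo.h5", "w") as f:
    f["temperature"] = np.array([280.0, 300.5, 315.2, 290.1])
    f["pressure"] = np.array([1.0, 0.9, 1.2, 1.1])

# The SQL-ish query "SELECT pressure WHERE temperature > 295":
with h5py.File("demo.h5", "r") as f:
    mask = f["temperature"][...] > 295.0  # full scan; FastQuery would
    print(f["pressure"][...][mask])       # consult a bitmap index instead
```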

psahabu commented 6 years ago

The vision is to connect DataFrames to HDF5 in the near future, yes. @daviditen has been working on the HDF5 effort: #8640

I'm not sure what our plans are with regards to Resilient Distributed Datasets (RDD).

ben-albrecht commented 5 years ago

The Pandas 2.0 design document contains a lot of lessons learned from the initial Pandas design, and is probably worth a read for anyone doing dataframes development.