benjamin-robbins opened this issue 6 years ago
DataFrames cookbook for Pandas: https://pandas.pydata.org/pandas-docs/stable/cookbook.html
There are a number of limitations to the DataFrame class (though in general it is awesome). Here are a few pieces of functionality we should call out.
Things It Does, Please Keep

- transpose(X).dot(Y)
- .rowMin/Max, argMin/Max on the underlying matrix

Things They Don't Do But Really Should

- join, like in Python, but meaning "use the columns of A and rows of B that match when multiplying AB"
- Labeled points like (y, x1, x2, x3, ..., xn)
- Randomize the rows under a non-uniform regime.

We have done much of this in the NumSuch NamedMatrix class. We're happy to provide use cases. We are constantly developing new requirements, so expect this to get kinda flooded. The point is we are active users and can keep you from being 3 years behind (like pandas is getting).
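For concreteness, here is a minimal pandas sketch of the "keep" items above (transpose-then-dot plus row-wise min/max and argmin/argmax); the data and variable names are illustrative, not taken from NumSuch.

```python
import pandas as pd
import numpy as np

# Illustrative data; not from the NumSuch NamedMatrix code.
X = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["a", "b"])
Y = pd.DataFrame(np.ones((3, 2)), columns=["c", "d"])

# transpose(X).dot(Y): label-aware matrix product of X^T and Y.
XtY = X.T.dot(Y)

# Row-wise min/max and argmin/argmax on the underlying matrix.
row_min = X.min(axis=1)
row_argmax = X.idxmax(axis=1)  # column label holding the max of each row

print(XtY)
print(row_min)
print(row_argmax)
```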
Quick historical perspective. The Pandas Dataframe was modeled largely after the R dataframe, which was a bundle of pure awesomeness around the turn of the century. This was mostly useful in a REPL environment so statisticians could more easily manipulate their data and make real-time decisions.
My feeling is that Pandas dataframes are used largely within the context of Jupyter Notebooks. I do not see them used for data management, but I may be missing those use cases.
Spark RDDs took a different approach and went "distributed first". They cleverly integrated REPL and persistence workflows. However, Spark has moved on to their own version of Spark Dataframes which are geared more directly towards persistence and kind of acting like a mini SQL database. Since Spark has always thought massive data first, this made sense for them.
So I would encourage the team to decide if they are satisfying a REPL need (like Pandas) or a more general data management need (like Spark Dataframes). I love the promise of both, personally. However, Chapel as a language seems more biased towards the latter.
Just to agree with @buddha314’s comments and emphasize the difference between R/pandas approaches and Spark’s that is most important to me: Spark went distributed-first and, vitally, included a number of powerful primitives to make a wide range of distributed computations on them feasible. These primitives are very different than what you’d have on a distributed mutable array, even an array of records, because the use case is different.
R dataframes (and, mystifyingly, even later versions like the tibble) don’t have those affordances, and as a result one struggles to do anything across nodes with R without manually sharding one's data structures and more or less postponing any combine operation until the end of the computation. Pandas mainly outsources such considerations to dask or ray, with mixed success.
Those primitives are vital. Spark’s are appropriate for immutable dataframes; they may or may not be the appropriate choice for a Chapeltastic design (for instance, I don’t think Chapel has aspirations for fault-tolerance). But at the least one must be able to use whatever primitives are implemented to do the same range of computations that Spark can.
Thanks for all the information!
@buddha314 : In the case of a sparse DataFrame, are all of the elements of the same type? Also, it seems like a common DataFrame operation is to dynamically add and remove columns. If columns could not be added or removed then we could possibly store data in some optimized compressed format. Otherwise I'm not sure what the internal representation would look like (for performance).
@ljdursi : Can you elaborate on the Spark primitives? Are they arithmetic operators, reductions, rebalancing methods, or ...?
I also have never seen pandas in a module. To my knowledge, Pandas exists purely as a Jupyter (REPL) tool. If Chapel really does want to recreate Pandas, developing a productive REPL and/or Jupyter kernel would be a strict prerequisite to giving Chandas (hah) any purpose at all. That said, like @buddha314 and @ljdursi mentioned, this would all seem a digression from Chapel's main goals as a language.
@benharsh We have certainly come across multiple use cases where the data is (and is not) homogeneous. In R, you can have any type you want (also in Pandas). This is good when you're dealing with "is height correlated to ethnicity" kinds of things. So a row might be 1.7m, caucasian.
In the REPL, you can create factors from the strings and do your modeling.
I don't recall specifically if Spark makes it quite that easy to go string -> factor -> model.
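For reference, a small pandas sketch of the string -> factor flow described above; the column names and values here are made up.

```python
import pandas as pd

# Hypothetical mixed-type rows like (1.7m, caucasian).
df = pd.DataFrame({
    "height_m": [1.7, 1.8, 1.6],
    "ethnicity": ["caucasian", "asian", "caucasian"],
})

# Turn the string column into a factor (categorical) and pull out integer
# codes that a numeric model can consume.
df["ethnicity"] = df["ethnicity"].astype("category")
codes = df["ethnicity"].cat.codes

print(df.dtypes)
print(codes.tolist())  # e.g. [1, 0, 1]
```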
Importantly, when you get to large tables of mixed type, you're probably best off putting the data in a database. Of course, you can do any blend you like in SQL.
Then of course there is a simpler view where you have some sugar on basic real matrices. In that case, you want to use the dataframe like a basic matrix, but address it via the labeled rows and columns. Between a named matrix and a full dataframe, Python and Spark both have a notion of labeled points. That is, a matrix with a row that is the y value separate from the x values. In this way, you can sample from your labeled data and keep it all together. Kinda like in here: https://spark.apache.org/examples.html
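A hedged pandas sketch of the labeled-point idea: keep y alongside the x columns so sampling rows preserves the pairing. The column names are hypothetical.

```python
import pandas as pd

data = pd.DataFrame({
    "y":  [0, 1, 1, 0],
    "x1": [0.2, 1.4, 0.9, 0.1],
    "x2": [3.0, 2.5, 2.7, 3.3],
})

# Sampling rows keeps each label attached to its features.
sample = data.sample(n=2, random_state=0)

y = sample["y"]
X = sample.drop(columns="y")
print(y.tolist(), X.values.tolist())
```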
For the record, the REPL is extremely important to analysis.
Hi @benharsh - yes, for the underlying RDDs they are generalizations of map/reduce plus some other things:
[figure: overview of RDD transformations and actions, from https://training.databricks.com/visualapi.pdf]
(The distinction between transformations and actions is that transformations are queued up lazily, actions trigger computations). "Narrow" operations like map are trivial, it's the "Wide" operations that cross shards, like the reduction operations, which need some thought.
For dataframes/datasets, as @buddha314 has suggested, there are additional things like SQL-like access, imports/exports, and schema operations.
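To make the transformation/action and narrow/wide distinction concrete, here is a small PySpark sketch (assuming a local pyspark install; the data is made up):

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "rdd-primitives-demo")

rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)], numSlices=2)

# "Narrow" transformation: map works within each partition, no shuffle.
doubled = rdd.map(lambda kv: (kv[0], kv[1] * 2))

# "Wide" transformation: reduceByKey must move matching keys across partitions.
summed = doubled.reduceByKey(lambda x, y: x + y)

# Nothing has executed yet -- transformations are queued lazily.
# The action below triggers the whole computation.
print(summed.collect())  # e.g. [('a', 8), ('b', 4)]

sc.stop()
```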
Is there a set intersection of features between RDDs, Dataframes, and HDF5 for storage? Could HDF5 serve as the disk storage for something like an RDD or Dataframe? If HDF5 + FastQuery is insufficient, could an existing columnar storage technology like MonetDB be used as a backend to an RDD or Dataframe?
Someone at LBL figured out how to map SQL-style operations over HDF5 tables; they call it FastQuery. The result is something that has features similar in form to the query operations provided by RDDs and Dataframes. Below are some papers that describe the work - if you google around, there are more papers out there (it's a whole series of studies out of LBL):
http://vis.lbl.gov/Events/SC05/HDF5FastQuery/index.html
http://vis.lbl.gov/~kurts/research/ssdbm2006-hdf5-fastquery.pdf
https://sdm.lbl.gov/~kewu/ps/LBNL-5315E.html
https://ieeexplore.ieee.org/document/1644309/
FastQuery is opensource and available here:
https://github.com/albert7997/FastQuery
FastQuery uses the fastbit software for managing queries. fastbit is available here:
https://github.com/gingi/fastbit
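This is not FastQuery itself, just a rough h5py/NumPy sketch of the kind of predicate query over an HDF5 table that FastQuery accelerates with bitmap indexes; the file and field names are hypothetical.

```python
import h5py
import numpy as np

# A tiny "table" stored as an HDF5 compound dataset.
records = np.array(
    [(1, 0.5), (2, 2.5), (3, 1.0)],
    dtype=[("id", "i4"), ("value", "f8")],
)

with h5py.File("demo.h5", "w") as f:
    f.create_dataset("table", data=records)

with h5py.File("demo.h5", "r") as f:
    table = f["table"][:]               # read the whole table into memory
    hits = table[table["value"] > 0.8]  # roughly: SELECT * WHERE value > 0.8
    print(hits["id"])                   # e.g. [2 3]
```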
As for HDF5, a really nice C++ interface has been developed. It's called HighFive. It might be worth consideration for building out an HDF5 library that could then be used to implement a dataframe or RDD using the FastQuery concept.
https://github.com/BlueBrain/HighFive
More details on MonetDB can be found here: https://www.monetdb.org/
The vision is to connect DataFrames to HDF5 in the near future, yes. @daviditen has been working on the HDF5 effort: #8640
I'm not sure what our plans are with regards to Resilient Distributed Datasets (RDD).
The Pandas 2.0 design document contains a lot of lessons learned from the initial Pandas design, and is probably worth a read for anyone doing dataframes development.
As a Chapel programmer, I want to be able to use a DataFrame (similar in basic functionality to Pandas for Python) in my Chapel application so that I can manipulate my data easily with Chapel.
Acceptance Criteria: User loads data and is able to manipulate it through the DataFrame
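As a rough pandas baseline for that acceptance criterion, the load-and-manipulate flow being targeted looks something like this (the file and column names are hypothetical):

```python
import pandas as pd

df = pd.read_csv("measurements.csv")          # load data
summary = df.groupby("site")["value"].mean()  # manipulate it through the DataFrame
print(summary)
```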