acowley / Frames

Data frames for tabular data.

Composable Folds and map-reduce #117

Open o1lo01ol1o opened 6 years ago

o1lo01ol1o commented 6 years ago

As discussed in the DataHaskell Gitter, a (composable) version of split-apply-combine or map-reduce would be a welcome addition to the Frames API. For clarity, here's a comment outlining the desiderata:

By way of a simple example, let's take the Iris data and add an additional Int :-> "AgeOfPlant" column. Now say I wanted to calculate 1) the standard deviation (std) of SepalWidth for all samples sharing the same values of AgeOfPlant and Species and 2) each of those calculated values divided by the whole-dataset std of SepalWidth.

For 1), we would have one Fold (from Foldl) expressing the std for each set of records indexed by the unique values of AgeOfPlant and Species and one Fold expressing the traversal needed to construct each of these sets. For 2), we would have a single Fold expressing the std in the usual way. (I believe we can then sequenceA and join those to calculate the reduction in a single traversal, but it's been a bit since I used Foldl, so maybe I'm wrong.)
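For concreteness, here is a minimal sketch of 1) and 2) using only Control.Foldl (no Frames machinery), assuming the grouping key and the SepalWidth value have already been projected out of each row; groupedVsOverall, key, and val are made-up names for illustration:

import qualified Control.Foldl as L
import qualified Data.Map.Strict as M

-- Per-group std divided by the whole-dataset std, computed in a single
-- traversal via the Applicative instance of Fold.
groupedVsOverall :: Ord k => (row -> k) -> (row -> Double) -> L.Fold row (M.Map k Double)
groupedVsOverall key val =
  normalize <$> L.groupBy key (L.premap val L.std)  -- std within each group
            <*> L.premap val L.std                  -- std over the whole dataset
  where
    normalize groups overall = fmap (/ overall) groups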

For the sake of feature exposition, we should then do a left join of that reduction back onto the original dataset by AgeOfPlant and Species so we have the group statistics ready to be further aggregated in Folds.
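A plain-Haskell stand-in for that join step could be a Map lookup per row (attachGroupStat is a hypothetical name; the real thing would be Frames' left join keyed on AgeOfPlant and Species):

import qualified Data.Map.Strict as M

-- Pair each row with the statistic of its group, as computed by the
-- reduction above; Nothing only occurs if a key is somehow absent.
attachGroupStat :: Ord k => (row -> k) -> M.Map k Double -> [row] -> [(row, Maybe Double)]
attachGroupStat key stats = map (\r -> (r, M.lookup (key r) stats))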

This could of course be written ad hoc as needed, but it's frequent enough that it has its own name in pandas and R: split-apply-combine, and more generally map-reduce. It probably deserves its own declarative abstraction in Frames, if only to save the keystrokes of writing it for every exploration operation. It would also be worth benchmarking against the pandas equivalent; there's a chance the single traversals of Fold would yield significant performance gains over the pandas versions (assuming the monoidal structure of the groups is exploited in parallel in Foldl).

The titanic dataset provides enough categorical variables to test this. Let's take the above example with Age and pclass and find the standard deviation of survival per group divided by the whole-dataset standard deviation.
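Reusing the groupedVsOverall sketch above, usage might look like the following, with a toy row type standing in for the real titanic records (the actual code would use Frames' generated column types and lenses):

import qualified Control.Foldl as L
import qualified Data.Map.Strict as M

data Passenger = Passenger { age :: Int, pclass :: Int, survived :: Int }

-- Std of survival (treated as 0/1) per (age, pclass) group, divided by the
-- whole-dataset std, in one pass over the rows.
survivalStdRatios :: [Passenger] -> M.Map (Int, Int) Double
survivalStdRatios = L.fold (groupedVsOverall (\p -> (age p, pclass p))
                                             (fromIntegral . survived))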

acowley commented 6 years ago

The first thing I wanted to do was quite invasive, but it crossed out a TODO item of mine that was something like three years old: infer types appropriate for categorical variables.

There is a test case based on this module that demonstrates how to use it.

The generated types have Enum and Bounded instances (among others), so should be very useful for efficiently comparing and grouping.
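For reference, a hand-written approximation of what such an inferred type might look like (the actual generated names and instances may differ):

data Species = Setosa | Versicolor | Virginica
  deriving (Eq, Ord, Show, Enum, Bounded)

-- Enum and Bounded give a cheap, total enumeration of the domain,
-- e.g. [minBound .. maxBound] :: [Species], which is what makes
-- comparison and grouping efficient.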

Some questions regarding categorical variables:

djhogan commented 6 years ago

My 2 cents (in order of your previous questions):

  1. I don't think there's a wrong answer here, but speaking from personal experience, the number of categorical variables can vary widely. Consider a dataset of people's heights, where each row contains the name of the person and a measurement, and there are replicate measurements for each person. A typical use case would be to group by the name of the person and calculate the mean of the measurements.

However, I may be conflating a Text variable and a categorical variable here. I suppose a categorical variable should have a known domain (e.g. month of year) so it can be efficiently encoded and validated (see the sketch after this list). Otherwise, what are the advantages of categorical variables over Text?

  2. Personally, I'm now of the mind that we shouldn't automatically generate categorical variables. With the month example, if a month value is absent from the data, the generated type will be missing a valid value.

  3. I don't really have any thoughts here.

  4. Absent any other reason, I think we should make it case sensitive. Off the top of my head I can think of one representation that prefers case-sensitivity. Genotypes are often denoted with lower and upper case letters, where lower case denotes a recessive allele and upper case denotes a dominant allele. I know this is a rather specific use case, but in the absence of any other strong argument, this might be enough of a reason.
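As a small illustration of the encoding/validation point in 1.: with a closed categorical type, membership in the domain can be checked once at parse time, which raw Text cannot offer (parseCategory is a made-up helper):

import qualified Data.Text as T

-- Accept a value only if it names one of the type's constructors.
parseCategory :: (Enum a, Bounded a, Show a) => T.Text -> Maybe a
parseCategory t = lookup t [ (T.pack (show v), v) | v <- [minBound .. maxBound] ]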

o1lo01ol1o commented 6 years ago

Regarding the question of generating categorical variables, I think one should provide some simple inference but expect that the user will want to specify the domain in a sum type or as an open sum of Text. For example, if I load up a random dataset, it would be nice to be able to get GHCi to show the inferred column types with generated categorical placeholders so I know where I need to inspect the domain of values. Chances are that I'd then roll my own sum types for small cardinality and rely on Text the rest of the time.

In either case, something I miss frequently in pandas/sklearn is the ability to define an “other” category. This comes up frequently when you have an ML pipeline that suddenly gets an unseen categorical value during inference. Most times you’d just want to map it to “other” as opposed to retraining a model with the new category.
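One lightweight way to get that behavior on top of a generated categorical type would be a wrapper like the following (WithOther and toKnownOrOther are made-up names, not anything Frames provides):

-- Any value outside the known domain collapses to Other instead of failing,
-- which is usually what you want at ML inference time.
data WithOther a = Known a | Other
  deriving (Eq, Ord, Show)

toKnownOrOther :: Maybe a -> WithOther a
toKnownOrOther = maybe Other Known

-- e.g. toKnownOrOther . parseCategory, reusing parseCategory from above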


adamConnerSax commented 5 years ago

Not sure if this is helpful, but I've been working toward pieces of this. The general part (mostly just wrappers around Control.Foldl) is here: https://github.com/adamConnerSax/Frames-utils/blob/master/src/Control/MapReduce/Core.hs with some simpler interfaces and helpers here: https://github.com/adamConnerSax/Frames-utils/blob/master/src/Control/MapReduce/Simple.hs and a Frames-specific interface here: https://github.com/adamConnerSax/Frames-utils/blob/master/src/Frames/MapReduce.hs. There are also some first stabs at using Control.Parallel.Strategies.

Some examples are here: https://github.com/adamConnerSax/Frames-utils/blob/master/examples/MapReduce.hs#L125

The types are a little atrocious, but that's to allow a lot of generality along a few axes (a rough sketch of the overall shape follows the list):

  1. The unpacking and assigning types (the parts of what you're calling "split"), which basically map a row into more or fewer rows (filtering) and also select key and data columns for grouping. Also, for Frames I need to account for the fact that the row type depends on the columns; this would all simplify quite a bit if all the row types were the same.
  2. Grouping can use a variety of intermediate structures. The end user doesn't need to know, but flexibility here is helpful for optimization.
  3. You may have a monadic unpacking (something like a bootstrap) or reduction (e.g., initial guesses for kMeans), so I support both cases.
  4. Keys may be Ord or Hashable and I want to handle both, so that gets carried around as a type variable.
  5. For parallel map-reduce, you need an NFData constraint in some funny places, so that possibility gets carried around as a type variable.
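This is not the actual Frames-utils API, but a rough sketch of the shape those pieces compose into for the pure, Ord-keyed case (all names are illustrative only):

import qualified Control.Foldl as L
import qualified Data.Map.Strict as M

-- unpack: a row expands to zero or more items (filtering, melting)
-- assign: each item is split into a grouping key and a data payload
-- reduce: a Fold collapses each group's payloads to a result
mapReduceFold
  :: Ord k
  => (row -> [item])   -- unpack
  -> (item -> (k, d))  -- assign
  -> L.Fold d r        -- reduce, run once per group
  -> L.Fold row (M.Map k r)
mapReduceFold unpack assign reduce =
  L.premap (map assign . unpack)          -- row -> [(key, payload)]
    (L.handles L.folded                   -- fold over each list's elements
      (L.groupBy fst (L.premap snd reduce)))

The monadic, Hashable, and NFData variants described above would layer extra type variables and constraints on top of this shape.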

A tangent on the categorical variable thing: would it be possible/easy (my TH is very rusty and was never very good!) to optionally create new column types for each categorical value? This would make "one-hot" encoding very simple. That is, if your categorical variable is called "Pet" and has possible values "Dog", "Cat", "Hamster", you would effectively also do

declareColumn "PetDog" 'Bool
declareColumn "PetCat" 'Bool
declareColumn "PetHamster" 'Bool

instance OneHot Pet where
   type OneHotCols Pet = '[PetDog,PetCat,PetHamster]
   oneHot :: Snd Pet -> Record OneHotCols 

where the oneHot function does the obvious thing of putting True in the matching column and False in the rest. I think Int (using 1 or 0) might be easier for a number of learning models but that seems silly and can be handled pretty straightforwardly at the interface to the regression or whatever.
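For the oneHot body itself, the Frames-free core is simple once the categorical type has Enum and Bounded instances; wiring the resulting Bools into the generated Record columns is the Template Haskell part (oneHotBools is a made-up name):

-- True exactly at the position of the matching constructor, False elsewhere.
oneHotBools :: (Enum a, Bounded a, Eq a) => a -> [Bool]
oneHotBools x = [ x == v | v <- [minBound .. maxBound] ]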