Open o1lo01ol1o opened 6 years ago
The first thing I wanted to do was quite invasive, but it crossed out a TODO item of mine that was something like three years old: infer types appropriate for categorical variables.
There is a test case based on this module that demonstrates how to use it.
The generated types have Enum and Bounded instances (among others), so they should be very useful for efficiently comparing and grouping.
Some questions regarding categorical variables:
I currently fall back to Text after 8 distinct variants. This is not terribly hard for users to change, but is 8 a good default?
Should we update tableTypes to generate these things? I didn't do so yet since it will break existing code. There's also the issue of false positives: sometimes you'd prefer Text values, and overriding that choice for one column is not as easy as one might hope.
The naming isn't great. Consider the test example where we have a column titled month. In the old Frames world, this would give us a column type of type Month = "month" :-> Text. But now that we want to generate a custom data type for the value part of the column (i.e. it has a name, "month", and a value, Text), what should we call it? Not Month, since that's the name of the column, so I prefix the row name on the column name, giving us data RowMonth = ....
Now consider the individual variants. Many data files use abbreviated names for categorical variables, so introducing them as top-level identifiers without much change would seem to invite a tremendous amount of name collision. To avoid this, I prefix the data constructor names by the data type name, giving us things like RowMonthJanuary. That's a mouthful, but I still haven't thought of a better way. The good part is that you can turn on OverloadedStrings and write out the value as it's written in the data file (e.g. ("January" :: RowMonth) == RowMonthJanuary). When you're pattern matching, hopefully you have auto-completion or automatic case splitting in your editor.
Should categorical variants be case sensitive? They are now. Probably this should be overridable, but what's the right default?
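To make the naming scheme above concrete, here is a minimal hand-written sketch of roughly what a generated type for the month column could look like. It is not the actual Template Haskell output: the three-variant domain, the IsString instance body, and the allMonths helper are assumptions made for illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.String (IsString (..))

-- Hand-written stand-in for a generated categorical type. The real
-- generator's output may differ; only three variants are assumed here.
data RowMonth
  = RowMonthJanuary
  | RowMonthFebruary
  | RowMonthMarch
  deriving (Eq, Ord, Show, Enum, Bounded)

-- Lets a string literal stand for a variant when OverloadedStrings is on.
-- Matching is case sensitive, and an unknown literal is a runtime error.
instance IsString RowMonth where
  fromString "January"  = RowMonthJanuary
  fromString "February" = RowMonthFebruary
  fromString "March"    = RowMonthMarch
  fromString other      = error ("RowMonth: unknown variant " ++ show other)

-- Enum and Bounded give the full domain without listing it again.
allMonths :: [RowMonth]
allMonths = [minBound .. maxBound]
```

Note that a month missing from the data file would simply be missing from such a type, which is one of the false-positive risks raised above.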
On Aug 11, 2018, at 5:29 PM, Daniel Hogan wrote:
My 2 cents (in order of your previous questions):
I don't think there's a wrong answer here, but speaking from personal experience, the number of categorical variables can vary widely. Consider a dataset of people's heights, where each row contains the name of the person and a measurement, and there are replicate measurements for each person. A typical use case would be to group by the name of the person and calculate the mean of the measurements. However, I may be conflating a Text variable and a categorical variable here. I suppose a categorical variable should have a known domain (e.g. month of year) so that it can be efficiently encoded and validated. Otherwise, what are the advantages of categorical variables over Text?
Personally, I'm now of the mind that we shouldn't automatically generate categorical variables. With the month example, if a month value is absent from the data, the generated type will be missing a valid value.
I don't really have any thoughts here.
Absent any other reason, I think we should make it case sensitive. Off the top of my head I can think of one representation that prefers case sensitivity. Genotypes are often denoted with lower and upper case letters, where lower case denotes a recessive allele and upper case denotes a dominant allele. I know this is a rather specific use case, but in the absence of any other strong argument, this might be enough of a reason.
Wrt the question of generating categorical variables, I think one should provide some simple inference but expect that the user will want to specify the domain in a sum type or as an open sum of Text. For example, if I load up a random dataset, it would be nice to be able to get GHCi to give the inferred column types with generated categorical placeholders so I know where I need to inspect the domain of values. Chances are that I’d then roll my own sum types for small cardinality and rely on Text the rest of the time.
In either case, something I miss frequently in pandas/sklearn is the ability to define an “other” category. This comes up frequently when you have an ML pipeline that suddenly gets an unseen categorical value during inference. Most times you’d just want to map it to “other” as opposed to retraining a model with the new category.
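The “other” category idea mentioned in this exchange could be sketched as a small wrapper type. Everything here is hypothetical (WithOther, variantTable, parseWithOther, and the Pet type are illustrative names, not part of Frames), and using show as the on-disk spelling of a variant is an assumption of the sketch.

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical wrapper: either a variant known when the type was defined,
-- or a catch-all for any value not seen before.
data WithOther a = Known a | Other
  deriving (Eq, Ord, Show)

-- Illustrative categorical type with a small, fixed domain.
data Pet = Dog | Cat | Hamster
  deriving (Eq, Ord, Show, Enum, Bounded)

-- Lookup table from the rendered name of every variant to the variant.
-- Assumes the data file spells each variant exactly as 'show' renders it.
variantTable :: (Show a, Enum a, Bounded a) => Map.Map String a
variantTable = Map.fromList [(show v, v) | v <- [minBound .. maxBound]]

-- Parse a raw cell, mapping unseen values to 'Other' instead of failing.
-- The comparison is case sensitive.
parseWithOther :: (Show a, Enum a, Bounded a) => String -> WithOther a
parseWithOther raw = maybe Other Known (Map.lookup raw variantTable)
```

With this shape, a value that first appears at inference time parses to Other rather than forcing a change to the type.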
Not sure if this is helpful, but I've been working toward pieces of this and the general part (mostly just wrappers around Control.Foldl) is here: https://github.com/adamConnerSax/Frames-utils/blob/master/src/Control/MapReduce/Core.hs with some simpler interfaces and helpers here: https://github.com/adamConnerSax/Frames-utils/blob/master/src/Control/MapReduce/Simple.hs and a Frames-specific interface here: https://github.com/adamConnerSax/Frames-utils/blob/master/src/Frames/MapReduce.hs There are some first stabs at using Control.Parallel.Strategies as well.
Some examples are here: https://github.com/adamConnerSax/Frames-utils/blob/master/examples/MapReduce.hs#L125
The types are a little atrocious but that's to allow a lot of generality along a few axes:
A tangent on the categorical variable thing: would it be possible/easy (my TH is very rusty and was never very good!) to optionally create new column types for each categorical value? This would make "one-hot" encoding very simple. That is, if your categorical variable is called "Pet" and has possible values "Dog", "Cat", "Hamster", you would effectively also do
    declareColumn "PetDog" ''Bool
    declareColumn "PetCat" ''Bool
    declareColumn "PetHamster" ''Bool

    instance OneHot Pet where
      type OneHotCols Pet = '[PetDog, PetCat, PetHamster]
      oneHot :: Snd Pet -> Record (OneHotCols Pet)
where the oneHot function does the obvious thing of putting True in the matching column and False in the rest. I think Int (using 1 or 0) might be easier for a number of learning models but that seems silly and can be handled pretty straightforwardly at the interface to the regression or whatever.
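Leaving the Frames column machinery aside, the core of the one-hot idea can be sketched on a plain sum type using only Enum and Bounded. The Pet type and the list-based representation are assumptions for illustration; a real version would produce a record of Bool columns as proposed above.

```haskell
-- Illustrative categorical type; the variant order fixes the column order.
data Pet = Dog | Cat | Hamster
  deriving (Eq, Show, Enum, Bounded)

-- One Bool per variant, True only in the position of the given value.
oneHot :: (Eq a, Enum a, Bounded a) => a -> [Bool]
oneHot x = [x == v | v <- [minBound .. maxBound]]

-- The same encoding as 0/1 integers, converted at the boundary to a
-- numeric model, since fromEnum maps False to 0 and True to 1.
oneHotInt :: (Eq a, Enum a, Bounded a) => a -> [Int]
oneHotInt = map fromEnum . oneHot
```

Keeping Bool as the primary encoding and converting to Int only where a model needs numbers matches the preference stated above.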
As discussed in the dataHaskell gitter, a (composable) version of split-apply-combine or map-reduce would be a welcome addition to the frames api. For clarity, here's a comment outlining the desiderata:
The titanic dataset provides enough categorical variables to test this. Let's take the above example for Age, pclass and find the standard deviation for survival per group divided by the dataset standard deviation.
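As a rough illustration of the split-apply-combine shape being requested, the following sketch works on plain lists and Data.Map rather than Frames records. The Passenger type, its field names, and relativeStdDev are hypothetical, and the population standard deviation is assumed since the text does not say which variant is meant.

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical row type standing in for a Titanic record.
data Passenger = Passenger
  { pclass   :: Int
  , survived :: Double
  }

-- Population standard deviation; defined as 0 for an empty list.
stdDev :: [Double] -> Double
stdDev [] = 0
stdDev xs = sqrt (sum [(x - mean) ^ (2 :: Int) | x <- xs] / n)
  where
    n    = fromIntegral (length xs)
    mean = sum xs / n

-- Split rows by key, apply stdDev within each group, and combine by
-- dividing each group's value by the whole-dataset standard deviation.
-- If the dataset has zero spread the division yields NaN or Infinity.
relativeStdDev :: Ord k => (r -> k) -> (r -> Double) -> [r] -> Map.Map k Double
relativeStdDev key value rows = Map.map (\xs -> stdDev xs / overall) groups
  where
    groups  = Map.fromListWith (++) [(key r, [value r]) | r <- rows]
    overall = stdDev (map value rows)
```

Grouping by more than one column only requires a tuple-valued key function, for example one returning an age band together with pclass; a composable version would likely express the same steps as folds.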