JuliaData / CategoricalArrays.jl

Arrays for working with categorical data (both nominal and ordinal)
Other
125 stars 33 forks source link

Add support for "cyclic" categorical variables, such as month of year. #287

Open ablaom opened 4 years ago

ablaom commented 4 years ago

Context: https://github.com/alan-turing-institute/MLJ.jl/issues/620

I wonder if it would be feasible to add to the "ordered" and "unordered" cases a third "cyclic" category. Instead of "<" being defined, we would have new behaviour which allows one to add or subtract an integer from any CategoricalValue. So, roughly, something like this:

v = categorical(["april", "january", "april"], kind=:cyclic)   # or ... `cyclic=true`, with `ordered` being ignored 
levels!(v, ["january", "february", ..., "december"])
julia> v[1] + 13
CategoricalValue{String,UInt32} "may"

Perhaps one would need to worry about two pools being regarded as equivalent if the levels of one is a cyclic permutation of the the other, and so forth.

How hard would this be?

bkamins commented 4 years ago

I think that having this in the ecosystem would be super nice. However, my first reaction is that it would be more natural to have a separate mini-package providing this functionality.

ablaom commented 4 years ago

I think I mustn't understand your statement properly. It seems to me that cyclic categorical arrays would - apart from what I outline above - share all the behaviour of the existing arrays (indexing - a pool of all possible values, and so forth). So, how would it be more natural to have a separate mini package to do this? And how would we reproduce all this behaviour without either making some changes to CategoricalArrays itself, or replicating most of it elsewhere?

bkamins commented 4 years ago

You asked for something like:

v[1] + 13

to work which is something that CategoricalValue supports and I personally do not think it should.

v[1] should be of a different type in my opinion, something like CyclicValue that will clearly signal its nature. Then you can also dispatch on it.

Also - if I understand the use-case correctly - most of the time you will know the pool of values you want to cycle over when creating the container (like week day names, month names etc.) and keep them fixed (which is opposite to what CategoricalArrays.jl does - as it is flexible here - which makes sense for non-cyclic factors - but I feel that for cyclic factors this pool should be kept fixed).

What could be added that we do not have now, is to define AbstactCategoricalValue that then both CategoricalValue and CyclicValue would subtype, as I agree that in some contexts it just matters that it is something that is categorical.

bkamins commented 4 years ago

As an additional comment - such a functionality could be a type that is just a thin wrapper over CategoricalArrays.jl as indeed maybe a lot of machinery can be reused internally.

ablaom commented 4 years ago

Thanks for that. I was hoping to avoid a wrapper.

v[1] should be of a different type in my opinion, something like CyclicValue that will clearly signal its nature. Then you can also dispatch on it.

I completely agree, and I need this "different nature". In fact, by the same argument, we should have different types OrderedValue and PlainValue. (Or , as I proposed a while back, we should have a type parameter which takes one of these values: Ordered, Plain, and now, Cyclic).

I still think having the inconsistency of immutable cyclic pools - versus introducing a wrapper - is the lesser of two evils.

I imagined the main argument against what I am proposing is that it would be difficult to implement without breaking the existing API. We could simply add a new Bool attribute cyclic, and then ignore ordered if cyclic=true but that's a hack. I just wondered if someone else had a better idea (that avoids wrapping).

Side note. From the point-of-view of supervised machine learning (as opposed to data wrangling) I have often lamented the mutability of the pools. Perhaps we can have an ImmutableCategoricalArrays package with a type parameter for cyclic/ordered/plain and a type parameter for the number of levels. If there were 48 hours in a day....

bkamins commented 4 years ago

I feel that I would have just written the same.

However, as you probably guessed my earlier comment was from the perspective of package design and how expensive it would be to add what you ask for to the current design / how hard it would be to maintain it. And my feeling was that it would be hard. However, let us wait for @nalimilan to comment as he might have some better ideas on the implementation side.

I have often lamented the mutability of the pools

While I agree with this my only question is - if we agree to have immutable pools how would you want to handle a situation in which on production a new level pops-up for prediction? (this can be handled easily in dev/validation/test as you know the whole pool you got for model building but in production it is often the case you can get new levels - especially for nominal categorical variables)

nalimilan commented 4 years ago

Interesting. I had never encountered cyclic categorical variables before. Do you have more references/examples, apart from hour/month? How common are they?

One issue I see is that hours/months are not just categorical variables: it's also numeric in the sense that you know the difference between subsequent categories is always 1. Actually you could even treat them as continuous (10.5h is perfectly well defined). Do you also need a type (in another package) to handle continuous cyclic variables?

I think we need to distinguish three issues:

bkamins commented 4 years ago

Do you have more references/examples, apart from hour/month? How common are they?

Of other examples I have been building models for wind direction prediction where it is natural to have an angle that is in range [0, 2pi] and is cyclic.

Of other things - I think these are great comments I will gladly help here, but I guess @nalimilan should have a lead in decisions (as he has best knowledge of the internals 😄).

ablaom commented 4 years ago

Do you have more references/examples, apart from hour/month?

In scientific applications these are potentially all over the place - everywhere you discretise a cyclic variable - so everywhere you have a periodic function of time or an angle of some kind.

Whether we should add more methods specific to cyclic categorical variables. This includes notably v + 1 in your proposal. Is this really necessary for you?

Actually, no, I don't need it. I mention it is because it is very natural and you don't have this for ordered factors (without "overflowing" in a non-meaningful way). For our immediate use-cases, it is probably sufficient just to have a cyclic categorical value distinguishable from the others (through type, but trait will do). Our main use-case is transformers that take a cyclic variable and convert it into two or more continuous features using trigonometric functions (see the issue prompting our request. ) This is a pretty standard operation in time-series analysis (eg temperature) where you expect cyclic variations over a single day (say), but only discrete hours are in the record. We want to allow the user to do things like "trigonometricize" all cyclic features in a table.

If we do that, < and == for pools should be adjusted for these.

We probably do want an appropriate re-implementation of == (so the pool with levels ["jan", "feb", ..., "nov", "dec"] is regarded as the same as ["feb", "mar", ..., "dec", "jan"] . Applying < should throw an error.

Whether the ordered or cyclic properties should be a type parameter, and whether the pools should be immutable. These are orthogonal discussions that I'd rather keep separate (at #184).

I don't think I have anything new to add there, thanks.

ablaom commented 4 years ago

@bkamins

While I agree with this my only question is - if we agree to have immutable pools how would you want to handle a situation in which on production a new level pops-up for prediction?

If new levels appear in an input feature to a supervised learning problem in production then one strategy is to (i) reflag them as "missing", and (ii) apply a pre-learned imputation strategy (replace with the training mode, in the simplest case), and (iii) include both steps as the first part of your overall pipeline (composite model). A few supervised models are specifically designed to deal with missing values, in which case (ii) is unnecessary. But you make a good point here: Under the assumption of immutability, new categorical vectors will need to be created in step (ii) because you must delete the newly appearing levels from the pool (not just replace their occurrences with missing.)

Thankfully, thanks to CategoricalValues, we don't have to worry about new levels appearing in a test/validation set, since the train sets know all the levels!

ablaom commented 4 years ago

Do you also need a type (in another package) to handle continuous cyclic variables?

Yes, we also want this. My initial suggestion is to use complex numbers, but they would have to be wrapped to enforce the constraint that they have unit modulus (lie on the unit circle). This would be very natural because exponentiation would immediately deliver the trigonmetric representation we are after. This might already exist somewhere, I haven't really looked.

bkamins commented 4 years ago

If new levels appear in an input feature to a supervised learning problem in production then one strategy is to

I agree :).

we don't have to worry about new levels appearing in a test/validation set, since the train sets know all the levels!

I do not know the MLJ pipelines in detail, but oftentimes test data set is taken from a different table than train and validation one.

nalimilan commented 4 years ago

OK, feel free to make a PR if you want to add support for marking a CategoricalPool/CategoricalValue/CategoricalArray as cyclic. Probably the best way to implement this is to use an enum instead of the ordered field. Not sure about the public API: should we use something like ordered!(x, CYCLIC) for that, or add another function? Since we have isordered returning a Bool, maybe better add another function.