JuliaAI / MLJ.jl

A Julia machine learning framework
https://juliaai.github.io/MLJ.jl/

Circular Ordered Factor #620

Open leonardtschora opened 4 years ago

leonardtschora commented 4 years ago

Hi,

As I understand it, the available ContinuousEncoder maps OrderedFactor data to either a one-hot encoded representation or a Continuous representation, but it does not support circular ordered factors.

Circular Ordered Factor data is data in which the first and last elements must be treated as consecutive.

An example of Circular Ordered Factor data is the hour of the day: it ranges from 0 to 23, but hours 23 and 0 are as close to each other as hours 12 and 13, for instance. Hence it would be a mistake to treat them as a regular OrderedFactor (we would lose the cyclical information).
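To make the hour example concrete, here is a minimal sketch in plain Python (not MLJ code; the helper names `linear_distance` and `circular_distance` are made up for illustration). It compares the distance between two hours when they are treated as a regular ordered factor with the distance measured around a 24-hour cycle.

```python
def linear_distance(a: int, b: int) -> int:
    """Distance when hours are treated as a regular ordered factor."""
    return abs(a - b)


def circular_distance(a: int, b: int, period: int = 24) -> int:
    """Shortest distance between a and b around a cycle of length `period`."""
    d = abs(a - b) % period
    return min(d, period - d)


if __name__ == "__main__":
    for a, b in [(23, 0), (12, 13)]:
        print(a, b, linear_distance(a, b), circular_distance(a, b))
```

Under the linear view, hours 23 and 0 are 23 steps apart; under the circular view they are one step apart, the same as hours 12 and 13.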

There are two ways to encode such data:

Let me know if such a feature is already available and I missed it, or if you would like me to submit a PR adding this functionality. Thanks a lot for your support!

ablaom commented 4 years ago

There is no current functionality around cyclic data but adding support is an excellent suggestion. Here are some of the steps involved to add support, as I see them:

  1. Add two new scitypes CyclicFactor <: Finite and CyclicContinuous <: Continuous at ScientificTypes.jl (readily done)

  2. Decide how such scientific types should be represented to MLJ methods/models (mostly "static" transformers, I expect). This becomes part of the scitype convention implemented at MLJScientificTypes. This requires further discussion; see below.

  3. Add transformers for converting cyclic variables into forms consumable by supervised learning models. As you say, these would include conversion to harmonic representations (two or more Fourier components). I'm not familiar with any one-hot encoding method adapted for cyclic variables. Could you provide a bit more detail or a reference?
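The harmonic representation mentioned in step 3 can be sketched as follows. This is an illustration in plain Python under stated assumptions, not an MLJ transformer: the function name `harmonic_encode` and its parameters `period` and `n_harmonics` are hypothetical. Each value is mapped to cosine/sine pairs, one pair per Fourier component.

```python
import math


def harmonic_encode(value: float, period: float, n_harmonics: int = 1) -> list:
    """Map a cyclic value to [cos, sin] pairs, one pair per harmonic.

    `period` is the length of the cycle (24 for hours of the day).
    The name and signature are assumptions made for this sketch.
    """
    features = []
    for k in range(1, n_harmonics + 1):
        angle = 2.0 * math.pi * k * value / period
        features.append(math.cos(angle))
        features.append(math.sin(angle))
    return features


if __name__ == "__main__":
    # With one harmonic, each hour becomes a point on the unit circle.
    d_wrap = math.dist(harmonic_encode(23, 24), harmonic_encode(0, 24))
    d_mid = math.dist(harmonic_encode(12, 24), harmonic_encode(13, 24))
    print(d_wrap, d_mid)
```

With a single harmonic, neighbouring hours are equally far apart in the encoded space, including the pair 23 and 0 that wraps around the end of the cycle.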

Regarding 2: The trickier case is probably CyclicFactor. Some context: in a scientific type convention, a given machine type (e.g., Int64) cannot simultaneously represent two different scitypes (e.g., both Count and OrderedFactor). We therefore need to find a representation that is not already in use for one of the other scitypes. I suggest we use a "labelled" representation, as we do for OrderedFactor, and of course all levels should be preserved in vectors under resampling (https://alan-turing-institute.github.io/MLJ.jl/dev/working_with_categorical_data/#Tracking-all-levels-1). The best-case scenario would be for CategoricalArrays to generalise their ordered/unordered dichotomy to ordered/unordered/cyclic. (We already use ordered CategoricalValue for OrderedFactor and unordered CategoricalValue for Multiclass.) So we should find out how feasible this would be. I will open an issue.

Use cases for CyclicContinuous are probably fewer. Perhaps we need a new type that wraps a complex number with unit modulus?
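As a rough illustration of the "complex number with unit modulus" idea, here is a minimal Python sketch. The class name `CyclicValue` and its methods are hypothetical and do not correspond to any existing or planned ScientificTypes or MLJ type; the sketch only shows that such a wrapper identifies values that differ by a whole number of periods.

```python
import cmath
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CyclicValue:
    """Hypothetical wrapper around a complex number of unit modulus."""

    z: complex

    @classmethod
    def from_value(cls, value: float, period: float) -> "CyclicValue":
        # Map the value to the point exp(2*pi*i*value/period) on the unit circle.
        return cls(cmath.exp(2j * math.pi * value / period))

    def to_value(self, period: float) -> float:
        # Recover a representative in [0, period) from the angle of z
        # (up to floating-point rounding).
        angle = cmath.phase(self.z) % (2.0 * math.pi)
        return angle * period / (2.0 * math.pi)


if __name__ == "__main__":
    h = CyclicValue.from_value(25.0, 24.0)
    print(abs(h.z), h.to_value(24.0))
```

In this sketch, 25 hours and 1 hour map to (numerically) the same point on the unit circle, so the wrap-around is built into the representation itself.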

Thoughts?

(Minor etymological objection: I would not call these things "cyclic ordered factors". From a technical (mathematical) point of view, there is no order on the values of the type, only an order on a covering space. The key property is that the set of values supports a free and transitive action of a finite cyclic group or R, but any name extracted from that is likely to give most data scientists indigestion 😄 )

leonardtschora commented 4 years ago

Thanks for taking this suggestion into consideration.

I don't think I will be able to answer the questions about how to represent cyclical data with ScientificTypes, having discovered the MLJ philosophy only a week ago. However, here is some additional material on the encoding of cyclical data, and more particularly on circular encoding:

[Image] Taken from "Machine Learning on Epex Order Books", page 7.

There is also this that discusses it.

Let me know if you need more content.

ablaom commented 4 years ago

Thanks. This is clear to me.