Open leonardtschora opened 4 years ago
There is no current functionality around cyclic data but adding support is an excellent suggestion. Here are some of the steps involved to add support, as I see them:
1. Add two new scitypes CyclicFactor <: Finite and CyclicContinuous <: Continuous at ScientificTypes.jl (readily done)
2. Decide how such scientific types should be represented to MLJ methods/models (mostly "static" transformers, I expect). This becomes part of the scitype convention implemented at MLJScientificTypes. This requires further discussion; see below.
3. Add transformers for converting cyclic variables into forms consumable by supervised learning models - as you say, these would include conversion to harmonic representations (two or more Fourier components). I'm not familiar with any one-hot encoding method adapted for cyclic variables. Could you provide a bit more detail or reference?
Regarding 2: The trickier case is probably CyclicFactor. Some context: In a scientific type convention, a given machine type (eg, Int64) cannot simultaneously represent two different scitypes (eg, both Count and OrderedFactor). We therefore need to find a representation that is not already in use for one of the other scitypes. In the spirit of OrderedFactor, I suggest we use a "labelled" representation, and of course, all levels should be preserved in vectors under resampling (https://alan-turing-institute.github.io/MLJ.jl/dev/working_with_categorical_data/#Tracking-all-levels-1). The best-case scenario would be for CategoricalArrays to generalise their ordered/unordered dichotomy to ordered/unordered/cyclic. (We already use ordered CategoricalValue for OrderedFactor and unordered CategoricalValue for Multiclass.) So we should see how feasible this would be. I will open an issue.
CyclicContinuous use-cases are probably fewer. Perhaps we need a new type that wraps a complex number with unit modulus?
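To make the unit-modulus idea concrete, here is a minimal sketch in plain Julia. The names `CyclicValue`, `from_period` and `circdist` are hypothetical and are not part of ScientificTypes.jl or MLJ; the sketch only assumes that a cyclic continuous value can be stored as a point on the unit circle.

```julia
# Hypothetical wrapper: a cyclic value stored as a complex number of modulus 1.
struct CyclicValue{T<:AbstractFloat}
    z::Complex{T}
    # The inner constructor normalises its argument, so abs(z) == 1 always holds.
    function CyclicValue(z::Complex{T}) where {T<:AbstractFloat}
        iszero(z) && throw(ArgumentError("cannot normalise zero"))
        return new{T}(z / abs(z))
    end
end

# Build a cyclic value from a raw measurement x on a cycle of length `period`
# (for example, x = 23 and period = 24 for an hour of the day).
from_period(x::Real, period::Real) = CyclicValue(cis(2π * x / period))

# Shortest angular distance between two cyclic values, in radians, in [0, π].
circdist(a::CyclicValue, b::CyclicValue) = abs(angle(a.z * conj(b.z)))
```

With this representation, `circdist(from_period(23, 24), from_period(0, 24))` and `circdist(from_period(12, 24), from_period(13, 24))` agree, which is the wrap-around behaviour an ordinary `Continuous` value does not give.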
Thoughts?
(Minor etymological objection: I would not call things "cyclic ordered factor". From a technical (mathematical) point of view, there is no order on the value of the type, only an order on a covering space. The key property is that the set of values supports a free and transitive action of a finite cyclic group or R - but any name extracted from that is likely to give most data scientists indigestion 😄 )
Thanks for taking this suggestion into consideration.
I don't think I will be able to answer the questions about how to represent cyclical data with ScientificTypes, having discovered the MLJ philosophy only a week ago. However, here is some additional material concerning the encoding of cyclical data, and more particularly circular encoding:
Taken from Machine Learning on Epex Order Books page 7.
There is also this that discusses it.
Let me know if you need more content.
Thanks. This is clear to me.
Hi,
It is my understanding that there is a ContinuousEncoder available that maps OrderedFactor data to a one-hot encoded representation or a Continuous representation, but it does not support Circular Ordered Factor.
Circular Ordered Factor data is data in which the first and last elements have to be considered consecutive.
An example of Circular Ordered Factor data is the hour of the day: it ranges from 0 to 23, but the hours 23 and 0 are as close as the hours 12 and 13, for instance. Hence it would be a mistake to treat them as a regular OrderedFactor (we would lose the cyclical information).
There are two ways to encode such data:
With n the number of distinct elements of the Circular Ordered Factor collection and x a data point to encode.
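As an illustration of the harmonic (circular) representation mentioned in this thread, here is a minimal sketch in plain Julia. It assumes the levels are the integers 0, ..., n-1 and maps each x to the pair (sin(2πx/n), cos(2πx/n)); `circular_encode` and `encoded_dist` are hypothetical helper names, not existing MLJ functions.

```julia
# Hypothetical helper: map a level x in 0:(n-1) to a point on the unit circle.
function circular_encode(x::Real, n::Integer)
    θ = 2π * x / n
    return (sin(θ), cos(θ))
end

# Euclidean distance between two encoded points.
encoded_dist(a, b) = hypot(a[1] - b[1], a[2] - b[2])

# Encode the hours of the day (n = 24).
hours = 0:23
encoded = [circular_encode(h, 24) for h in hours]

# Hours 23 and 0 end up as close together as hours 12 and 13.
d_wrap = encoded_dist(circular_encode(23, 24), circular_encode(0, 24))
d_mid  = encoded_dist(circular_encode(12, 24), circular_encode(13, 24))
```

Each hour becomes two Continuous features, so any model that accepts Continuous input can consume the result without knowing the variable is cyclic.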
Let me know if such a feature is already available and I missed it, or if you would like me to submit a PR adding this functionality. Thanks a lot for your support!