Define and collect main time series types

florian-huber commented 3 years ago

Time series are everywhere and can cover a lot of different things. To better address this field and communicate our work, it is important to give it some structure.

This issue is also meant for looking at actual time series data and checking what could be relevant. Possible resources are:

florian-huber commented 3 years ago

Possible categories to consider:

univariate (only one time-dependent variable) vs. multivariate (> 1 time-dependent variable)
multivariate could be further divided into: same-type channels (e.g. EEG -> all channels carry a similar type of signal) vs. different-type channels
absolute time (precise position in time matters) vs. relative time (translational invariance, but potentially correlated across channels --> "same time" events or events a particular distance apart) vs. time independent
absolute channel (important in which channel something happens) vs relative channel
local pattern (e.g. specific peak) vs global pattern (frequency, variance, trend etc.)
numerical vs categorical data

So far, the list above contains some redundancies:

different-type channels also imply absolute channel (whereas same-type channels can be either absolute or relative channel)
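
To make the categories easier to discuss, here is a minimal sketch of how they could be written down as a data structure. All class and field names (`TimeSeriesProfile`, `Channels`, `TimeReference`, etc.) are made up for illustration and are not part of any existing code; the check in `__post_init__` only encodes the redundancy noted above, and the example values at the end are purely illustrative.

```python
from dataclasses import dataclass
from enum import Enum


class Channels(Enum):
    UNIVARIATE = "univariate"
    MULTIVARIATE_SAME_TYPE = "multivariate, same-type channels"
    MULTIVARIATE_DIFFERENT_TYPE = "multivariate, different-type channels"


class TimeReference(Enum):
    ABSOLUTE = "absolute time"
    RELATIVE = "relative time"
    INDEPENDENT = "time independent"


class ChannelReference(Enum):
    ABSOLUTE = "absolute channel"
    RELATIVE = "relative channel"


class PatternScale(Enum):
    LOCAL = "local pattern"
    GLOBAL = "global pattern"


class ValueKind(Enum):
    NUMERICAL = "numerical"
    CATEGORICAL = "categorical"


@dataclass(frozen=True)
class TimeSeriesProfile:
    """Hypothetical record holding one value per category from the list."""

    channels: Channels
    time: TimeReference
    channel_reference: ChannelReference
    pattern: PatternScale
    values: ValueKind

    def __post_init__(self):
        # Redundancy from the list: different-type channels imply absolute channel.
        if (
            self.channels is Channels.MULTIVARIATE_DIFFERENT_TYPE
            and self.channel_reference is not ChannelReference.ABSOLUTE
        ):
            raise ValueError("different-type channels imply absolute channel")


# Illustrative values only, not a claim about any particular dataset.
example = TimeSeriesProfile(
    channels=Channels.MULTIVARIATE_DIFFERENT_TYPE,
    time=TimeReference.RELATIVE,
    channel_reference=ChannelReference.ABSOLUTE,
    pattern=PatternScale.LOCAL,
    values=ValueKind.NUMERICAL,
)
```

A structure like this forces exactly one value per category, so cases that "can be both" would need either an extra enum member or several profiles per data type.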

Maybe it is also good to decide that we focus on time series classification. And if we use such categories to assess how well a model classifies time series, we could also think of other properties, e.g.:

number of classes ?
number and/or dimension of samples ?

florian-huber commented 3 years ago

Here is a first attempt at a table of common data types

Data type	Description	Link to example data set	multivariate / univariate	absolute/relative time	same-type/different-type	absolute/relative channel	local/global pattern
EEG	data from electrodes placed on scalp	...	multivariate	can be both	same-type	absolute channel	can be both
Wearable motion-sensor data	accelerometer and gyroscope data	...	multivariate	can be both	different-type	absolute channel?	can be both

florian-huber commented 3 years ago

Here is a first attempt at a table of specific example datasets

Dataset	Description	Link to dataset	Citation	multivariate / univariate	time structure	same-type/different-type	absolute/relative channel	local/global pattern
3W Dataset	Various sensor data to detect rare undesirable real events in oil wells	https://github.com/ricardovvargas/3w_dataset	https://doi.org/10.1016/j.petrol.2019.106223	multivariate	relative time	different-type	absolute channel	local ?
Gas sensors for home activity monitoring	MOX gas sensors, and a temperature and humidity sensor	https://archive.ics.uci.edu/ml/datasets/Gas+sensors+for+home+activity+monitoring	see link	multivariate	?	different-type	absolute channel	?
EEG Steady-State Visual Evoked Potential	EEG data	https://archive.ics.uci.edu/ml/datasets/EEG+Steady-State+Visual+Evoked+Potential+Signals#	see link	multivariate	?	same-type	absolute channel	?
Human Activity Recognition from Continuous Ambient Sensor Data	Various "smart home" sensors	https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+from+Continuous+Ambient+Sensor+Data	see link	multivariate	?	different-type	absolute channel	?
Air Quality Data Set	Various sensor data	https://archive.ics.uci.edu/ml/datasets/Air+Quality	https://www.sciencedirect.com/science/article/abs/pii/S0925400507007691	multivariate	?	different-type	absolute channel	?

jspaaks commented 3 years ago

absolute time and relative time could probably be treated the same, by defining both as relative to a reference event (e.g. the origin of the time axis, an event in another time series, an event internal to the time series, etc.)
the distinction between local pattern and global pattern is arbitrary; it has more to do with how a process is sampled. Probably a more workable paradigm is to have users define the size of certain events with respect to time as well as with respect to what is on the vertical axis. This would also take care of being able to deal with events of a certain duration.
numerical, categorical, etc.: I believe these are referred to as 'scales'. Some other scales are ordinal, nominal, interval, ratio etc.
my feeling is that the properties channels and n_ch should be kept separate from the signal|noise definitions. I'd prefer to:
- define a signal/model/deterministic component for example as "linear model with intercept 4 and slope -0.3", label this for example signal1
- define a second signal/model/deterministic component for example as "linear model with intercept -30 and slope +3.4", label this for example signal2
- define a stochastic signal for example as "time-independent gaussian noise with mean 4.56, std dev 2.3, kurtosis 0, skewness 0", label this for example noise1
- with this interpretation of definitions, you could use names like random_walk, gaussian, etc, like we're doing now with signal_type. Each of these would need to be shorthand for an implementation somewhere (Python or elsewhere), and would take its function parameters from the corresponding yaml definition. This would mean that each of the clauses here https://github.com/epodium/time_series_generator/blob/63923207204bbb09e04ea01d0f6ccf5f7a022842/ts_generator/TS_generator.py#L268-L332 and here https://github.com/epodium/time_series_generator/blob/63923207204bbb09e04ea01d0f6ccf5f7a022842/ts_generator/TS_generator.py#L335-L362 would become an individual function whose parameters are passed by kwargs taken from the yaml.
- then you could use another section of the yaml to state that there are going to be, say, 5 channels, and define which channel has which combination of signal and noise using the labels. We would need to find a way to do definition expansion eventually, something like what we now have with channels: [1, 2, 3]. Perhaps this section of the yaml could be named composition. Or just channels.
- this will likely mean that the signal_def and noise_def can be merged into definitions. We could optionally introduce a key stochastic: bool for each definition if we need to differentiate between these 2 types of model, not sure yet.
- I need to think more on how well this all fits multivariate problems whose constituent time series are not independent of each other
Do we have any vocabulary related to sampling, e.g. equally spaced, burst, exponential backoff, state-dependent etc.?
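
To make the definitions/composition idea above a bit more concrete, here is a rough sketch under several assumptions. The `config` dict stands in for what the yaml might look like after parsing; its layout (the keys `definitions`, `type`, `stochastic`, `params`, `components`) is hypothetical, as are the function names `linear`, `gaussian` and `build_channels`. Components are simply summed per channel, which is only one possible way to combine them, and the channels are treated as independent of each other.

```python
import random


def linear(n_samples, intercept, slope):
    """Deterministic component: intercept + slope * t for t = 0..n_samples-1."""
    return [intercept + slope * t for t in range(n_samples)]


def gaussian(n_samples, mean, std, seed=None):
    """Stochastic component: time-independent Gaussian noise.

    A Gaussian has zero skewness and zero excess kurtosis, so those two
    values do not need to be parameters here.
    """
    rng = random.Random(seed)
    return [rng.gauss(mean, std) for _ in range(n_samples)]


# Each name used under "type" is shorthand for one implementation.
GENERATORS = {"linear": linear, "gaussian": gaussian}

# Hypothetical structure, as it might look after loading the yaml.
config = {
    "definitions": {
        "signal1": {"type": "linear", "stochastic": False,
                    "params": {"intercept": 4, "slope": -0.3}},
        "signal2": {"type": "linear", "stochastic": False,
                    "params": {"intercept": -30, "slope": 3.4}},
        "noise1": {"type": "gaussian", "stochastic": True,
                   "params": {"mean": 4.56, "std": 2.3}},
    },
    # Definition expansion: one entry covers several channels.
    "channels": [
        {"channels": [1, 2, 3], "components": ["signal1", "noise1"]},
        {"channels": [4, 5], "components": ["signal2", "noise1"]},
    ],
}


def build_channels(config, n_samples):
    """Return {channel number: samples}, summing the labelled components."""
    definitions = config["definitions"]
    result = {}
    for entry in config["channels"]:
        for channel in entry["channels"]:
            total = [0.0] * n_samples
            for label in entry["components"]:
                definition = definitions[label]
                generator = GENERATORS[definition["type"]]
                # Function parameters come straight from the definition.
                component = generator(n_samples, **definition["params"])
                total = [a + b for a, b in zip(total, component)]
            result[channel] = total
    return result


data = build_channels(config, n_samples=100)
```

Because `noise1` has no `seed` in its `params`, every channel gets its own noise realisation; adding a fixed `seed` there would give all channels identical noise, which is probably not what we want.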

epodium / time_series_generator

Define and collect main time series types #5
