JuliaStats / TimeSeries.jl

Time series toolkit for Julia

Package functionality review #482

Open sairus7 opened 3 years ago

sairus7 commented 3 years ago

Hi! I have some experience working with time series (from medical sensors), and I was thinking of using TimeSeries.jl for my projects. Below is a review of this package, outlining choices that look strange to me, at least judging from the docs, along with proposals from my point of view. Maybe the authors will find it helpful.


I. AbstractTimeSeries is absent from the docs. Is it some kind of common interface for different timeseries types? If so, please add an example showing which methods I should implement to support a custom timeseries type.


II. Heterogenous series (tables) are dropped, from docs:

"All the values inside the values array must be of the same type."

This is a huge limitation, if one needs timeseries with complex information, stored as vector of structures, or a namedtuple of columns of different types (see StructArrays.jl).

Maybe there should be a different TimeTable type with heterogenous columns (similar to DataFrame), and TimeArray for a single column type, sharing the same timestamps from parent table?

More than that, individual columns can be a custom AbstractVector with some metadata for exotic element types. For example, if elements are encoded and metadata is needed to decode them on getindex:


III. There is no separate implementation for timeseries with a regular sample rate, which could be constrained to operations that produce uniform sampling (similar to SampledSignals.jl). Such a type does not need to store a materialized timestamps vector at all, since time can be calculated from index, startdate and samplerate (I call this a "time grid", which provides an index2time and time2index pair of functions). The timeseries remains uniform unless you take irregular / arbitrary samples from it; the result is then converted to a common (non-uniform) timeseries with a timestamps vector in it.
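The "time grid" idea can be sketched as follows. This is a minimal illustration, not code from TimeSeries.jl or SampledSignals.jl: the `RegularGrid` type and its fields are hypothetical, and the sample rate is assumed to divide 1000 evenly so that steps are whole milliseconds.

```julia
using Dates

# Hypothetical regular time grid: only the start and the sample rate are stored,
# no timestamps vector is materialized.
struct RegularGrid
    start::DateTime
    samplerate::Int   # samples per second; assumed to divide 1000 evenly
end

# Step between two samples in whole milliseconds.
stepms(g::RegularGrid) = 1000 ÷ g.samplerate

# 1-based index -> timestamp.
index2time(g::RegularGrid, i::Integer) = g.start + Millisecond((i - 1) * stepms(g))

# Timestamp -> 1-based index of the sample at or just before `t`.
time2index(g::RegularGrid, t::DateTime) = fld(Dates.value(t - g.start), stepms(g)) + 1

g = RegularGrid(DateTime(2020, 1, 1), 250)   # 250 Hz, i.e. a 4 ms step
t = index2time(g, 251)                       # one second after the start
i = time2index(g, t)                         # round trip back to the index
```

Timestamps that fall between two grid points map to the earlier sample here; rounding to the nearest sample would be an equally reasonable convention.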


IV. There are no timeseries with several timestamp columns. In my practice, I always have three different timeseries types: 1) series - regularly sampled timeseries 2) events - irregularly sampled timeseries with one timestamp value 3) segments - irregularly sampled timeseries with two timestamp values (start, stop) for elements that have some extent in time.

There are several special cases for (3) with regard to indexing (what to do if I request time point inside the segment or time interval that partially overlap with segments on edges).
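The special cases for segments can be made concrete with a small sketch. The `Segment` type and the three predicates are hypothetical names chosen for illustration; closed intervals on both ends are an assumption.

```julia
using Dates

# Hypothetical segment: an element with extent in time, closed on both ends.
struct Segment
    start::DateTime
    stop::DateTime
end

# Does the segment contain the time point `t`?
contains_time(s::Segment, t::DateTime) = s.start <= t <= s.stop

# Does the segment overlap the query interval [a, b], even partially?
overlaps(s::Segment, a::DateTime, b::DateTime) = s.start <= b && a <= s.stop

# Is the segment fully inside the query interval [a, b]?
inside(s::Segment, a::DateTime, b::DateTime) = a <= s.start && s.stop <= b

t0 = DateTime(2020, 1, 1)
segs = [Segment(t0 + Second(0),  t0 + Second(10)),
        Segment(t0 + Second(20), t0 + Second(30)),
        Segment(t0 + Second(40), t0 + Second(50))]

a, b = t0 + Second(5), t0 + Second(35)
partial = findall(s -> overlaps(s, a, b), segs)   # segments touching [a, b]
strict  = findall(s -> inside(s, a, b), segs)     # segments fully inside [a, b]
```

Whether an interval query should return `partial` or `strict` (or clip the edge segments) is exactly the design decision raised above.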

Maybe there can be even more exotic (or common) timeseries with more than two timestamps (each row is itself a repetition of some complex process in time with many "phases"), where you should explicitly choose which timestamp column you want to index by. But I would not complicate it that far.


V. Row indexing. You can index rows by:

What is missing:


VI. Splitting by condition section has two different sets of functions:


VII. Maybe there should be some convention between functions that take and return timeseries, and functions that return standard vector types:

Also, there may be some methods to toggle between the timeseries type and the underlying Table type, or a standard array / vector of tuples. This is similar to Tables.columntable (from Tables.jl), which is used with DataFrames to toggle between type-stable and compile-friendly cases.


VIII. Operation on single columns - or whole timeseries

only calculated on values that share a timestamp.

This is a very tricky part, because there is an implicit inner join, and all columns should be of the same numeric type. So maybe it should be applied only to a single column, or a single column could be modified this way in place? This is also about heterogeneity, as in section II above.


IX. Combine methods


X. Customize TimeArray printing

Can I choose the time string format to show, or is it chosen automatically, and if so, based on what? It would be nice to have examples for high-frequency timestamps in units of milliseconds.

iblislin commented 3 years ago

Hi @sairus7. I appreciate you sharing this valuable review. I agree with the main concepts and points from I ~ V, VII, IX, X.

There are many design decisions whose original rationale I'm not sure about (I guess most of them were made for financial time series). It's time to get this package overhauled.

Let's work out each issue one-by-one,

I. The AbstractTimeSeries. I will divide this issue into two small pieces: a. The type parameter of AbstractTimeSeries. b. The function list of AbstractTimeSeries. We can list the functions first, then define signatures later, and consider which methods from Tables.jl/TableOperations.jl need to be included.

What's your idea about the first issue? I'm studying Tables/TableOperations, I will post my blueprint here later.

iblislin commented 3 years ago

Here is my proposal:

I. AbstractTimeSeries Design

   a. The type parameter: AbstractTimeSeries{T}, where T denotes the type of the time index.
      a.1 If T <: Tuple, for instance Tuple{Date,Time} <: Tuple, multiple columns form the time index.
      a.2 If !(T <: Tuple), there is only one column as the time index.

   b. The interface stuff.
      b.1 Tables.jl integration

| Function | Support | Comment |
| --- | --- | --- |
| Tables.istable | :heavy_check_mark: | AbstractTimeSeries returns true by default. |
| Tables.columnaccess | :heavy_check_mark: | true by default. |
| Tables.columns | :heavy_check_mark: | Just returns the AbstractTimeSeries object by default. |
| Tables.rowaccess | :heavy_check_mark: | true by default. |
| Tables.rows | :heavy_check_mark: | Maybe we need a type to represent a row. What is the common terminology for a row of a time series? |
| Tables.schema | :heavy_check_mark: | Returns the schema. |
| Tables.materializer | :heavy_check_mark: | Returns the constructor of the concrete type. |
| Tables.getcolumn(table, i::Int) | :heavy_check_mark: | Returns the value vector. Setting i = 0 returns the time index vector. |
| Tables.getcolumn(table, nm::Symbol) | :heavy_check_mark: | Returns the value vector. Setting nm = :timestamp returns the time index vector. |
| Tables.columnnames(table) | :heavy_check_mark: | Returns the column names, with :timestamp as a reserved name. Edit: I'm not sure whether a fixed name for the index is a good idea; maybe we can ask the user to name the index. |
| Tables.getcolumn(table, ::Type{T}, i::Int, nm::Symbol) | Optional | |
| Tables.partitions | TBD | Not sure about the use cases of this function. |
| Tables.partitioner | TBD | |

     b.2 TableOperations.jl integration

| Function | Support | Comment |
| --- | --- | --- |
| TableOperations.select | :heavy_check_mark: | I think there is nothing to do in our pkg once the Tables.jl integration is done. Just write some test cases to check correctness. |
| TableOperations.transform | :heavy_check_mark: | Test cases only, like the select function. |
| TableOperations.filter | :heavy_check_mark: | Test cases only. |
| TableOperations.map | :heavy_check_mark: | Test cases only. |

@sairus7 could you review these ideas? and I can make a PR for this proposal.

sairus7 commented 3 years ago

@iblis17 I.a.

AbstractTimeSeries{T} where T denotes the type of time index

Should there be any other compile-time information, except time type? What are other specific methods and specific properties with this type, that are not covered by table interfaces?

a.1, a.2

multiple columns forms the time index

Do you mean that there can be two columns with different precision for time, like this link says https://docs.julialang.org/en/v1/stdlib/Dates/#Dates.Date ?

I think, it can be solved with "batching", where you have one timestamp type for a whole bunch of records (table of tables, or partitions), and a detailed relative timestamp for every element within that batch. So, I am not sure we should add it for a single table level. (But we can add reference time as a metadata field for a whole table object.)

As for the other idea of having several timestamps (like a time interval t1:t2) for each row, I think it is similar to sparse arrays for highly duplicated data (since all rows between t1:t2 have the same values), and we should think about it after defining the main functionality.


I.b About interface stuff.

I think we should first outline the main difference between TimeSeries and table-data packages like DataFrames.jl or IndexedTables.jl; otherwise there is no need for TimeSeries.jl itself, especially if we decide to support heterogeneous column types and stick to the table interface. Or, if there are only minor differences, we can just rewrite TimeSeries.jl as a thin wrapper around those packages.

From my point of view, timeseries differ from simple arrays, because they have timestamps, and some specific operations. I think of timestamps like "position in time", or a timeindex, in addition to integer element index.

So, here is a list of questions, that should be answered first:

1) Heterogeneous column types - yes or no? If we treat timeseries as any data that is bound to a timestamp, then we should allow columns of different types. On the other hand, with a matrix we get convenient matrix operations over homogeneous channel groups. Although we can always wrap a whole matrix as a single column (at the cost of not being able to operate on individual columns within the matrix).

2) Static or dynamic columns? Can we add columns to an existing timeseries, or column number and types (schema) are known in parametric types at compile-time, like IndexedTables.jl does?

3) Tabular operations (link1, link2), what types do they return? For example, should getindex return a row object, or just a tuple? Should select return a single-column table, or some column object, or a vector?

4) Should we always store timestamps in sorted order? So, if we insert a new row, or change a timestamp, we should always re-sort rows? Should they be unique or not? Should we restrict join only by timestamp?

5) Should timestamps be always part of table data, or not? Like, should it always be the first column? Should it ever be in column list, or a "hidden" column like index, or both options?

6) Should we always add timestamp to any return type (subtable / row / column), even if it is not queried from table? What is a row, timestamp + row data? What is a column, timestamp + column data?

7) What if a table has many time columns, can we switch which columns to use as timestamps?

sairus7 commented 3 years ago

Maybe @bkamins can give us his opinion on this?

bkamins commented 3 years ago

Thank you for working on this. As there are many aspects of the issue (and probably I do not grasp everything you discussed) I would start with the question what is the main use case for TimeSeries.jl? and then design against this use case.

E.g. DataFrames.jl design objective is to be maximally flexible, possibly at the cost of performance (when used correctly it is fast though), i.e. to be used when no more specialized package exists.

I would assume that TimeSeries.jl would be a more specialized package which would provide functions that may be only made available if we have a notion of time index. There are many such use cases, that are currently hard with DataFrames.jl, e.g.:

  1. lag by a period of time
  2. interpolate
  3. aggregate by some time periods
  4. smooth data
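Two of the operations in this list can be sketched with nothing but the `Dates` standard library. The helper names `hourly_mean` and `lag_by` are made up for this illustration and are not part of TimeSeries.jl or DataFrames.jl.

```julia
using Dates

timestamps = [DateTime(2020, 1, 1) + Minute(20 * k) for k in 0:5]
vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# Aggregate by hour: bucket each timestamp with `floor`, then average per bucket.
function hourly_mean(ts::Vector{DateTime}, vs::Vector{Float64})
    sums = Dict{DateTime,Float64}()
    counts = Dict{DateTime,Int}()
    for (t, v) in zip(ts, vs)
        bucket = floor(t, Hour(1))
        sums[bucket] = get(sums, bucket, 0.0) + v
        counts[bucket] = get(counts, bucket, 0) + 1
    end
    buckets = sort!(collect(keys(sums)))
    return buckets, [sums[b] / counts[b] for b in buckets]
end

# Lag by a period of time: shift the time index, leave the values untouched.
lag_by(ts::Vector{DateTime}, p::Period) = ts .+ p

buckets, means = hourly_mean(timestamps, vals)
lagged = lag_by(timestamps, Hour(1))
```

Both helpers only make sense because the data carries a time index, which is the point made above about what a specialized package can offer.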

So in summary. The question is: what features TimeSeries.jl should provide so that a user would want to switch from DataFrames.jl to TimeSeries.jl for some specific task. This will probably mean that TimeSeries.jl should provide by default more restrictions than DataFrames.jl at the benefit of doing things better (as currently you can do anything with DataFrames.jl but not always fast or conveniently).

Below I say what I would find intuitive in answers to the 7 points you put:

Heterogenous column types - yes or not? If we treat timeseries as any data, that is bound to a timestamp, then we should add columns of different types. On the other side, using matrix we have convenient matrix operations over homogenous channel groups. Although we can always wrap a whole matrix as a single column (at the cost of not operate with individual columns within matrix).

I do not think it would be super useful (though sometimes it might be useful). So if you see big benefits of having homogeneous type I would go for homogeneous choice.

However, my intuition is that there will not be big benefits.

Static or dynamic columns? Can we add columns to an existing timeseries, or column number and types (schema) are known in parametric types at compile-time, like IndexedTables.jl does?

If you go for homogeneous type I think for sure dynamic is better (as you have type inference for free).

However, in general, although DataFrames.jl is not type stable, you can mostly switch to a "type-stable" mode easily.

Actually I think that the crucial thing is if you want to allow to add rows to time series in place. I assume you want it (which e.g. means that you cannot use Matrix as internal representation).

Tabular operations (link1, link2), what types they do return? For example, should getindex return a row object, or just a tuple? Should select return a single-column table, or some column object, or a vector?

Do you see any uses for such a row-object? (like taking advantage of the fact that it would know its time stamp). If yes, I think it is not a problem to have a custom type for it. Also, do you want the row-object to be a view (like DataFrameRow) or a copy (which e.g. a Tuple would be)?

Should we always store timestamps in sorted order? So, if we insert a new row, or change a timestamp, we should always re-sort rows? Should they be unique or not? Should we restrict join only by timestamp?

I would find keeping them sorted intuitive. In what cases would you want to allow to change a timestamp? I would feel that it should be immutable? Also I feel timestamps should be unique and that join should be performed only on timestamp (at least by default).

Should timestamps be always part of table data, or not? Like, should it always be the first column? Should it ever be in column list, or a "hidden" column like index, or both options?

I would normally think that it should be a "hidden" column like index.

Should we always add timestamp to any return type (subtable / row / column), even if it is not queried from table? What is a row, timestamp + row data? What is a column, timestamp + column data?

For me timestamp would be an index only

What if a table have many time columns, can we switch, what columns to use as timestamps?

I think it should create a new table.


Please treat these comments as loose first impressions of course.

iblislin commented 3 years ago

I.a.

Should there be any other compile-time information, except time type?

I think for AbstractTimeSeries it is the time type only, and for the subtypes we can add other type parameters if needed. So the time type is the minimal requirement.

What are other specific methods and specific properties with this type, that are not covered by table interfaces?

Ah, right. I try to list some here, and we can complete the list later, once we decide on some key design points.

| Function | Return type | Comment |
| --- | --- | --- |
| length(::AbstractTimeSeries) | Int | |
| ndims(::AbstractTimeSeries) | Int | |
| size(::AbstractTimeSeries, ::Int) | Int | |
| axes(::AbstractTimeSeries) | Int | |
| axes(::AbstractTimeSeries, ::Int) | Int | |
| copy(::AbstractTimeSeries) | AbstractTimeSeries | |
| deepcopy(::AbstractTimeSeries) | AbstractTimeSeries | |
| similar(::AbstractTimeSeries) | AbstractTimeSeries | |
| names(::AbstractTimeSeries) | Vector{Symbol} | |
| rename(::AbstractTimeSeries, ::Pair...) | AbstractTimeSeries | |
| rename!(::AbstractTimeSeries, ::Pair...) | AbstractTimeSeries | |
| vcat(::AbstractTimeSeries, ::AbstractTimeSeries) | AbstractTimeSeries | |
| hcat(::AbstractTimeSeries, ::AbstractTimeSeries) | AbstractTimeSeries | |
| hvcat(::Tuple{Vararg{Int}}, ::AbstractTimeSeries...) | AbstractTimeSeries | |
| view(::AbstractTimeSeries, dims...) | AbstractTimeSeries | It seems that we need to implement a Sub- type for each concrete type, like SubArray does. |
| first(::AbstractTimeSeries) | (TBD) | Maybe an AbstractTimeSeriesRow? |
| last(::AbstractTimeSeries) | (TBD) | |
| stack and unstack | | Are there any real use cases for time series data? |
| join(::AbstractTimeSeries, ::AbstractTimeSeries) family | AbstractTimeSeries | |
| select(::AbstractTimeSeries, args...), select!, transform and transform! | AbstractTimeSeries | :question: Users might need to create a new column from two (or more) original columns, so the input type of the custom callable f is critical. There are two designs: (1) timestamp + value (2) value only. I can easily find a real case that uses (2): calculating a ratio of two columns. In this case the timestamp is useless for f, the user just writes something like select(ats, [:a, :b] => /) and (2) works perfectly. Are there any cases where we need to adopt (1)? |
| filter(::Callable, ::AbstractTimeSeries) and filter! | AbstractTimeSeries | :question: The same story applies to filter and map: the issue of the input type, (1) timestamp + value (2) value only. If we choose (2), tons of functions from Base or other packages are supported for free (like isnothing, iszero, etc.). Maybe we need to find cases that need (1) and investigate them. In my personal use, (2) is quite common: I usually need to get rid of NaN, Inf or 0. |
| map(::Callable, ::AbstractTimeSeries) | | :question: I think this functionality is covered by select and transform, so we don't need it. |
| moving, reduce, foldl and foldr | AbstractTimeSeries | :question: We can handle the case of iterating over row values with select or similar. The other dimension is calculating against a whole column (or a subset of it). A classic example is the running mean, for which the input type will be (2). Also, lag, lead and diff only need input type (2) to work. :question: 2: the naming of this function. IIUC, pandas names it rolling? |
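Design (2), where the callable only sees values, can be illustrated on plain vectors. This is a sketch under that assumption, not the TimeSeries.jl API; `moving_mean` and `keep` are hypothetical helpers, and in a real time series the timestamps would simply be carried along.

```julia
# Ratio of two columns: the timestamps play no role in the computation.
a = [2.0, 4.0, 6.0, 8.0]
b = [1.0, 2.0, 3.0, 4.0]
ratio = a ./ b

# Running mean over a window of `n` values; the result is shorter by n - 1,
# so the matching timestamps would be ts[n:end].
moving_mean(v::AbstractVector{<:Real}, n::Integer) =
    [sum(@view v[i-n+1:i]) / n for i in n:length(v)]

smoothed = moving_mean(a, 2)

# Value-only predicates compose directly with functions from Base.
keep(x) = isfinite(x) && !iszero(x)
cleaned = filter(keep, [1.0, NaN, 0.0, Inf, 2.5])
```

None of these callables needed a timestamp, which supports the view that (2) is a reasonable default.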

a.1, a.2

multiple columns forms the time index

Do you mean that there can be two columns with different precision for time, like this link says https://docs.julialang.org/en/v1/stdlib/Dates/#Dates.Date ?

I want to cover both of these cases (different precision, and multiple timestamps as an interval) in the type parameter design. And I think these two cases can be distinguished without problems: Tuple{Date,Time} vs Tuple{DateTime,DateTime}, for example.
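How dispatch could tell these cases apart is sketched below. The abstract type is redefined locally so that the block is self-contained; `SimpleSeries` and `index_kind` are hypothetical names, and treating any `Tuple{T,T}` of equal time types as an interval is an assumption of this sketch.

```julia
using Dates

abstract type AbstractTimeSeries{T} end

# Hypothetical concrete type: a time index plus one value vector.
struct SimpleSeries{T,V} <: AbstractTimeSeries{T}
    index::Vector{T}
    vals::Vector{V}
end

# Trait-like helper that dispatches on the time-index parameter.
index_kind(::AbstractTimeSeries) = :single_column
index_kind(::AbstractTimeSeries{<:Tuple}) = :multi_column
index_kind(::AbstractTimeSeries{Tuple{Date,Time}}) = :date_and_time
index_kind(::AbstractTimeSeries{Tuple{T,T}}) where {T<:TimeType} = :interval

s1 = SimpleSeries([Date(2020, 1, 1)], [1.0])
s2 = SimpleSeries([(Date(2020, 1, 1), Time(9, 30))], [1.0])
s3 = SimpleSeries([(DateTime(2020, 1, 1), DateTime(2020, 1, 2))], [1.0])

kinds = (index_kind(s1), index_kind(s2), index_kind(s3))
```

The methods are ordered from most general to most specific only for readability; Julia picks the most specific applicable method regardless of definition order.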


I.b Interface

I think, first, we should outline, what is the main difference between TimeSeries and table-data packages, like DataFrames.jl or IndexedTables.jl, otherwise there is no need in TimeSeries.jl itself. Expecially, if we decide to support heterogenous column types and stick to table interface. Or, if there are some minor differences, we can just rewrite TimeSeries.jl as a thin wrapper around those packages.

This is a hard question. The time index breaks all the rules, which I think makes wrapping around those pkgs unprofitable. So in the beginning, I prefer not to depend on them. I am keeping an open mind on this issue: after we have explored enough use cases, maybe we can leverage those pkgs for some of them.

  1. Heterogenous column types - yes or not? If we treat timeseries as any data, that is bound to a timestamp, then we should add columns of different types. On the other side, using matrix we have convenient matrix operations over homogenous channel groups. Although we can always wrap a whole matrix as a single column (at the cost of not operate with individual columns within matrix).

  2. Static or dynamic columns? Can we add columns to an existing timeseries, or column number and types (schema) are known in parametric types at compile-time, like IndexedTables.jl does?

  3. Should we always store timestamps in sorted order? So, if we insert a new row, or change a timestamp, we should always re-sort rows? Should they be unique or not? Should we restrict join only by timestamp?

Well, in short, my answer is that we can implement all styles if needed. There are 8 combinations of these properties:

| Combination | Unsorted :green_circle: / Sorted :red_circle: | Hetero :green_circle: / Homo :red_circle: | Dynamic :green_circle: / Static :red_circle: | Comment |
| --- | --- | --- | --- | --- |
| 1 | :green_circle: | :green_circle: | :green_circle: | I think this is fulfilled by DataFrames. It's the most flexible data structure. I think we don't need to create another DataFrame. |
| 2 | :red_circle: | :green_circle: | :green_circle: | Since the timestamps are sorted, we can provide search/filter or more operations on them with better performance. And it seems we have an urgent need for it. |
| 3 | :green_circle: | :red_circle: | :green_circle: | |
| 4 | :red_circle: | :red_circle: | :green_circle: | |
| 5 | :green_circle: | :green_circle: | :red_circle: | Is this just IndexedTable? |
| 6 | :red_circle: | :green_circle: | :red_circle: | |
| 7 | :green_circle: | :red_circle: | :red_circle: | |
| 8 | :red_circle: | :red_circle: | :red_circle: | Actually, this is the current struct TimeArray, we already implement it :joy_cat: |

I cannot find cases where users need to manipulate an unsorted time series, so only the sorted combinations 2, 4, 6 and 8 are kept, and I think combination 6 won't have enough performance benefit. Combinations 4 and 8 might have a benefit for row operations if the underlying structure is a Matrix, but this claim needs evidence from real use cases.

So, I vote for combination 2 as top priority, then implementing combinations 4 and 8 if we still have enough energy.
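The performance argument for sorted timestamps comes down to binary search. A minimal sketch using `searchsortedfirst` and `searchsortedlast` from Base; the function name `timerange` is made up for this example.

```julia
using Dates

ts = [DateTime(2020, 1, 1) + Hour(h) for h in 0:23]   # sorted timestamps
vals = collect(1:24)

# Rows with a <= timestamp <= b, found with two binary searches
# instead of a linear scan over all rows.
function timerange(ts::AbstractVector{DateTime}, a::DateTime, b::DateTime)
    lo = searchsortedfirst(ts, a)   # first index with ts[i] >= a
    hi = searchsortedlast(ts, b)    # last index with ts[i] <= b
    return lo:hi
end

r = timerange(ts, DateTime(2020, 1, 1, 6), DateTime(2020, 1, 1, 9, 30))
subset = vals[r]
```

An unsorted container would have to scan every timestamp for the same query.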

  1. Tabular operations (link1, link2), what types they do return? For example, should getindex return a row object, or just a tuple? Should select return a single-column table, or some column object, or a vector?

I managed to list them in the table of part a. For getindex, if the user indexes a single row, a row object will be better, since the column information is useful. If the user indexes a range of rows, just return a time series object. For select, transform or filter, I think they should return an object of the same type as the input.

  1. Should we always store timestamps in sorted order? So, if we insert a new row, or change a timestamp, we should always re-sort rows? Should they be unique or not? Should we restrict join only by timestamp?

I want timestamps sorted all the time. The timestamps don't need to be unique, and the order between records that share the same timestamp is defined by the user. We should make sure the functions provided by this pkg do not change that relative order. I have some sensor-generated data that share the same timestamp, since the timestamp precision isn't enough.
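Keeping timestamps sorted while preserving the user-defined order of duplicates could look like this. `insert_row!` is a hypothetical helper operating on two plain vectors, not an existing TimeSeries.jl function.

```julia
using Dates

# Insert a row so that timestamps stay sorted and rows sharing a timestamp
# keep their arrival order (the new row goes after existing duplicates).
function insert_row!(ts::Vector{DateTime}, vals::Vector, t::DateTime, v)
    i = searchsortedlast(ts, t) + 1   # just after the last row with timestamp <= t
    insert!(ts, i, t)
    insert!(vals, i, v)
    return i
end

ts = [DateTime(2020, 1, 1, 10), DateTime(2020, 1, 1, 10), DateTime(2020, 1, 1, 11)]
vals = ["foo", "bar", "baz"]
pos = insert_row!(ts, vals, DateTime(2020, 1, 1, 10), "qwe")

# All rows that share one timestamp: `searchsorted` returns a (possibly empty) range.
same = searchsorted(ts, DateTime(2020, 1, 1, 10))
```

Because a single time value can match a whole range of rows, a query by timestamp naturally returns a sub-series rather than one row.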

About join, I want the time index to be required, with optional non-index columns also accepted. I do have two datasets that need to be joined on date and username. (I did it in DataFrames and converted the result to a TimeArray later.)

  1. Should timestamps be always part of table data, or not? Like, should it always be the first column? Should it ever be in column list, or a "hidden" column like index, or both options?

I think a hidden column is fine for me. I want that user can always set/switch the index. Once set, the column will be the first column in the presentation (via print) and the Tables.jl integration will set that index as first column.

  1. Should we always add timestamp to any return type (subtable / row / column), even if it is not queried from table? What is a row, timestamp + row data? What is a column, timestamp + column data?

Well, this is quite a complex question, since I encountered both situations in a single project: (1) I want the raw row value without timestamp, so I can feed it into functions from Base; (2) I want row value + column info + timestamp, so I can easily get a specific column. I think this is the same dilemma as the one described for select in the table of part a: the problem of the input type of a custom function is the same as your "return type problem".

I don't have an elegant approach at this moment. I am still thinking about it. I write down some ideas here.

(i) Determine which is the common use case, make the common case as default. Then provide a variant function to support another. For example: filter for raw row value as input type, filter_ts for feeding row object
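Option (i) might look like the following sketch. The `Series` container is hypothetical, and the `filter_ts` predicate here receives `(timestamp, value)` as a simple stand-in for a richer row object.

```julia
using Dates

# Hypothetical minimal container, for illustration only.
struct Series{V}
    timestamps::Vector{DateTime}
    vals::Vector{V}
end

# Default variant: the predicate sees the raw value only,
# so functions from Base work unchanged.
function Base.filter(f, s::Series)
    keep = map(f, s.vals)
    return Series(s.timestamps[keep], s.vals[keep])
end

# Variant: the predicate also sees the timestamp.
function filter_ts(f, s::Series)
    keep = [f(t, v) for (t, v) in zip(s.timestamps, s.vals)]
    return Series(s.timestamps[keep], s.vals[keep])
end

s = Series([DateTime(2020, 1, 1) + Day(d) for d in 0:3], [1.0, NaN, 0.0, 4.0])
clean = filter(!isnan, s)
late = filter_ts((t, v) -> t >= DateTime(2020, 1, 3) && !iszero(v), s)
```

Both variants return the same container type as the input, in line with the return-type convention discussed above.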

(ii) Always return the row object. Then provide a convert/vec to make it turn into plain Array, but I suspect this will hurt the performance?

  1. What if a table have many time columns, can we switch, what columns to use as timestamps?

Yes, the type with dynamic columns should support this functionality.

sairus7 commented 3 years ago

I think we should divide our methods into three distinct parts, with increasing functionality: A. Only timestamps vector with no data bound to it. B. Timestamps + data vector (a single column). C. Operations on tables.

Here are some (incomplete) considerations about these three parts:


A. Timestamps

I agree that a timeindex differs from an integer index. One of the key differences is that it has a global "address space". An integer index exists only within a specific collection, and can be changed or dropped when querying a subcollection or element. But a timeindex refers to some "address in time", not to a certain collection, and cannot be dropped by default.

Also, while index is integer, timeindex is continuous, and it can have different precision levels (days, minutes and so on) with different rounding and comparison behaviour between two time values which have different precision.

A.1 Time types

What time types can be used:
a) DateTime instance - most accurate
b) Period - for time intervals, or time relative to some reference point
c) Sum of two different time Periods can produce another type, CompoundPeriod
d) a Number of seconds/minutes/etc…
e) a Unitful.jl time unit - is used in AxisArrays.jl, not quite sure if it works with built-in types and if we have to do something to support it.
f) Two values of above types?

Possible operations:

A.2. Vector of timestamps, some kind of "point process".

I agree that timestamps should be sorted and not unique. But if they can repeat, we can get several elements instead of one, when requesting only one time value. Also, should join repeat rows if it joins one row of table 1 with several rows from table 2?

It can be:
1) Arbitrary
2) Discrete (ADC integrates some continuous amplitude within fixed timesteps)
2a) Regular (with possible missing intervals)
2b) Irregular

For discrete case, should we think that each timestamp has non-zero length equal to timestep?

I have some draft examples of how I use a combination of types from A.1 (a), (b), (d) as timestamps, and transformations between them: https://gist.github.com/sairus7/7a3f2ea6d3e0c34b4ea973d3b80105e8 Here I work both with absolute and relative times, but they are stored as relative indices. And for some operations I want to retrieve the index itself, not a time type. There is a series of transforms in both directions: integer index <-> number of milliseconds <-> Period <-> DateTime (sum of period and a starttime).

Possible operations:


B. Time vector / column

Here we have just two synced vectors - timestamp vector from A.2., and data vector. I will call it here a column.

Operations:


C. Table - seems like this is just a set of columns from B, which have the same timestamps vector?

Regarding the table of combinations: IndexedTable has a sorted primary key, so it fits option 6 too. Also, we missed AxisArrays.jl for option 8 (https://github.com/mbauman/Signals.jl was deprecated in favor of AxisArrays.jl); it seems to be similar to TimeSeries. AxisArrays has interesting functionality with intervals:

```julia
using AxisArrays, Dates
t = Dates.now()
timerange = t : Millisecond(5) : t + Millisecond(45)
data = reshape(1:20, :, 2) |> collect

a = AxisArray(data; time = (timerange), chan = ([:c1, :c2]))

a[time = t, chan = :c1]
a[time = t..t+Millisecond(10), chan = :c2]
```

I agree that we can start with option 2 from the table.

I agree we should return time series objects on getindex, select, transform or filter, and a single row object on getindex. For a row object, maybe we can transform a row to data like this: data = row[]. Or it can be wrapped, similar to CategoricalValue. The type of data can be a named tuple, but only if we add static tables, or some function barriers for them. As for the issues with transform or filter, something similar to the enumerate iterator comes to mind: for (i, v) in enumerate(x) ... end. We can start without time in the returned result and see what happens then.

I agree on joins on non-index columns. Should join work with other table types (e.g. join TimeSeries with DataFrame and produce TimeSeries)?

In general, I don't see any disagreements with your proposals.

iblislin commented 3 years ago

The next step is completing the interface specs. I think the naming issue is the most difficult part. Feel free to correct me if my naming is confusing.

A. Timestamps

A.1. Time types

I only want to discuss (b) Period and (d); the others are fine for me.

About the time relative to some reference point: I think the only information that `Period` can hold about this is the 'offset' value; the reference point itself cannot be held in this type. So let me clarify it, I want this:

(b) Period: for time intervals, or a time offset value (time relative to some reference point).

(d) This is just a `Period`:

```julia
julia> Minute <: Period
true
```

Do we need a type for holding both reference point + offset value? Maybe not. At least I cannot find such a type for Cartesian coordinates or for C/C++ pointers; there isn't a type that contains both pieces of info. So I think `Period` is enough.

The general policy for adding an operation:

- If the operation follows a `Base` interface, just implement it in the specific package.
- If not, define a new interface in TimeSeries, and make all types support it.

Operations

| Operations | interface | comment |
| ---------- | --------- | ------- |
| Arithmetic | `Base.:+`, `Base.:-`, ...etc | "Add, subtract." These should be provided by `Base` or the type package. TimeSeries won't change their behavior. |
| Reduce precision | :question: `Base.round`, `Base.ceil`, `Base.floor` | https://docs.julialang.org/en/v1/stdlib/Dates/#Rounding so I think this one is defined by each package. |
| Comparison | `Base.:>`, `Base.:<` | "what should we do if they have different precision?" Just let packages define their own behavior. |

A.2. Vector of timestamps

(Well, I knew nothing about point processes before you mentioned them; any resources that I can consult are appreciated.)

But if they can repeat, we can get several elements instead of one, when requesting only one time value. Also, should join repeat rows if it joins one row of table 1 with several rows from table 2?

Allowing repeats does introduce some API design issues. I'm not sure which one is the good design, so I just write down my thoughts here.

a) Secondary index

By automatically building a secondary index that represents the given order, the join (or other) operations can be done via the compound key of timestamp + secondary index. For instance:

| timestamp | secondary index | value |
| -------------- | ---------------------- | -------- |
| 10:00:00 | 1 | foo |
| 10:00:00 | 2 | bar |
| 10:00:00 | 3 | qwe |
| 10:30:00 | 1 | baz |
| ... | | |

> But if they can repeat, we can get several elements instead of one, when requesting only one time value.

We can provide two sets of APIs. The first set uses 'secondary index=1' as the default and returns a `Row` object when querying. The second set accepts the secondary index as an argument and returns an `AbstractTimeSeries` object when querying.

b) Parametric type

By adding a boolean parameter to the type that denotes the existence of repeated timestamps, we can dispatch the `AbstractTimeSeries` instance to different methods.

- If the boolean param is `true`, there are some repeated timestamps. Then all the related operations, like querying, return a complete `AbstractTimeSeries`.
- If `false`, the related operations return a `Row` object.
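The secondary index of option (a) can be derived from sorted timestamps in one pass. This is a sketch of the idea only; `secondary_index` and `lookup` are hypothetical helpers, and `lookup` assumes the requested key exists.

```julia
using Dates

# Number rows that share a timestamp 1, 2, 3, ... in their given order.
# Assumes `ts` is already sorted.
function secondary_index(ts::Vector)
    idx = Vector{Int}(undef, length(ts))
    for i in eachindex(ts)
        idx[i] = (i > 1 && ts[i] == ts[i-1]) ? idx[i-1] + 1 : 1
    end
    return idx
end

ts = [Time(10, 0), Time(10, 0), Time(10, 0), Time(10, 30)]
vals = ["foo", "bar", "qwe", "baz"]
sidx = secondary_index(ts)

# Query with the compound key (timestamp, secondary index);
# the default k = 1 mirrors the "first set of APIs" described above.
lookup(t, k = 1) = vals[findfirst(i -> ts[i] == t && sidx[i] == k, eachindex(ts))]
```

With the default `k = 1`, a query by timestamp alone always yields a single row, which is what makes returning a `Row` object well defined.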

I have some draft examples of how I use a combination of types from A.1 (a), (b), (d) as timestamps, and transformations between them: https://gist.github.com/sairus7/7a3f2ea6d3e0c34b4ea973d3b80105e8

This example is a good starting point.

A.2.i Types

So, assume that we have an abstract type `AbstractTimeAxis` that represents a vector of timestamps (or is there a suitable type maintained by `JuliaData`?).

```julia
abstract type AbstractTimeAxis{T} <: AbstractVector{T} end
```

where `T` can be any of the types mentioned in section A. I think the naming still needs more discussion; the aim is to not confuse users.

- `TimeGrid`: It stores the starting time, frequency (or period) and optional ending time. (I consider `RegularTimestamps` from @sairus7's example as a special case of this type.)
- `IrregularTimeGrid`: just rename the `IrregularTimestamps` in the example.
- `SparseTimeGrid`: rename the `SparseTimestamps` in the example.

A.2.ii Operations

a. Support the iterator protocol

Some types in @sairus7's example are lazy (they calculate the timestamp when needed). I will make them support the iterator protocol, so users can materialize them if desired.
b. Indexing-related operations

| Function | Return Type | Comment |
| ------------ | ------------------- | -------------- |
| `getindex(::AbstractTimeAxis, i::Int)` | `TimeType` or `KeyError` | Integer index -> `TimeType` |
| `getindex(::AbstractTimeAxis, t::TimeType)` | `Integer` or `KeyError` | Integer index <- `TimeType` |
| `findfirst(::Base.Fix2{typeof(>=),<:TimeType}, ::AbstractTimeAxis)`, `findlast`, `findnext` and `findprev` | `Integer` or `Nothing` | Integer index <- `TimeType`. Here `typeof(>=(Dates.now()))` is `Base.Fix2{typeof(>=),DateTime}`. |
| Other comparison functions (`>`, `==`, `!=`, `<`, `<=`) with the `find*` family | `Integer` or `Nothing` | Integer index <- `TimeType` |
| `findfirst(::typeof(nns), ::AbstractTimeAxis)` and `findlast` | `Integer` or `Nothing` | Integer index <- `TimeType`. The `nns` function is defined in section A.2.ii.h; the keyword argument `c` should be set. |
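The `Base.Fix2` dispatch in the table can be sketched as follows. `SortedAxis` is a hypothetical wrapper used only for this illustration; the point is that `>=(t)` produces a `Base.Fix2{typeof(>=),typeof(t)}`, so a sorted axis can replace the generic linear scan with a binary search:

```julia
using Dates

# Hypothetical sorted axis wrapper, used only for this illustration.
struct SortedAxis{T<:TimeType} <: AbstractVector{T}
    data::Vector{T}   # assumed to be sorted in ascending order
end
Base.size(a::SortedAxis) = size(a.data)
Base.getindex(a::SortedAxis, i::Int) = a.data[i]

# Specialized method: first index whose timestamp is >= the fixed argument `p.x`.
function Base.findfirst(p::Base.Fix2{typeof(>=),<:TimeType}, a::SortedAxis)
    i = searchsortedfirst(a.data, p.x)
    return i > length(a.data) ? nothing : i
end

axis = SortedAxis([DateTime(2020, 1, 1) + Hour(h) for h in 0:4])
hit = findfirst(>=(DateTime(2020, 1, 1, 1, 30)), axis)
miss = findfirst(>=(DateTime(2021, 1, 1)), axis)
```

Other comparison operators would get analogous methods (for example `searchsortedlast` for `<=` with `findlast`).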
c. Relative time calculation

The signature [`getindex(::TimeGrid, ::Period)` in @sairus7's example](https://gist.github.com/sairus7/7a3f2ea6d3e0c34b4ea973d3b80105e8#file-timestamps-jl-L33) is used for the relative time calculation. I think a common notation is in the form of `T+n` (e.g. `T+42`). Maybe we can have the `+` and `-` for this purpose?

| Function | Return Type | Comment |
| ----------- | ----------------- | ------------- |
| `getindex(::AbstractTimeAxis, ::typeof(+), n::Int)` | `Nanosecond` | Step forward `n` timesteps. An example implementation could be [here](https://gist.github.com/iblis17/61c56c4c099cc80e3774dfa9ef0f5427#file-timestamps-jl-L55). |
| `getindex(::AbstractTimeAxis, ::typeof(-), n::Int)` | `Nanosecond` | Step backward. |
| `getindex(::AbstractTimeAxis, ::typeof(+), p::Period)` | `Int` | Index of relative time. |
| `Base.:+(iter::AbstractTimeAxis, i::Union{Integer, Period})` | `AbstractTimeAxis` | Return a new iterator with both the starting and ending reference points changed |
| `Base.:-(iter::AbstractTimeAxis, i::Union{Integer, Period})` | `AbstractTimeAxis` | |

Example:

```julia
julia> g = TimeGrid(DateTime(2020, 1, 1), 60)
TimeGrid(DateTime("2020-01-01T00:00:00"), 60.0)

julia> g[+, 0]
0 milliseconds

julia> g[+, 42]
700 milliseconds

julia> g[+, Minute(1)]
3601
```
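For readers who want to try the `g[+, n]` notation, here is a minimal sketch. `FreqGrid` is a hypothetical stand-in for the proposed `TimeGrid` (a start time plus a frequency in Hz), and returning `Millisecond` rather than `Nanosecond` is a simplification chosen to match the example output:

```julia
using Dates

# Hypothetical stand-in for the proposed `TimeGrid`: start time plus sampling
# frequency in Hz. Not an AbstractVector here, to keep the sketch short.
struct FreqGrid
    start::DateTime
    freq::Float64   # samples per second
end

# g[+, n]: time offset reached after stepping forward n samples.
Base.getindex(g::FreqGrid, ::typeof(+), n::Integer) =
    Millisecond(round(Int, 1000 * n / g.freq))

# g[+, p]: 1-based index of the sample located `p` after the start.
Base.getindex(g::FreqGrid, ::typeof(+), p::Period) =
    round(Int, Dates.value(convert(Millisecond, p)) / 1000 * g.freq) + 1

g = FreqGrid(DateTime(2020, 1, 1), 60)
```

With a 60 Hz grid, `g[+, 42]` gives 700 milliseconds and `g[+, Minute(1)]` gives index 3601, the same numbers as in the example. The `Period` method assumes a fixed-length period (not `Month` or `Year`).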
d. Get a subvector of timestamps

Support `view` to return a new iterator, like the `RegularTimestamps` in the example code.

| Function | Return Type | Comment |
| ----------- | ----------------- | ------------- |
| `view(iter::AbstractTimeAxis, i::ClosedInterval{Int64})` | `AbstractTimeAxis` | Return a new iterator starting from `iter + i` to `j` timesteps. The interval type is provided by [IntervalSets.jl](https://github.com/JuliaMath/IntervalSets.jl). |
| `getindex(iter::AbstractTimeAxis, i::ClosedInterval{Int64})` | `AbstractTimeAxis` | Behaves the same as `view` |
| `collect(iter::AbstractTimeAxis)` | `Vector` | Materialize the vector of timestamps |

> changes timestamp relative to the first point of subvector?

I think this can be simply achieved by `cumsum(diff(iter))`. We can just add a shorthand method for it; what's the best naming?
e. Reduction operations

| Function | Return Type | Comment |
| ------------ | ---------------- | ------------- |
| `count(::AbstractRange, ::AbstractTimeAxis)` | `Integer` | Count the discrete time points in the time interval |
| `reduce(::Base.Callable, ::AbstractTimeAxis; init)` | `Any` | The associativity of the reduction is determined by the iterator via the iteration protocol. |
| `foldl(::Base.Callable, ::AbstractTimeAxis; init)` | `Any` | The associativity is small-to-big of timestamps |
| `foldr(::Base.Callable, ::AbstractTimeAxis; init)` | `Any` | The associativity is big-to-small of timestamps |
f. Resampling operations

- The resampling operations can perform both downsampling and upsampling.
- Maybe we can merge the code from https://github.com/femtotrader/TimeSeriesResampler.jl

| Function | Return Type | Comment |
| ------------ | ---------------- | ------------- |
| `resample(::AbstractTimeAxis, ::Period)` and `resample!` | `AbstractTimeAxis` | Change the frequency of the iterators |
| `resample(::AbstractTimeAxis, ::Real)` and `resample!` | `AbstractTimeAxis` | Change the frequency by a scale factor relative to the original frequency |
g. Consolidate two vectors of timestamps: merge and intersect

This is quite a complex case when dealing with `AbstractTimeAxis`. Let's discuss this issue with the types `TimeGrid` and `IrregularTimestamps`. I will consider `RegularTimestamps` as a special case of `TimeGrid`.

- case 1: given two `TimeGrid`s with period (or freq) _p_ and _q_, where the ratio _p/q_ is a rational number. The output type is `TimeGrid`.
- case 2: similar to case 1, but the ratio _p/q_ is an irrational number. The output type is `IrregularTimestamps`.
- case 3: any one of the input types is an `IrregularTimestamps`. The output is `IrregularTimestamps`.

For cases 1 and 2, I reduce the problem to the product (or sum) of periodic functions. The product (or sum) of two periodic functions may or may not be a periodic function; it depends on the period ratio. But in the implementation, I want to treat a normal `Float64` as a rational number unless the user uses the type `Irrational` (e.g. `Irrational{:π}`) explicitly.

| Function | Return Type | Comment |
| ------------ | ---------------- | ------------- |
| `merge(::AbstractTimeAxis, ::AbstractTimeAxis)` or `:*(...)` | `AbstractTimeAxis` | Similar to the product of two periodic functions |
| `intersect(::AbstractTimeAxis, ::AbstractTimeAxis)` or `:+(...)` | `AbstractTimeAxis` | Similar to the sum of two periodic functions |
h. Search for points with given criteria, like equal or nearest neighbours

For two point processes:

> - Find indices of pairs with equal timestamps.
> - Find nearest neighbours (one-to-one).
> - Find nearest neighbours within time radius (one-to-one).
> - For previous three operations - also get all unmatched indices from first and second timestamp vector (to subtract one from another)

I propose implementing `findall` with extra functions/helper types representing the criteria:

| Function | Return Type | Comment |
| ---- | ---- | ---- |
| `findall(::typeof(==), ::AbstractTimeAxis, ::AbstractTimeAxis)` | `Vector{Tuple{Int,Union{Int,Missing}}}` | The indices of pairs. If unmatched, it will be filled with `missing`. |
| `findall(::NearestNeighbours, ::AbstractTimeAxis, ::AbstractTimeAxis)` | `Vector{Tuple{Int,Union{Int,Missing}}}` | |
| `nns(; k::Int=1, c=nothing, radius::Period, direction=:both)` | `NearestNeighbours` type | `nns` stands for nearest neighbour search. `direction` can be `:forward`, `:backward` or `:both`. `c` is the centroid and is not used in this case. |

- I found that [`AxisKeys.jl`](https://github.com/mcabbott/AxisKeys.jl#selections) uses the name `Near` for nearest neighbours search. Which is the better name: `nns`, `Near` or `≈ (\approx)`?
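To illustrate the proposed return shape `Vector{Tuple{Int,Union{Int,Missing}}}`, here is a small sketch of a nearest-neighbour search within a radius. The function name `nearest_within` is hypothetical, and unlike the proposal this simple version does not enforce one-to-one matching:

```julia
using Dates

# Hypothetical helper: for each timestamp in `a`, find the index of the nearest
# timestamp in sorted `b` that lies within `radius`; `missing` if there is none.
# `radius` is assumed to be a fixed-length period (not Month or Year).
function nearest_within(a::Vector{T}, b::Vector{T}, radius::Period) where {T<:TimeType}
    out = Vector{Tuple{Int,Union{Int,Missing}}}(undef, length(a))
    for (i, t) in enumerate(a)
        j = searchsortedfirst(b, t)          # first index with b[j] >= t
        best = missing
        for k in (j - 1, j)                  # candidates on both sides of t
            1 <= k <= length(b) || continue
            d = abs(b[k] - t)
            if d <= radius && (best === missing || d < abs(b[best] - t))
                best = k
            end
        end
        out[i] = (i, best)
    end
    return out
end

a = [DateTime(2020, 1, 1, 0, 0, 0), DateTime(2020, 1, 1, 0, 0, 10)]
b = [DateTime(2020, 1, 1, 0, 0, 1), DateTime(2020, 1, 1, 0, 1, 0)]
pairs = nearest_within(a, b, Second(2))
```

Here the first timestamp of `a` matches `b[1]` (one second away), while the second has no neighbour within two seconds and is paired with `missing`.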
i. Common vector operations

Here I only list some operations that are notable for discussion.

| Function | Return Type | Comment |
| ------------ | ---------------- | ------------- |
| `diff(::AbstractTimeAxis)` | `AbstractTimeAxis` | Use the lazy representation if possible. |

B. Time vector / column

Here we have just two synced vectors - timestamp vector from A.2., and data vector. I will call it here a column.

And I will consider the time vector to be an `AbstractTimeSeries`.

B.1 Operations

a. Indexing

| Function | Return Type | Comment |
| ----------- | ----------- | -------------- |
| `getindex(::AbstractTimeSeries, ::TimeType)` | `Row` type or `KeyError` | Timestamp -> Row type. Row type will hold `index + timestamp + value` |
| `getindex(::AbstractTimeSeries, ::Int)` | `Row` type | Integer index -> Row type |
| `getindex(::AbstractTimeSeries, ::TimeType, ::Symbol)` | `Any` | If the `::Symbol` refers to the timestamp vector, this function returns the index. For other cases, it returns the value from the data vector. |
| `getindex(::AbstractTimeSeries, ::Int, ::Symbol)` | `Any` | If the `::Symbol` refers to the timestamp vector, this function returns the timestamp. For other cases, it returns the value from the data vector. |
| `setindex!(::AbstractTimeSeries, ::AbstractTimeAxis, ::Symbol)` | `AbstractTimeSeries` | Change the timestamp vector and re-sort, e.g. `ts[:time] = TimeGrid(...)` |
| `setindex!(::AbstractTimeSeries, ::TimeType, ::Symbol, ::Union{Int, TimeType})` | `AbstractTimeSeries` | Update a specific timestamp and re-sort the data, e.g. `ts[:time, 42] = DateTime(...)` or `ts[:time, DateTime(...)] = DateTime(...)`. (Is there a better API design, or is this acceptable?) |
| `findfirst(::Base.Fix2{typeof(==),<:TimeType}, ::AbstractTimeSeries)` and other comparison operators | `Integer` or `Nothing` | Timestamp -> integer index. |
| `findfirst(::typeof(nns), ::AbstractTimeSeries)` | `Integer` or `Nothing` | Timestamp -> integer index. The `nns` function is defined in section A.2.ii.h. |
| `findlast`, `findnext` and `findprev` | `Integer` or `Nothing` | Timestamp -> integer index. |
b. Element-wise binary operations with two `AbstractTimeSeries`

- The two `AbstractTimeSeries` should have the same number of columns.
- Determine the new timestamp.
  - First, determine the output type from the `merge` algorithm of two timestamp vectors from section A.2.ii.g.
  - If the new timestamp vector is still a `TimeGrid` (or `RegularTimestamp`), then check the frequency issue, and execute the higher-freq to lower-freq (many-to-one) mapping.
  - If the user wants to change/reduce the output freq, one can use the `resample` function later.
- The empty data values will be filled with `missing`.

| Function | Return Type | Comment |
| ---- | ---- | ---- |
| `.+`, `.-`, `.*`, `./` and other arithmetic operations | `AbstractTimeSeries` | |
c. Get a sub-table or view

| Function | Return Type | Comment |
| ---- | ---- | ---- |
| `view(::AbstractTimeSeries, ::Union{Int, UnitRange{<:Int}, Colon, TimeType, AbstractTimeAxis}, ::Union{Int, UnitRange{<:Int}, Colon, Symbol, Vector{Symbol}} )` | Sub-table type | Should not copy the data |
| `getindex` with the same arguments as `view` | `AbstractTimeSeries` | It will copy |
| `filter(::typeof(nns), ::AbstractTimeSeries)` and `filter!` | `AbstractTimeSeries` | `filter!` is optionally supported. |
| `filter(::Base.Callable, ::AbstractTimeSeries)` and `filter!` | `AbstractTimeSeries` | If the number of columns is 1, the input of the custom function is the scalar value. If there are two or more columns, the input of the custom function is a named tuple. `filter!` is optionally supported. |
| `select(::AbstractTimeSeries, ::Symbol...)` and `select!` | `AbstractTimeSeries` | `select!` is optionally supported. |
d. `join` operations

- The method naming issue: (https://github.com/JuliaData/DataFrames.jl/issues/2092). One of the points is to be consistent with SQL and dplyr.
- Include DataAPI to extend the interface. (https://github.com/JuliaData/DataAPI.jl/pull/36)
- `crossjoin` won't be supported.
- In all `join` operations, the time axis is the required join key.
- If the nearest neighbour search is needed for the join condition, pass its output to `on`, e.g. `on = nns(...)`.
- The `on` kwarg can be passed a vector of column names to create a compound join key.

| Function | Return Type | Comment |
| ------ | ------- | ------ |
| `innerjoin(::AbstractTimeSeries, ::AbstractTimeSeries; on = nothing)` | `AbstractTimeSeries` | One-to-one comparison only. No extra handling for high-to-low or low-to-high frequency cases. |
| `outerjoin(::AbstractTimeSeries, ::AbstractTimeSeries; on = nothing)` | `AbstractTimeSeries` | One-to-one comparison only. |
| `leftjoin(::AbstractTimeSeries, ::AbstractTimeSeries; on = nothing)` | `AbstractTimeSeries` | |
| `rightjoin(::AbstractTimeSeries, ::AbstractTimeSeries; on = nothing)` | `AbstractTimeSeries` | |
| `semijoin(::AbstractTimeSeries, ::AbstractTimeSeries; on = nothing)` | `AbstractTimeSeries` | |
| `antijoin(::AbstractTimeSeries, ::AbstractTimeSeries; on = nothing)` | `AbstractTimeSeries` | |
e. Change the time vector

| Function | Return Type | Comment |
| ---- | ---- | ---- |
| `lag(::AbstractTimeSeries, n::Integer)` | `AbstractTimeSeries` | `lag!` is optionally supported |
| `lead(::AbstractTimeSeries, n::Integer)` | `AbstractTimeSeries` | `lead!` is optionally supported |
| `reindex(::AbstractTimeSeries, AbstractTimeAxis; pad = :const, padvalue = missing)` | `AbstractTimeSeries` | Change the time axis. Are there other naming options? The term `reindex` exists in [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.Series.reindex.html). The in-place method `reindex!` is optionally supported. The padding method for missing values is controlled by the keyword arg `pad`, which can be `:const`, `:forward`, `:backward` or `:nearest`. |
f. Reduction operations on values for a single table

| Function | Return Type | Comment |
| ---- | ---- | ---- |
| `maximum(::AbstractTimeSeries; dims::Integer)` | `AbstractTimeSeries` | The kwarg `dims` is required. If `dims=1`, it returns a table with the latest timestamp. |
| `minimum(::AbstractTimeSeries; dims::Integer)` | `AbstractTimeSeries` | The kwarg `dims` is required. If `dims=1`, it returns a table with the first timestamp. |
| `findmax(::AbstractTimeSeries; dims::Integer)` | `Tuple{AbstractTimeAxis,Vector{Int}}` | `dims` is required. |
| `findmin(::AbstractTimeSeries; dims::Integer)` | `Tuple{AbstractTimeAxis,Vector{Int}}` | `dims` is required. |
| `cumsum(::AbstractTimeSeries; dims::Integer)` | `AbstractTimeSeries` | `dims` is required. If `dims=1`, it returns a table with the first timestamp. |
| `cumprod(::AbstractTimeSeries; dims::Integer)` | `AbstractTimeSeries` | `dims` is required. If `dims=1`, it returns a table with the first timestamp. |
| `argmax(::AbstractTimeSeries; dims::Integer)` | `AbstractTimeSeries` | The kwarg `dims` is required. If `dims=1`, it returns a table with the latest timestamp. |
| `argmin(::AbstractTimeSeries; dims::Integer)` | `AbstractTimeSeries` | The kwarg `dims` is required. If `dims=1`, it returns a table with the first timestamp. |
g. Resampling operations

| Function | Return Type | Comment |
| ---- | ---- | ---- |
| `resample(f::Base.Callable, ::AbstractTimeSeries, ::Period)` and `resample!` | `AbstractTimeSeries` | The input of function `f` is a view of `AbstractTimeSeries`, the output should be an `AbstractTimeSeries` (or maybe support `NamedTuple`?). `resample!` is optionally supported. |
| `resample(f::Base.Callable, ::AbstractTimeSeries, ::Real)` and `resample!` | `AbstractTimeSeries` | The input of function `f` is a view of `AbstractTimeSeries`, the output should be an `AbstractTimeSeries`. `resample!` is optionally supported. |

C. Table

seems like this is just a set of columns from B, which have the same timestamps vector?

Yes, so how about treating sections B and C as the same? I don't think any operations differ between B and C.

Also, we missed AxisArrays.jl for option 8 (https://github.com/mbauman/Signals.jl was deprecated in favor of AxisArrays.jl), seems like it is similar to TimeSeries. AxisArrays has interesting functionality with intervals:

The interval feature looks great. If I understand correctly, that interval data type is provided by IntervalSets.jl, and we can support it.

sairus7 commented 3 years ago

A side note about precision and rounding, which is closely related to the question from A.2: "should we think that each timestamp has non-zero length equal to timestep?"

Why would we need it? I am thinking about how to represent time segments (intervals) as timestamps, and the main difference is that intervals have an additional "time length" attribute. This makes me think that a timestamp is not a point with zero length, but a time interval of "unit" length. This is similar to the inner representation of the timestamp itself as an integer value (UTInstant) of either nanoseconds, minutes, days, months, etc., so any floating-point value is truncated to the nearest previous integer.

But AFAIK there are no methods to check that a higher-resolution timestamp lies within a lower-resolution timestamp. More than that, we don't even know the actual resolution, right?

```julia
using Dates
t_month = floor(Dates.now(), Dates.Month)
t_sec = floor(Dates.now(), Dates.Second)
t_sec in t_month == true # method error
```

In this example, `t_sec` should start at the current second it points to and end just before the next second. `t_month` should start on the first day of the current month and end before the first day of the next month. With this knowledge we can more naturally join, groupby (or resample) two timestamp vectors with different known resolutions.
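One way to express this "timestamp as interval" idea is sketched below. The helper `within` is hypothetical, and the resolution has to be passed in explicitly, since, as noted above, a plain `DateTime` does not carry it:

```julia
using Dates

# Hypothetical helper: treat `t` as the half-open interval [t, t + resolution)
# and check whether the finer-grained timestamp `s` falls inside it.
within(s::TimeType, t::TimeType, resolution::Period) = t <= s < t + resolution

moment = DateTime(2021, 3, 15, 12, 30, 45, 123)   # fixed instead of now(), for reproducibility
t_month = floor(moment, Month)
t_sec = floor(moment, Second)

inside = within(t_sec, t_month, Month(1))
outside = within(DateTime(2021, 4, 1), t_month, Month(1))
```

Here `inside` is `true` and `outside` is `false`, because the month interval ends just before the first day of the next month.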

I'm not sure if we should leave this to user knowledge of his data, or decide to make some additional time-interval operations and check for (or dispatch on) known and unknown time-length. But if we do, then we should add some additional timestamp vector types with metadata.

iblislin commented 3 years ago

I'm not sure if we should leave this to user knowledge of his data, or decide to make some additional time-interval operations and check for (or dispatch on) known and unknown time-length. But if we do, then we should add some additional timestamp vector types with metadata.

I think the "time length" attribute will only be related to additional operations. It is only meaningful when doing operations against the time length attribute; we won't `getindex` and inspect a single time length, right? I would design this feature to mimic the `isless` function in `sort(..., lt=isless)`, e.g. make a bunch of time-length measurement functions that can be applied to join, groupby, etc. via a keyword arg. Then we can have a default function as you described previously.

I googled around this topic a bit. Maybe we can borrow some operation designs from here: https://www.codeproject.com/Articles/168662/Time-Period-Library-for-NET

I'm not sure if we should leave this to user knowledge of his data, ...

So, yes, we should leave it to user knowledge, but with a common assumption as default.

Arkoniak commented 3 years ago

Sorry, I'm late to the party, but I have a couple of things to add.

segments - irregularly sampled timeseries with two timestamp values (start, stop) for elements that have some extent in time.

I've met this situation too, but there is an easy(?) workaround; at least it worked for me. Since `TimeArray` accepts any `TimeType` subtype, the user can define

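```julia
using Dates

# A timestamp that also carries the duration of the bar it starts.
struct DateTimeBar{T <: TimeType, L <: Real} <: TimeType
  ts::T
  duration::L
end

duration(x::DateTimeBar) = x.duration
# Order bars by their start time only.
Base.isless(x1::DateTimeBar, x2::DateTimeBar) = isless(x1.ts, x2.ts)
```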

and generate a vector of "bar" times. There is no need to create an extra column or do anything like that.

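Something like that can work for counting times:

```julia
using Dates

struct CountingDateTime{T <: TimeType, L <: Period} <: TimeType
  start::T
  offset::L
  counter::Int
end

# Materialize the actual time: start plus `counter` steps of `offset`.
Dates.DateTime(x::CountingDateTime) = x.start + x.counter * x.offset
```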

so specialized functions can be written if needed (maybe even in another package?) to work with such a type.

Exploring this idea further, one can define

```julia
struct DateTimeWithKeys{T <: TimeType, S <: Tuple} <: TimeType
  ts::T
  keys::S
end
```

and generate a time column with embedded keys. For example, if you gather signals from different sources, you can have something like

```julia
dts = [DateTimeWithKeys(Date("2021-01-01"), ("Device A", )),
       DateTimeWithKeys(Date("2021-01-01"), ("Device B", )),
       DateTimeWithKeys(Date("2021-01-02"), ("Device A", )),
       DateTimeWithKeys(Date("2021-01-03"), ("Device C", ))]
```

and "keys" can be used for filtering, joining, sorting, etc. This idea is actually implemented in Google's BigTable design http://static.googleusercontent.com/media/research.google.com/en/us/archive/bigtable-osdi06.pdf
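As a small illustration of how the embedded keys could be used, the sketch below filters and sorts such a vector with plain Base functions. The ordering rule (timestamp first, then keys) is an assumption made for this example, not something the type defines by itself:

```julia
using Dates

struct DateTimeWithKeys{T <: TimeType, S <: Tuple} <: TimeType
    ts::T
    keys::S
end

dts = [DateTimeWithKeys(Date("2021-01-03"), ("Device C",)),
       DateTimeWithKeys(Date("2021-01-01"), ("Device B",)),
       DateTimeWithKeys(Date("2021-01-02"), ("Device A",)),
       DateTimeWithKeys(Date("2021-01-01"), ("Device A",))]

# Filtering: keep only the rows that came from "Device A".
device_a = filter(d -> "Device A" in d.keys, dts)

# Sorting: assumed order is timestamp first, then keys (passed via `by`,
# so no `isless` method is needed for this sketch).
sorted = sort(dts; by = d -> (d.ts, d.keys))
```

After sorting, the two rows for 2021-01-01 come first, with "Device A" before "Device B".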

Regarding Row values which should be returned when a table is indexed, maybe it makes sense to utilize JuliaQuant/Timestamps.jl? I revived it recently after a few years of hibernation, and one of the ideas was to have a useful row-level representation of timestamped data. It can solve some questions like "what to return: value or timestamp + value", since you can return a Timestamp and provide utils to work with it conveniently. The package is in its infancy now, so it is easy to adapt it to the needs of TimeSeries.

iblislin commented 3 years ago

Regarding Row values which should be returned when a table is indexed, maybe it makes sense to utilize JuliaQuant/Timestamps.jl?

Oh, that may be a good option. After I finish the interface spec in this discussion thread, we can check whether Timestamps.jl fits or not.

iblislin commented 3 years ago

@sairus7 I think the first draft of the interface spec is finished: https://github.com/JuliaStats/TimeSeries.jl/issues/482#issuecomment-792278887.

Could you review it?

rafaqz commented 2 years ago

I'm looking into using the methods in this package in DimensionalData.jl/GeoData.jl - when there is a time dimension present, as in AxisArrays.jl. Often we have multidimensional arrays where time is one of the dimensions.

GeoData.jl also defines GeoSeries where separate (often disk-based) objects are organised in a timeseries (and will usually load as an AbstractArray). It would be good to be able to apply the functions here over these multi-array series.

So to add to this functionality review, it would be useful if this package generalised to working with any arbitrary-dimension arrays organised in a time-series vector, somewhat like how Interpolations.jl does that.

ParadaCarleton commented 2 years ago

Has any kind of AbstractTimeSeries interface been implemented? I ask because I'm interested in writing up an autocovariance estimation interface for StatsBase. I think it'd be very nice to have some way to wrap an arbitrary table or array in a time series and then have StatsBase functions like sem work on it automatically.