Open sairus7 opened 3 years ago
Hi @sairus7. I appreciate you sharing this valuable review. I agree with the main concepts and with points I through V, VII, IX, and X.
`filter` function (some discussion can be found at #456) and add the integration of `TableOperations.jl`.
`basecall` is the method you want, if we can handle the case of heterogeneous series. (And maybe we can work out a better name for `basecall`.)
There are many designs whose original intent I'm not sure about (I guess most of them are for financial time series). It's time to get this package overhauled.
Let's work through the issues one by one.
I. The `AbstractTimeSeries`. I will divide this issue into two smaller pieces:
a. The type parameter of `AbstractTimeSeries`.
b. The function list of `AbstractTimeSeries`. We can list the functions first and define the signatures later, and consider which methods from Tables.jl/TableOperations.jl need to be included.
What's your idea about the first issue? I'm studying Tables/TableOperations; I will post my blueprint here later.
Here is my proposal:
I. `AbstractTimeSeries` Design
a. The type parameter: `AbstractTimeSeries{T}`, where `T` denotes the type of the time index.
a.1 If `T <: Tuple`, for instance `Tuple{Date,Time} <: Tuple`, multiple columns form the time index.
a.2 If `!(T <: Tuple)`, there is only one column as the time index.
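As a rough illustration of the type parameter design above, the following sketch dispatches on whether `T` is a tuple. The names `hasmultiindex` and `DemoSeries` are hypothetical and exist only for this example; they are not part of TimeSeries.jl.

```julia
using Dates

# Sketch of the proposed abstract type; `T` is the type of the time index.
abstract type AbstractTimeSeries{T} end

# Hypothetical helper: true when several columns together form the time index.
hasmultiindex(::Type{<:AbstractTimeSeries{T}}) where {T} = T <: Tuple
hasmultiindex(ts::AbstractTimeSeries) = hasmultiindex(typeof(ts))

# Hypothetical concrete subtype, used only to exercise the helper.
struct DemoSeries{T} <: AbstractTimeSeries{T}
    index::Vector{T}
    values::Vector{Float64}
end

single = DemoSeries([Date(2021, 1, 1)], [1.0])               # case a.2
multi = DemoSeries([(Date(2021, 1, 1), Time(12))], [1.0])    # case a.1
```

Here `single` has `T == Date`, while `multi` has `T == Tuple{Date,Time}`, so the same helper distinguishes the two cases without any runtime data inspection.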
b. The interface.
b.1 `Tables.jl` integration

Function | Support | Comment |
---|---|---|
`Tables.istable` | :heavy_check_mark: | `AbstractTimeSeries` returns `true` by default. |
`Tables.columnaccess` | :heavy_check_mark: | `true` by default. |
`Tables.columns` | :heavy_check_mark: | Just return the `AbstractTimeSeries` object by default. |
`Tables.rowaccess` | :heavy_check_mark: | `true` by default. |
`Tables.rows` | :heavy_check_mark: | Maybe we need a type to represent a row. What is the common terminology for a row of a time series? |
`Tables.schema` | :heavy_check_mark: | Return the schema. |
`Tables.materializer` | :heavy_check_mark: | Return the constructor of the concrete type. |
`Tables.getcolumn(table, i::Int)` | :heavy_check_mark: | Returns the value vector. Setting `i = 0` returns the time index vector. |
`Tables.getcolumn(table, nm::Symbol)` | :heavy_check_mark: | Returns the value vector. `nm = :timestamp` returns the time index vector. |
`Tables.columnnames(table)` | :heavy_check_mark: | Return the column names, with `:timestamp` as a reserved one. |
`Tables.getcolumn(table, ::Type{T}, i::Int, nm::Symbol)` | Optional | |
`Tables.partitions` | TBD | Not sure about the use cases of this function. |
`Tables.partitioner` | TBD | |

b.2 `TableOperations.jl` integration

Function | Support | comment |
---|---|---|
TableOperations.select | :heavy_check_mark: | I think there is nothing to do for our package once the Tables.jl integration is done. Just write some test cases to check correctness. |
TableOperations.transform | :heavy_check_mark: | test cases only, like the select function |
TableOperations.filter | :heavy_check_mark: | test cases only |
TableOperations.map | :heavy_check_mark: | test cases only |
@sairus7 could you review these ideas? Then I can make a PR for this proposal.
@iblis17 I.a.
> `AbstractTimeSeries{T}` where `T` denotes the type of time index

Should there be any other compile-time information besides the time type? What other specific methods and properties does this type have that are not covered by the table interfaces?
a.1, a.2
> multiple columns forms the time index

Do you mean that there can be two columns with different precision for time, as in https://docs.julialang.org/en/v1/stdlib/Dates/#Dates.Date ?
I think this can be solved with "batching", where you have one timestamp type for a whole bunch of records (a table of tables, or partitions) and a detailed relative timestamp for every element within that batch. So I am not sure we should add it at the level of a single table. (But we can add a reference time as a metadata field for a whole table object.)
As for the other idea of having several timestamps (like a time interval t1:t2) for each row, I think it is similar to sparse arrays for highly duplicated data (since all rows between t1:t2 have the same values), and we should think about it after defining the main functionality.
1.b About the interface.
I think we should first outline the main difference between TimeSeries and table-data packages like DataFrames.jl or IndexedTables.jl; otherwise there is no need for TimeSeries.jl itself, especially if we decide to support heterogeneous column types and stick to the table interface. Or, if there are only minor differences, we can just rewrite TimeSeries.jl as a thin wrapper around those packages.
From my point of view, time series differ from simple arrays because they have timestamps and some specific operations. I think of a timestamp as a "position in time", or a time index, in addition to the integer element index.
So, here is a list of questions that should be answered first:
1) Heterogeneous column types: yes or no? If we treat a time series as any data that is bound to a timestamp, then we should allow columns of different types. On the other hand, with a matrix we have convenient matrix operations over homogeneous channel groups. We can always wrap a whole matrix as a single column, at the cost of not being able to operate on individual columns within the matrix.
2) Static or dynamic columns? Can we add columns to an existing time series, or are the column number and types (schema) known in parametric types at compile time, as in IndexedTables.jl?
3) Tabular operations (link1, link2): what types do they return? For example, should `getindex` return a row object, or just a tuple? Should `select` return a single-column table, some column object, or a vector?
4) Should we always store timestamps in sorted order? So, if we insert a new row or change a timestamp, should we always re-sort the rows? Should timestamps be unique or not? Should we restrict `join` to timestamps only?
5) Should timestamps always be part of the table data? Should the timestamp always be the first column? Should it appear in the column list, be a "hidden" column like an index, or both?
6) Should we always add the timestamp to any return type (subtable / row / column), even if it is not queried from the table? What is a row: timestamp + row data? What is a column: timestamp + column data?
7) What if a table has many time columns? Can we switch which columns are used as timestamps?
Maybe @bkamins can give us his opinion on this?
Thank you for working on this. As there are many aspects to the issue (and I probably do not grasp everything you discussed), I would start with the question "what is the main use case for TimeSeries.jl?" and then design against this use case.
E.g. the DataFrames.jl design objective is to be maximally flexible, possibly at the cost of performance (when used correctly it is fast though), i.e. to be used when no more specialized package exists.
I would assume that TimeSeries.jl would be a more specialized package, providing functions that can only be made available if we have a notion of a time index. There are many such use cases that are currently hard with DataFrames.jl, e.g.:
So, in summary, the question is: what features should TimeSeries.jl provide so that a user would want to switch from DataFrames.jl to TimeSeries.jl for some specific task? This will probably mean that TimeSeries.jl should impose more restrictions by default than DataFrames.jl, with the benefit of doing things better (as currently you can do anything with DataFrames.jl, but not always fast or conveniently).
Below is what I would find intuitive as answers to the 7 points you raised:
> Heterogenous column types - yes or not? If we treat timeseries as any data, that is bound to a timestamp, then we should add columns of different types. On the other side, using matrix we have convenient matrix operations over homogenous channel groups. Although we can always wrap a whole matrix as a single column (at the cost of not operate with individual columns within matrix).

I do not think it would be super useful (though sometimes it might be). So if you see big benefits in having a homogeneous type, I would go for the homogeneous choice.
However, my intuition is that there will not be big benefits.

> Static or dynamic columns? Can we add columns to an existing timeseries, or column number and types (schema) are known in parametric types at compile-time, like IndexedTables.jl does?

If you go for a homogeneous type, I think dynamic is definitely better (as you get type inference for free).
However, in general, although DataFrames.jl is not type stable, you can mostly switch to a "type-stable" mode easily.
Actually, I think the crucial question is whether you want to allow adding rows to a time series in place. I assume you do (which, e.g., means that you cannot use `Matrix` as the internal representation).

> Tabular operations (link1, link2), what types they do return? For example, should `getindex` return a row object, or just a tuple? Should select return a single-column table, or some column object, or a vector?

Do you see any uses for such a row object (like taking advantage of it knowing its timestamp)? If yes, I think it is not a problem to have a custom type for it. Also, do you want the row object to be a view (like `DataFrameRow`) or a copy (which e.g. a `Tuple` would be)?

> Should we always store timestamps in sorted order? So, if we insert a new row, or change a timestamp, we should always re-sort rows? Should they be unique or not? Should we restrict join only by timestamp?

I would find keeping them sorted intuitive. In what cases would you want to allow changing a timestamp? I feel that it should be immutable. I also feel that timestamps should be unique and that `join` should be performed only on the timestamp (at least by default).

> Should timestamps be always part of table data, or not? Like, should it always be the first column? Should it ever be in column list, or a "hidden" column like index, or both options?

I would normally think that it should be a "hidden" column like an index.

> Should we always add timestamp to any return type (subtable / row / column), even if it is not queried from table? What is a row, timestamp + row data? What is a column, timestamp + column data?

For me, the timestamp would be an index only.

> What if a table have many time columns, can we switch, what columns to use as timestamps?

I think it should create a new table.

Please treat these comments as loose first impressions, of course.
I.a.
> Should there be any other compile-time information, except time type?

I think for `AbstractTimeSeries` it is the time type only; for the subtypes, we can add other type parameters if needed.
So the time type is the minimal requirement.

> What are other specific methods and specific properties with this type, that are not covered by table interfaces?

Ah, right. I'll try to list some here, and we can complete the list later, once we decide on some key design points.

Function | Return type | Comment |
---|---|---|
`length(::AbstractTimeSeries)` | `Int` | |
`ndims(::AbstractTimeSeries)` | `Int` | |
`size(::AbstractTimeSeries, ::Int)` | `Int` | |
`axes(::AbstractTimeSeries)` | `Int` | |
`axes(::AbstractTimeSeries, ::Int)` | `Int` | |
`copy(::AbstractTimeSeries)` | `AbstractTimeSeries` | |
`deepcopy(::AbstractTimeSeries)` | `AbstractTimeSeries` | |
`similar(::AbstractTimeSeries)` | `AbstractTimeSeries` | |
`names(::AbstractTimeSeries)` | `Vector{Symbol}` | |
`rename(::AbstractTimeSeries, ::Pair...)` | `AbstractTimeSeries` | |
`rename!(::AbstractTimeSeries, ::Pair...)` | `AbstractTimeSeries` | |
`vcat(::AbstractTimeSeries, ::AbstractTimeSeries)` | `AbstractTimeSeries` | |
`hcat(::AbstractTimeSeries, ::AbstractTimeSeries)` | `AbstractTimeSeries` | |
`hvcat(::Tuple{Vararg{Int}}, ::AbstractTimeSeries...)` | `AbstractTimeSeries` | |
`view(::AbstractTimeSeries, dims...)` | `AbstractTimeSeries` | It seems that we need to implement a `Sub-` type for each concrete type, like `SubArray` does. |
`first(::AbstractTimeSeries)` | (TBD) | Maybe an `AbstractTimeSeriesRow`? |
`last(::AbstractTimeSeries)` | (TBD) | |
`stack` and `unstack` | Are there any real use cases for time series data? | |
`join(::AbstractTimeSeries, ::AbstractTimeSeries)` family | `AbstractTimeSeries` | |
`select(::AbstractTimeSeries, args...)`, `select!`, `transform` and `transform!` | `AbstractTimeSeries` | :question: The user might need to create a new column from two (or more) original columns, so the input type of the custom callable `f` is critical. There are two designs: (1) timestamp + value, (2) value only. I can easily find a real case that uses (2): calculating the ratio of two columns. In this case the timestamp is useless for `f`; the user just writes something like `select(ats, [:a, :b] => /)` and (2) works perfectly. Are there any cases where we need to adopt (1)? |
`filter(::Callable, ::AbstractTimeSeries)` and `filter!` | `AbstractTimeSeries` | :question: The same story applies to `filter` and `map`: the issue of the input type, (1) timestamp + value or (2) value only. If we choose (2), we get support for tons of functions from Base or other packages (like `isnothing`, `iszero`, etc.). Maybe we need to find cases that need (1) and investigate them. In my personal use, (2) is quite common: I usually need to get rid of `NaN`, `Inf`, or `0`. |
`map(::Callable, ::AbstractTimeSeries)` | :question: I think this functionality is replaced by `select` and `transform`, and we don't need it. | |
`moving`, `reduce`, `foldl` and `foldr` | `AbstractTimeSeries` | :question: We can handle iterating over row values with `select` or similar. The other dimension is calculating against a whole column (or a subset of it). A classic example is a running mean, whose input type will be (2). Also, `lag`, `lead`, or `diff` only need input type (2) to work. :question: 2: the naming of this function. IIUC, pandas names it `rolling`? |

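To make input type (2) concrete, here is a minimal sketch of a value-only moving window. The name `moving_values` and its signature are assumptions for illustration, not the package's API; the callable only ever sees a window of values, never timestamps.

```julia
# Apply `f` to each window of `w` consecutive values (input type (2): values only).
function moving_values(f, values::AbstractVector, w::Integer)
    1 <= w <= length(values) || throw(ArgumentError("window must be in 1:length(values)"))
    return [f(view(values, (i - w + 1):i)) for i in w:length(values)]
end

# A hand-written mean keeps the sketch free of extra dependencies.
window_mean(x) = sum(x) / length(x)

running = moving_values(window_mean, [1.0, 2.0, 3.0, 4.0], 2)
```

Because the callable receives plain values, any reducer such as `sum` or `maximum` can be passed unchanged; a timestamp-aware variant would need a different input type.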
a.1, a.2
> multiple columns forms the time index
> Do you mean that there can be two columns with different precision for time, like this link says https://docs.julialang.org/en/v1/stdlib/Dates/#Dates.Date ?

I want to cover both of these cases (different precision, and multiple timestamps as an interval) in the type parameter design.
And I think these two cases can be distinguished without problems: `Tuple{Date,Time}` vs `Tuple{DateTime,DateTime}`, for example.
1.b Interface
> I think, first, we should outline, what is the main difference between TimeSeries and table-data packages, like DataFrames.jl or IndexedTables.jl, otherwise there is no need in TimeSeries.jl itself. Expecially, if we decide to support heterogenous column types and stick to table interface. Or, if there are some minor differences, we can just rewrite TimeSeries.jl as a thin wrapper around those packages.

This is a hard question. The properties of a time index break all the rules, which I think makes wrapping those packages unprofitable. So in the beginning, I prefer not to depend on them. I am keeping an open mind on this issue. After we have explored enough use cases, maybe we can leverage those packages for some of them.

> Heterogenous column types - yes or not? If we treat timeseries as any data, that is bound to a timestamp, then we should add columns of different types. On the other side, using matrix we have convenient matrix operations over homogenous channel groups. Although we can always wrap a whole matrix as a single column (at the cost of not operate with individual columns within matrix).
> Static or dynamic columns? Can we add columns to an existing timeseries, or column number and types (schema) are known in parametric types at compile-time, like IndexedTables.jl does?
> Should we always store timestamps in sorted order? So, if we insert a new row, or change a timestamp, we should always re-sort rows? Should they be unique or not? Should we restrict join only by timestamp?

Well, in short, my answer is that we can implement all styles if needed. There are 8 combinations of these properties:

Combination | Unsorted :green_circle: / Sorted :red_circle: | Hetero :green_circle: / Homo :red_circle: | Dynamic :green_circle: / Static :red_circle: | Comment |
---|---|---|---|---|
1 | :green_circle: | :green_circle: | :green_circle: | I think this is fulfilled by `DataFrames`. It's the most flexible data structure. I think we don't need to create another `DataFrame`. |
2 | :red_circle: | :green_circle: | :green_circle: | Since the timestamps are sorted, we can provide search/filter and other operations on them with better performance. And it seems we have an urgent need for it. |
3 | :green_circle: | :red_circle: | :green_circle: | |
4 | :red_circle: | :red_circle: | :green_circle: | |
5 | :green_circle: | :green_circle: | :red_circle: | Is this just `IndexedTable`? |
6 | :red_circle: | :green_circle: | :red_circle: | |
7 | :green_circle: | :red_circle: | :red_circle: | |
8 | :red_circle: | :red_circle: | :red_circle: | Actually, this is the current struct `TimeArray`; we have already implemented it :joy_cat: |

I cannot find cases where a user needs to manipulate an unsorted time series. So combinations 2, 4, 6, and 8 are kept, and I think combination 6 won't have enough performance benefit. Combinations 4 and 8 might have a benefit for row operations if the underlying structure is a `Matrix`, but this claim needs evidence from real use cases.
So I will vote for combination 2 as the top priority, then implementing combinations 4 and 8 if we still have enough energy.
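The performance argument for sorted timestamps can be sketched with Base's binary search functions. This is only an illustration under the assumption that timestamps are stored in a plain sorted vector; `values_between` is a hypothetical name, not an existing TimeSeries.jl function.

```julia
using Dates

# Return the values whose timestamps fall in the closed interval [lo, hi].
# Assumes `timestamps` is sorted ascending, so both bounds are found by binary search
# instead of scanning every row.
function values_between(timestamps::AbstractVector, values::AbstractVector, lo, hi)
    first_idx = searchsortedfirst(timestamps, lo)
    last_idx = searchsortedlast(timestamps, hi)
    return values[first_idx:last_idx]
end

ts = [DateTime(2021, 1, 1, h) for h in 0:5]   # hourly timestamps, already sorted
vals = collect(10:10:60)
selected = values_between(ts, vals, DateTime(2021, 1, 1, 2), DateTime(2021, 1, 1, 4))
```

If no timestamp lies in the interval, the index range is empty and an empty vector is returned.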
> Tabular operations (link1, link2), what types they do return? For example, should getindex return a row object, or just a tuple? Should select return a single-column table, or some column object, or a vector?

I managed to list them in the table in part a.
For `getindex`: if the user indexes a single row, a row object is better, since the column information is useful. If the user indexes a range of rows, just return a time series object.
For `select`, `transform`, or `filter`, I think they should return an object of the same type as the input.
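A minimal sketch of that `getindex` behaviour, assuming a toy single-column type. `MiniSeries` is hypothetical, and using a `NamedTuple` as the row object is just one possible choice; the thread leaves the actual row type open.

```julia
using Dates

# Hypothetical toy series: one timestamp vector and one value vector.
struct MiniSeries{T,V}
    timestamps::Vector{T}
    values::Vector{V}
end

# A single integer index yields a row that keeps its timestamp.
Base.getindex(s::MiniSeries, i::Integer) = (timestamp = s.timestamps[i], value = s.values[i])

# A range of rows yields another series of the same type.
Base.getindex(s::MiniSeries, r::AbstractUnitRange{<:Integer}) =
    MiniSeries(s.timestamps[r], s.values[r])

s = MiniSeries([Date(2021, 1, d) for d in 1:4], [1.0, 2.0, 3.0, 4.0])
row = s[2]     # row object
sub = s[2:3]   # still a MiniSeries
```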
> Should we always store timestamps in sorted order? So, if we insert a new row, or change a timestamp, we should always re-sort rows? Should they be unique or not? Should we restrict join only by timestamp?

I want timestamps sorted all the time. Timestamps don't need to be unique, and the order among records that share the same timestamp is defined by the user. We should make sure the functions provided by this package do not change that relative order. I have some sensor-generated data where records share the same timestamp, since the timestamp precision isn't sufficient.
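One way to keep both properties (sorted, and stable among equal timestamps) when inserting a record is to place it after the last existing record with the same timestamp. The sketch below assumes plain vectors as storage; `insert_sorted!` is a hypothetical helper name.

```julia
using Dates

# Insert a record while keeping `timestamps` sorted. A new record whose timestamp
# already exists is placed after the existing ones, so earlier records keep their order.
function insert_sorted!(timestamps::Vector, values::Vector, t, v)
    i = searchsortedlast(timestamps, t) + 1
    insert!(timestamps, i, t)
    insert!(values, i, v)
    return i
end

ts = [Time(10, 0), Time(10, 1), Time(10, 1), Time(10, 2)]
vals = ["a", "b", "c", "d"]
pos = insert_sorted!(ts, vals, Time(10, 1), "new")
```

After the call, `"new"` sits directly after `"b"` and `"c"`, which share its timestamp, and before `"d"`.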
About `join`: I want the time index to be required, with non-index columns optionally accepted as well.
I do have two datasets that need to be joined on date and username. (I did it in DataFrames and converted the result to a `TimeArray` later.)
> Should timestamps be always part of table data, or not? Like, should it always be the first column? Should it ever be in column list, or a "hidden" column like index, or both options?

A hidden column is fine for me. I want the user to always be able to set/switch the index. Once set, that column will be the first column in the presentation (via `print`), and the Tables.jl integration will expose that index as the first column.
> Should we always add timestamp to any return type (subtable / row / column), even if it is not queried from table? What is a row, timestamp + row data? What is a column, timestamp + column data?

Well, this is quite a complex question, since I encountered both situations in a single project: (1) I want the raw row values without the timestamp, so I can feed them into functions from `Base`; (2) I want row values + column info + timestamp, so I can easily get a specific column.
I think this is the same dilemma as described for the `select` function in part a. The custom function's input type problem is the same as your "return type problem".
I don't have an elegant approach at the moment; I am still thinking about it. I'll write down some ideas here.
(i) Determine which is the common use case and make it the default. Then provide a variant function to support the other. For example: `filter` takes the raw row value as input, and `filter_ts` takes the row object.
(ii) Always return the row object. Then provide a `convert`/`vec` to turn it into a plain `Array`, but I suspect this will hurt performance?
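Option (i) could look roughly like the following sketch. `TinySeries` is a hypothetical toy type, and `filter_ts` is only the name floated above, not an existing function; the row passed to `filter_ts` is modelled here as a `NamedTuple`, which is an assumption.

```julia
using Dates

# Hypothetical toy series used only for this illustration.
struct TinySeries{T,V}
    timestamps::Vector{T}
    values::Vector{V}
end

# Default variant: the predicate sees the raw value only, so Base predicates work as-is.
function Base.filter(f, s::TinySeries)
    keep = findall(f, s.values)
    return TinySeries(s.timestamps[keep], s.values[keep])
end

# Timestamp-aware variant: the predicate sees a (timestamp, value) row.
function filter_ts(f, s::TinySeries)
    keep = [i for i in eachindex(s.values) if f((timestamp = s.timestamps[i], value = s.values[i]))]
    return TinySeries(s.timestamps[keep], s.values[keep])
end

s = TinySeries([Date(2021, 1, d) for d in 1:4], [1.0, NaN, 0.0, 4.0])
clean = filter(!isnan, s)                                   # value-only predicate
late = filter_ts(r -> r.timestamp >= Date(2021, 1, 3), s)   # needs the timestamp
```

Both variants return the same type as their input, in line with the return-type rule proposed earlier in this comment.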
> What if a table have many time columns, can we switch, what columns to use as timestamps?

Yes, the type with dynamic columns should support this functionality.
I think we should divide our methods into three distinct parts, with increasing functionality: A. Only a timestamps vector with no data bound to it. B. Timestamps + a data vector (a single column). C. Operations on tables.
Here are some (incomplete) considerations about these three parts:
A. Timestamps
I agree that a time index differs from an integer index. One of the key differences is that it has a global "address space". An integer index exists only within a specific collection, and can be changed or dropped when querying a subcollection or element. But a time index refers to some "address in time", not to a certain collection, and cannot be dropped by default.
Also, while an index is an integer, a time index is continuous, and it can have different precision levels (days, minutes and so on), with different rounding and comparison behaviour between two time values of different precision.
A.1 Time types
What time types can be used:
a) `DateTime` instance - most accurate
b) `Period` - for time intervals, or time relative to some reference point
c) The sum of two different time Periods can produce another type, `CompoundPeriod`
d) a Number of seconds/minutes/etc.
e) a Unitful.jl time unit - used in AxisArrays.jl; I'm not quite sure if it works with built-in types and if we have to do something to support it.
f) Two values of the above types?
Possible operations:
A.2. Vector of timestamps, some kind of "point process".
I agree that timestamps should be sorted and not unique.
But if they can repeat, we can get several elements instead of one when requesting only one time value. Also, should `join` repeat rows if it joins one row of table 1 with several rows from table 2?
It can be:
1) Arbitrary
2) Discrete (an ADC integrates some continuous amplitude within fixed timesteps)
2a) Regular (with possible missing intervals)
2b) Irregular
For the discrete case, should we think that each timestamp has non-zero length equal to the timestep?
I have some draft examples of how I use a combination of types from A.1 (a), (b), (d) as timestamps, and transformations between them: https://gist.github.com/sairus7/7a3f2ea6d3e0c34b4ea973d3b80105e8 Here I work with both absolute and relative times, but they are stored as relative indices. And for some operations I want to retrieve the index itself, not a time type. There is a series of transforms in both directions: integer index <-> number of milliseconds <-> Period <-> DateTime (sum of a period and a start time).
Possible operations:
B. Time vector / column
Here we have just two synced vectors: the timestamp vector from A.2 and a data vector. I will call it a column here.
Operations:
C. Table - seems like this is just a set of columns from B which have the same timestamps vector?
Regarding the table of combinations: `IndexedTable` has a sorted primary key, so it fits option 6 too. Also, we missed AxisArrays.jl for option 8 (https://github.com/mbauman/Signals.jl was deprecated in favor of AxisArrays.jl); it seems similar to TimeSeries. AxisArrays has interesting functionality with intervals:
```julia
using AxisArrays, Dates
t = Dates.now()
timerange = t : Millisecond(5) : t + Millisecond(45)   # 10 timestamps, 5 ms apart
data = reshape(1:20, :, 2) |> collect                  # 10×2 matrix, one column per channel
a = AxisArray(data; time = (timerange), chan = ([:c1, :c2]))
a[time = t, chan = :c1]                                # a single sample
a[time = t..t+Millisecond(10), chan = :c2]             # samples of c2 within a time interval
```
I agree that we can start with option 2 from the table.
I agree we should return time series objects from `getindex`, `select`, `transform`, or `filter`, and a single row object from `getindex` on a single row. For a row object, maybe we can transform a row to data like this: `data = row[]`. Or it can be wrapped, similar to `CategoricalValue`. The type of the data can be a named tuple, but only if we add static tables, or some function barriers for them.
As for the issues with `transform` or `filter`, something similar to the `enumerate` operator comes to mind:
`for (i, v) in enumerate(x) ... end`
We can start without time in the returned result and see what happens.
I agree on joins on non-index columns. Should `join` work with other table types (e.g. join a TimeSeries with a DataFrame and produce a TimeSeries)?
In general, I don't see any disagreements with your proposals.
The next step is completing the interface specs. I think naming is the most difficult part. Feel free to correct me if my naming is confusing.
(Well, I knew nothing about point processes before you mentioned them; any resources I can consult are appreciated.)

> But if they can repeat, we can get several elements instead of one, when requesting only one time value. Also, should join repeat rows if it joins one row of table 1 with several rows from table 2?

> I have some draft examples of how I use a combination of types from A.1 (a), (b), (d) as timestamps, and transformations between them: https://gist.github.com/sairus7/7a3f2ea6d3e0c34b4ea973d3b80105e8

This example is a good starting point.

> Here we have just two synced vectors - timestamp vector from A.2., and data vector. I will call it here a column.

And I will consider the time vector to be an `AbstractTimeSeries`.

> seems like this is just a set of columns from B, which have the same timestamps vector?

Yes, so how about treating sections B and C as the same? I think there are no operations that differ between B and C.

> Also, we missed AxisArrays.jl for option 8 (https://github.com/mbauman/Signals.jl was deprecated in favor of AxisArrays.jl), seems like it is similar to TimeSeries. AxisArrays has interesting functionality with indervals:

The interval feature looks great. If I understand correctly, that interval data type is provided by IntervalSets.jl, and we can support it.
A side note about precision and rounding, which is closely related to the question from A.2: "should we think that each timestamp has non-zero length equal to timestep?"
Why would we need it? I have been thinking about how to represent time segments (intervals) as timestamps; the main difference is that intervals have an additional "time length" attribute. This makes me think that any timestamp is not a point with zero length, but a time interval of "unit" length. This is similar to the inner representation of a timestamp itself as an integer value (`UTInstant`) of nanoseconds, minutes, days, months, etc., so any floating-point value is truncated to the nearest previous integer.
But AFAIK there are no methods to check that a higher-resolution timestamp lies within a lower-resolution timestamp. More than that, we don't even know the actual resolution, right?
```julia
using Dates
t_month = floor(Dates.now(), Dates.Month)
t_sec = floor(Dates.now(), Dates.Second)
t_sec in t_month == true # method error
```
In this example, `t_sec` should start at the current second it points to and end just before the next second.
`t_month` should start at the first day of the current month and end before the first day of the next month. With this knowledge we can more naturally `join`, `groupby` (or `resample`) two timestamp vectors with different known resolutions.
I'm not sure if we should leave this to the user's knowledge of their data, or decide to add some time-interval operations and check for (or dispatch on) known and unknown time length. But if we do, then we should add some additional timestamp vector types with metadata.
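The containment check that is missing from `Dates` can be sketched by treating a floored timestamp as the start of a half-open interval whose length is its resolution. `covers` is a hypothetical helper, and the resolution has to be passed explicitly, precisely because a `DateTime` does not remember it.

```julia
using Dates

# Hypothetical helper: treat `t_low` as the start of the half-open interval
# [t_low, t_low + resolution) and check whether `t` lies inside it.
covers(t_low::DateTime, resolution::Period, t::DateTime) = t_low <= t < t_low + resolution

t = DateTime(2021, 3, 15, 10, 30, 12)   # fixed instant instead of now(), for reproducibility
t_month = floor(t, Month)               # start of March 2021
t_sec = floor(t, Second)
inside = covers(t_month, Month(1), t_sec)
outside = covers(t_month, Month(1), DateTime(2021, 4, 1))
```

Adding `Month(1)` uses calendar arithmetic, so the interval ends exactly at the first instant of the next month regardless of how many days the month has.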
> I'm not sure if we should leave this to user knowledge of his data, or decide to make some additional time-interval operations and check for (or dispatch on) known and unknown time-length. But if we do, then we should add some additional timestamp vector types with metadata.

I think the "time length" attribute only relates to additional operations. It is only meaningful when doing operations against the time length attribute; we won't `getindex` and inspect a single time length, right?
I would design this feature to mimic the `isless` function in `sort(..., lt=isless)`.
E.g. make a bunch of time-length measurement functions that can be passed to `join`, `groupby`, etc. via a keyword argument. Then we can have a default function as you described previously.
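A rough sketch of that keyword-argument design, in the spirit of `sort(...; lt=isless)`. The function `group_by_time` and its `bucket` keyword are hypothetical names chosen for this example; the default treats every distinct timestamp as its own group.

```julia
using Dates

# Group values by a user-supplied bucketing function, passed as a keyword the way
# `sort` accepts `lt`. The default puts every distinct timestamp in its own bucket.
function group_by_time(timestamps::AbstractVector, values::AbstractVector; bucket = identity)
    groups = Dict{Any,Vector{eltype(values)}}()
    for (t, v) in zip(timestamps, values)
        push!(get!(() -> eltype(values)[], groups, bucket(t)), v)
    end
    return groups
end

ts = [DateTime(2021, 1, 1, 10, 5), DateTime(2021, 1, 1, 10, 40), DateTime(2021, 1, 1, 11, 0)]
hourly = group_by_time(ts, [1, 2, 3]; bucket = t -> floor(t, Hour))
exact = group_by_time(ts, [1, 2, 3])
```

The user's knowledge of the data's resolution enters only through the `bucket` function, so no extra metadata has to be stored on the timestamp vector.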
I googled around this topic a bit. Maybe we can borrow some operation designs from here: https://www.codeproject.com/Articles/168662/Time-Period-Library-for-NET

> I'm not sure if we should leave this to user knowledge of his data, ...

So, yes, we should leave it to user knowledge, but with a common assumption as the default.
Sorry, I'm late to the party, but I have a couple of things to add.

> segments - irregularly samples timeseries with two timestamp values (start, stop) for elements that have some extent in time.

I've met this situation too, but there is an easy(?) workaround; at least it worked for me. Since `TimeArray` accepts any `TimeType`, the user can define
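```julia
using Dates

# A timestamp that also carries the duration of the bar it starts.
struct DateTimeBar{T <: TimeType, L <: Real} <: TimeType
    ts::T
    duration::L
end
duration(x::DateTimeBar) = x.duration
# Bars are ordered by their start time only.
Base.isless(x1::DateTimeBar, x2::DateTimeBar) = isless(x1.ts, x2.ts)
```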
and generate a vector of "bar" times. There is no need to create an extra column or do anything like that.
Something like that can work with `counting` times
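```julia
using Dates

struct CountingDateTime{T <: TimeType, L <: Period} <: TimeType
    start::T
    offset::L
    counter::Int
end
# Materialize the absolute time as start + counter * offset.
Dates.DateTime(x::CountingDateTime) = DateTime(x.start + x.counter * x.offset)
```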
so specialized functions can be written if needed (maybe even in another package?) to work with such a type.
Exploring this idea further, one can define
```julia
using Dates

struct DateTimeWithKeys{T <: TimeType, S <: Tuple} <: TimeType
    ts::T
    keys::S
end
```
and generate time column with embedded keys, for example, if you gather signal from different sources, you can have something like
```julia
dts = [DateTimeWithKeys(Date("2021-01-01"), ("Device A", )),
       DateTimeWithKeys(Date("2021-01-01"), ("Device B", )),
       DateTimeWithKeys(Date("2021-01-02"), ("Device A", )),
       DateTimeWithKeys(Date("2021-01-03"), ("Device C", ))]
```
and "keys" can be used for filtering, joining, sorting, etc. This idea is actually implemented in Google's Bigtable design: http://static.googleusercontent.com/media/research.google.com/en/us/archive/bigtable-osdi06.pdf
Regarding row values that should be returned when a table is indexed, maybe it makes sense to utilize JuliaQuant/Timestamps.jl? I revived it recently after a few years of hibernation, and one of the ideas was to have a useful row-level timestamp data representation. It can answer questions like "what to return: value or timestamp + value", since you can return a `Timestamp` and provide utilities to work with it conveniently. The package is in its infancy now, so it is easy to adapt it to the needs of TimeSeries.
> Regarding Row values which should be returned when a table is indexed, maybe it makes sense to utilize JuliaQuant/Timestamps.jl?

Oh, that may be a good option. After I finish the interface spec in this discussion thread, we can check whether `Timestamps.jl` fits or not.
@sairus7 I think the first draft of the interface spec is finished: https://github.com/JuliaStats/TimeSeries.jl/issues/482#issuecomment-792278887.
Could you review it?
I'm looking into using the methods in this package in DimensionalData.jl/GeoData.jl when a time dimension is present, as in AxisArrays.jl. Often we have multidimensional arrays where time is one of the dimensions.
GeoData.jl also defines `GeoSeries`, where separate (often disk-based) objects are organised in a time series (and will usually load as an `AbstractArray`). It would be good to be able to apply the functions here over these multi-array series.
So, to add to this functionality review, it would be useful if this package generalised to working with arbitrary-dimension arrays organised in a time-series vector, somewhat like how Interpolations.jl does that.
Has any kind of `AbstractTimeSeries` interface been implemented? I ask because I'm interested in writing up an autocovariance estimation interface for StatsBase. I think it'd be very nice to have some way to wrap an arbitrary table or array in a time series and then have `StatsBase` functions like `sem` work on it automatically.
Hi! I have some experience working with time series (from medical sensors), and I was thinking of using TimeSeries.jl for my projects. For now I have some sort of review of this package, outlining choices that look strange to me, at least from the docs, along with proposals from my point of view. Maybe the authors will find it helpful.
I. `AbstractTimeSeries` is absent from the docs - is this some kind of common interface for different timeseries types? If so, you should add an example showing which methods I should implement to support a custom timeseries type.
II. Heterogeneous series (tables) are dropped, from docs:
This is a huge limitation if one needs timeseries with complex information, stored as a vector of structures or a named tuple of columns of different types (see StructArrays.jl).
Maybe there should be a different TimeTable type with heterogeneous columns (similar to DataFrame), and TimeArray for a single column type, sharing the same timestamps from the parent table?
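To make the TimeTable idea concrete, here is a minimal sketch, assuming a plain `NamedTuple` of columns behind one shared timestamp vector. The `TimeTable` type and the `column` helper are hypothetical names for illustration only; they are not part of TimeSeries.jl.

```julia
using Dates

# Hypothetical type for illustration; not part of TimeSeries.jl.
# One shared timestamp vector plus a NamedTuple of columns that may differ in element type.
struct TimeTable{T<:TimeType,C<:NamedTuple}
    timestamp::Vector{T}
    columns::C
    function TimeTable(timestamp::Vector{T}, columns::C) where {T<:TimeType,C<:NamedTuple}
        all(col -> length(col) == length(timestamp), values(columns)) ||
            throw(DimensionMismatch("every column must match the timestamp length"))
        new{T,C}(timestamp, columns)
    end
end

# Extract one column together with the parent's timestamps (shared, not copied).
column(tt::TimeTable, name::Symbol) =
    (timestamp = tt.timestamp, values = getfield(tt.columns, name))

ts = [DateTime(2021, 3, 1, 12, 0, s) for s in 0:2]
tt = TimeTable(ts, (heart_rate = [61.5, 62.0, 63.2], label = ["rest", "rest", "walk"]))

hr = column(tt, :heart_rate)   # Float64 values, same timestamp vector as tt
```

Each column keeps its own concrete element type, so a single-column view stays type-stable even though the table as a whole is heterogeneous.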
More than that, individual columns can be a custom AbstractVector with some metadata for exotic element types. For example, if elements are encoded and metadata is needed to decode them on `getindex`:
III. There is no separate implementation for timeseries with a regular sample rate, which can be constrained to operations that produce uniform sampling (similar to SampledSignals.jl). This type does not need to store a materialized `timestamps` vector at all, since time can be calculated from `index`, `startdate` and `samplerate` (I call this a "time grid", which provides an `index2time` and `time2index` pair of functions). A timeseries remains uniform unless you take irregular / arbitrary samples from it; the result is then converted to a common (non-uniform) timeseries with a timestamps vector in it.
IV. There are no timeseries with several timestamp columns. In my practice, I always have three different timeseries types:
1) series - regularly sampled timeseries
2) events - irregularly sampled timeseries with one timestamp value
3) segments - irregularly sampled timeseries with two timestamp values (start, stop) for elements that have some extent in time.
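The regularly sampled case (the "time grid" from section III) could be sketched roughly as follows. The `TimeGrid` struct and the `step_ms` helper are hypothetical, and the sketch assumes the sample rate divides 1000 so that the step is a whole number of milliseconds.

```julia
using Dates

# Hypothetical "time grid": timestamps are derived from the index, never stored.
struct TimeGrid
    startdate::DateTime
    samplerate::Int   # samples per second; assumed to divide 1000 evenly
end

# Sampling step as a Millisecond period.
step_ms(g::TimeGrid) = Millisecond(1000 ÷ g.samplerate)

# 1-based index -> timestamp.
index2time(g::TimeGrid, i::Integer) = g.startdate + (i - 1) * step_ms(g)

# Timestamp -> 1-based index of the latest sample at or before t.
time2index(g::TimeGrid, t::DateTime) =
    fld(Dates.value(t - g.startdate), Dates.value(step_ms(g))) + 1

grid = TimeGrid(DateTime(2021, 3, 1), 250)   # 250 Hz, i.e. a 4 ms step
t = index2time(grid, 1001)                   # 1000 steps after the start
```

With this layout, a range of indices maps to a range of times without allocating anything; only an irregular selection would need to materialize a timestamps vector.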
There are several special cases for (3) with regard to indexing (what to do if I request a time point inside a segment, or a time interval that partially overlaps with segments at the edges).
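One way to make these special cases explicit is to give each policy its own function. This is only a sketch: `Segment`, `at`, `overlapping` and `within` are hypothetical names, and treating both boundaries as inclusive is an assumption.

```julia
using Dates

# Hypothetical segment element: a value that spans [start, stop] in time.
struct Segment{V}
    start::DateTime
    stop::DateTime
    value::V
end

# Point query: segments that contain time t (boundaries included).
at(segs, t::DateTime) = filter(s -> s.start <= t <= s.stop, segs)

# Interval query, policy 1: every segment that overlaps [a, b], even partially.
overlapping(segs, a::DateTime, b::DateTime) =
    filter(s -> s.start <= b && s.stop >= a, segs)

# Interval query, policy 2: only segments that lie entirely inside [a, b].
within(segs, a::DateTime, b::DateTime) =
    filter(s -> a <= s.start && s.stop <= b, segs)

d(h, m) = DateTime(2021, 3, 1, h, m)
segs = [Segment(d(9, 0), d(9, 10), "sleep"),
        Segment(d(9, 20), d(9, 30), "walk"),
        Segment(d(9, 40), d(9, 50), "run")]

at(segs, d(9, 25))                     # the "walk" segment
overlapping(segs, d(9, 5), d(9, 25))   # "sleep" is cut at the left edge but still included
within(segs, d(9, 5), d(9, 35))        # only "walk" lies fully inside
```

A third policy, clipping edge segments to the requested interval, is also possible; which one should be the default for plain indexing is exactly the open question here.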
Maybe there can be even more exotic (or common) timeseries with more than two timestamps (each row is itself a repetition of some complex process in time with many "phases"), where you should explicitly choose which timestamp column you want to index by. But I would not complicate it that far.
V. Row indexing. You can index rows by:
What is missing:
`time` and `index` positional arguments for different combinations).
VI. The "Splitting by condition" section has two different sets of functions:
`where` in tables, but for timeseries (`when`, `findwhen`, `findall`),
`from`, `to`).
VII. Maybe there should be some convention between functions that take and return timeseries, and functions that return standard vector types:
`findwhen` vs `findall`;
Also, there may be some methods to toggle between the timeseries type and the underlying table type, or a standard array / vector of tuples. This is similar to how DataFrames uses `Tables.columntable`: they use it to toggle between type-stable and compile-friendly cases.
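A rough sketch of such a toggle, written without Tables.jl so that it stays self-contained. `MiniSeries`, `ascolumns` and `asrows` are hypothetical names; in a real implementation the same role would presumably be played by the Tables.jl interface.

```julia
using Dates

# Hypothetical container used only for this sketch.
struct MiniSeries{C<:NamedTuple}
    timestamp::Vector{DateTime}
    columns::C
end

# Column-oriented form: a NamedTuple of vectors, one per column (no data is copied).
ascolumns(ms::MiniSeries) = merge((timestamp = ms.timestamp,), ms.columns)

# Row-oriented form: a Vector of NamedTuples, one per timestamp.
function asrows(ms::MiniSeries)
    cols = ascolumns(ms)
    colnames = keys(cols)
    [NamedTuple{colnames}(map(c -> c[i], values(cols))) for i in eachindex(ms.timestamp)]
end

ms = MiniSeries([DateTime(2021, 3, 1), DateTime(2021, 3, 2)],
                (price = [10.0, 10.5], flag = [true, false]))

cols = ascolumns(ms)   # (timestamp = [...], price = [...], flag = [...])
rows = asrows(ms)      # one NamedTuple per row
```

The column form keeps concrete element types per column, while the row form is convenient for passing single records to ordinary functions.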
VIII. Operation on single columns - or whole timeseries
This is a very tricky part, because there is an implicit inner join, and all columns should be the same numeric type. So maybe it should be applied only to a single column, or a single column could be modified this way in place? This is also about heterogeneity, as in section II above.
`diff`, `percentchange`, `moving`, `upto` with similar functions from any other package.
`basecall` looks strange to me - what if I want to run a function that is not from Base, and run it on a single column, or a set of selected columns?
IX. Combine methods
`merge` naming instead of more common `join`?
`collapse`: AFAIK this is called `decimation` or `resampling` with another samplerate or time intervals - not only `day`, `week`, etc. Maybe even a vector of custom intervals. And there should be any arbitrary function that can `reduce` all elements that fall within each time interval (for example, you can get a time distribution if you count the number of elements over fixed time intervals).
X. Customize TimeArray printing
Can I choose a time string format to show, or is it chosen automatically based on - what? It would be nice to have examples for high-frequency timestamps in units of milliseconds.
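For the generalised `collapse` from section IX, a minimal sketch could bin samples by an arbitrary fixed period and apply any reducing function to each bin. `resample` is a hypothetical helper, not a TimeSeries.jl function; it relies only on `floor` for `DateTime` from the `Dates` standard library and uses millisecond-resolution timestamps.

```julia
using Dates

# Hypothetical helper: group samples into fixed-width time bins and reduce each bin with f.
# Returns the start of every non-empty bin and the reduced value for that bin.
function resample(f, timestamps, vals, width::Period)
    bins = [floor(t, width) for t in timestamps]   # bin start for every sample
    starts = unique(bins)                          # non-empty bins, in order of first appearance
    return starts, [f(vals[findall(==(b), bins)]) for b in starts]
end

t0 = DateTime(2021, 3, 1, 12, 0, 0)
ts = [t0 + Millisecond(ms) for ms in (0, 150, 900, 1200, 2500)]
vals = [1.0, 2.0, 3.0, 4.0, 5.0]

# Time distribution: number of samples per one-second interval.
starts, counts = resample(length, ts, vals, Second(1))

# Any other reducer works the same way, e.g. the mean of each bin.
_, means = resample(x -> sum(x) / length(x), ts, vals, Second(1))
```

Supporting a vector of custom intervals would only change how `bins` is computed; the reduction step stays the same.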