Closed milktrader closed 10 years ago
My current thinking:
1) Combine TimeSeries and TimeModels into a single package
2) Adopt the `SeriesPair` immutable type in the un-registered Series.jl package as the data structure type
3) Offer support for `NA`s later in the development cycle, and use `NaN` initially (how important are `NA`s?)
4) Keep TimeSeries name for package
5) `SeriesPair` supports an index of any type
6) Provide a separate package to convert TimeSeries to DataFrames/DataArrays data structures.
Hey,
I'm about to come into a bunch of free time and would be interested on working on this (probably starting in 1-2 weeks). I do want to review the source code for zoo/xts in R and Series in Pandas for a bit though. Here are my thoughts on the roadmap:
Like I said, I'd like to dig deeper into this starting in the next week or so, but these are my initial impressions.
First, let me say that I think we need more than just one single implementation of time series data. In my opinion, at the very core we should have a package that suits a quite general set of needs and fits neatly into the framework provided by already existing packages. Maybe this place should be taken by your current design of `TimeSeries`, where time information is kept in the first column of a `DataFrame`. Personally, I do not like this design very much, as I don't think it is robust enough, and you can leverage much more power if you take the characteristics of time series into account at a more fundamental level. Nevertheless, I think this is the solution that most Julia users would like to see, since I am sure that the `DataFrames` package will play a fundamental role in any mainstream data analysis.
1) I think that a package comprising fundamental time series operations AND models simultaneously may soon lead to a bloated package that is hard to maintain. After all, each field has its own models. Nevertheless, with multiple implementations of time series data, you would still need some kind of common interface, so that you do not need to implement each model for each data type.
2) Yes, we definitely need a type specifically constructed for time series data, but maybe not in the basic package.
3) In the very general version: yes! This will definitely be a feature that most mainstream users will like to see.
4) This name should be reserved for the most general package that provides a clear connection to `DataFrames`.
5) Good idea
6) Of course, we will need conversion between different types of time series implementations.
@carljv cool that you have some time to think about this. I'm most familiar with `xts`/`zoo`, and have played around with some Pandas data structures. An `xts`/`zoo` object is an R matrix whose index is a valid Date type. It's actually quite simple and it's fast. `xts` author @lemnica has recently taken a hiatus from R development and is working on secret drones or something like that. He (Jeff Ryan) would likely be happy to chime in, but I think he's preoccupied as of late.
`xts` extends `zoo`, which was implemented to get away from the slow `data.frame` structure in R and replace it with matrices indexed by a date type. `zoo` author is European (Swiss?) professor Achim Zeileis, who is also very approachable.
My Series.jl package is a sort of awkward attempt to approach this implementation a bit differently. Here is the type:
```julia
immutable SeriesPair{T, V} <: AbstractSeriesPair
    index::T
    value::V
end
```
So essentially it represents a row of data. Methods are provided to work with an array of these instances: sorting, working with time indexes, performing transformations (log returns, etc.).
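The array-of-pairs idea can be sketched in a language-agnostic way. The following is a hedged Python illustration (the names mirror, but are not, the actual Series.jl API): an immutable pair of an orderable index and a value, with a series represented as a sorted list of such pairs.

```python
import math
from typing import NamedTuple

# Hypothetical sketch (not the actual Series.jl API): an immutable
# SeriesPair-like record; a series is a sorted list of these pairs.
class SeriesPair(NamedTuple):
    index: object   # any orderable index type (a date, an int, ...)
    value: float

def sort_series(pairs):
    # Keep the series ordered by its index, as Series.jl does for dates.
    return sorted(pairs, key=lambda p: p.index)

def log_returns(pairs):
    # log(v_t / v_{t-1}) for each consecutive pair; drops the first row.
    return [SeriesPair(b.index, math.log(b.value / a.value))
            for a, b in zip(pairs, pairs[1:])]

series = sort_series([SeriesPair(2, 110.0), SeriesPair(1, 100.0), SeriesPair(3, 121.0)])
rets = log_returns(series)
```

Because each pair is immutable, a "transformation" such as log returns produces a new series rather than mutating values in place, which matches the immutability demonstrated in the REPL session below.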
To play with it, you can clone the `MarketData` package from the JuliaQuant organization. It requires `Series`, so that also needs to be cloned. The MarketTechnicals package supports DataFrames in the version registered in METADATA, but uses the `Series` data structure on its latest master branch.
```julia
Pkg.clone("https://github.com/milktrader/Series.jl.git")
Pkg.clone("https://github.com/JuliaQuant/MarketData.jl.git")
```
Once you get those installed, you can play around with some 3-year (`cl`) or 65-year (`Cl`) SPX daily closing prices.
```julia
julia> using Series, MarketData

julia> byyear(Cl, 1968) |> x -> bymonth(x,12) |> x -> byday(x,24)
1-element Array{SeriesPair{Date{ISOCalendar},Float64},1}:
 1968-12-24 105.0400

julia> Cl[date(1968,12,24)]
1968-12-24 105.0400

julia> ans.value = 105.00
ERROR: type SeriesPair is immutable
# cannot change the closing price on Dec 24, 1968

julia> mean(value(Cl))
436.97211140781224
# value() simply operates on the value element of the SeriesPairs in the array

julia> maximum(index(Cl))
2013-12-31

julia> Hi - Lo;

julia> ans[12345:12348]
4-element Array{SeriesPair{Date{ISOCalendar},Float64},1}:
 1999-01-25 14.5200
 1999-01-26 19.2700
 1999-01-27 19.7900
 1999-01-28 23.2300
```
And there is more. I haven't updated the README yet (it's admittedly a mess), but I have left a trail in the pull request history.
@carljv I'll attempt to give you access to TimeSeries and TimeModels so you can push up some branches and play around if you like.
As to the name of the package and whether to keep TimeSeries or change it, here is a closed issue for some additional background https://github.com/JuliaLang/METADATA.jl/issues/472
Essentially, Series is a no-go. It connotes something completely different to mathematicians, and we have quite a few in the community. In fact, a new package named PowerSeries has just recently been registered.
The two most reasonable options are TimeSeries or DataSeries.
Prepending with Time suggests that the index is a Date type, to the exclusion of more general indexable types such as Integers. I like that the pandas `Series` data structure has taken the more generalized approach. The R family of time-related packages hasn't. I'm undecided whether this more generalized utility is genuinely useful or just sort of nice to have.
And of course, a time-type data structure doesn't need to be hard-coded to time either. In fact, you may have your own time type that you'd prefer to index with. This opens the door to indexing with integers. I don't think it would be a stretch for someone interested in indexing with Integers to say to themselves, "hey, I think I'll use TimeSeries and simply substitute the time index with integers".
Using DataSeries as a name appears to solve some of these semantic issues. It does suggest a close affinity to the DataFrames/DataArrays data structure, which is where I start to get a little wary. I think the package should stand on its own and have as small a list of packages as possible in the REQUIRE file (as in zero, ideally).
Thoughts on removing the dependency on DataFrames/DataArrays:
The data structure for series data is different enough from the table-centered DataFrames/DataArrays that it should stand alone. This does forfeit the advantage of leveraging existing code, but I feel this distinction is important enough to forgo that benefit.
There is also the important issue of a bloated REQUIRE file. What if I simply want to use TimeSeries but not DataFrames? With the dependency in REQUIRE in place, I'd be required to keep up to date not just with the TimeSeries package, but also with DataFrames and DataArrays. And I would unnecessarily be bringing all that code into my project.
An example of how this affects packages downstream is the MarketTechnicals package, a library of technical analysis methods. Ideally, this package would only use a TimeSeries type for its methods. Sure, if you prefer to use a DataFrame there should certainly be at the very least a `dataframe` branch for the package, but if you are designing a `TradeModels` package that requires MarketTechnicals and have no interest in using a DataFrame, why coerce the requirement?
To convert between a Time/DataSeries data structure and the DataFrames/DataArrays structure, I think a separate package would be the best option. This gives users who want this criss-cross the option, but doesn't force it upon those that don't.
Just thinking aloud here. How about structuring TimeSeries with two separate branches? One branch named `dataframes` that includes methods dispatched on the DataFrames/DataArrays data structure, and another named `dataseries` that includes its own type and methods dispatched on that type?
Whichever branch floats to the top as the most used gets the honor of being named `master`.
@cgroll I share your concern that a DataFrame with a first column Date type is not robust enough for time series. It was an early stage solution. I'm not really convinced that most Julia users who work with time series prefer it either. It's just that there isn't an alternative yet. Once an alternative data structure is available, I suspect that using time series with DataFrames will be a fringe case.
FWIW, when we originally designed the Julia DataFrames, the goal was very specifically not to try to support time series data in the same structure, as Pandas does. So I like the general plan here.
Other thoughts:
- Are you thinking the same or different data structures for regular vs. irregular time series?
- Are there operations frequently performed on time series data that might affect data structure choices, such as frequent inserts/deletes from the middle of the table? We don't support that efficiently in DataFrames, but maybe you should? Note that using a row-oriented structure means that adding columns is inherently slow, and that memory usage and computational performance will be lower than with a columnar structure. Adding rows to the end is faster, though.
- Do you want to include metadata in the structure for metric columns, such as units or allowable aggregations? (Counts aggregate with sum; prices aggregate with mean, presumably.)
- What planning do you want to do for eventual memory-mapped, distributed, or immutable TS structures?
Hey @milktrader, I may be wrong, but I believe your REQUIRE file is meant to hold only the dependencies for the current/master version of your package. Dependencies of older versions can be handled through the METADATA versions folder, where you specify the SHA1 and the specific dependencies for that release version.
@HarlanH currently, I've only braved a single column of data associated with the `SeriesPair` array, similar to the Pandas `Series` approach. More than inserts into the structure, the main operations are alignment on rows and truncation (e.g., dropping the first 9 rows consumed by a 10-period moving average).
I think Datetime will sort out issues related to irregular rows. For example, suppose you have daily data. You'd like to collapse it to weekly data and have the last day of the week as the date element and the highest value during the week as the value element. Further suppose that you have one week that ends on a Thursday, unlike others that end on a Friday. The `collapse` method in Series.jl will gladly take Fridays when available and Thursdays if needed. I have set up tests for this, but as I'm typing I'm not sure I've tested that scenario.
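The collapse-to-weekly behavior described above can be sketched generically. Here is a hedged Python illustration (the names are illustrative, not the Series.jl API): group daily rows by ISO week, then keep the last available date in each week — a Friday when present, otherwise the last trading day that week.

```python
import datetime
from itertools import groupby

# Illustrative sketch of the collapse idea: group sorted (date, value)
# rows by ISO (year, week), take the last observed date per week as the
# index and an aggregate (here the weekly high) as the value.
def collapse_weekly(rows, agg=max):
    rows = sorted(rows, key=lambda r: r[0])
    out = []
    for _, grp in groupby(rows, key=lambda r: r[0].isocalendar()[:2]):
        grp = list(grp)
        last_date = grp[-1][0]          # Friday if present, else e.g. Thursday
        out.append((last_date, agg(v for _, v in grp)))
    return out

# A short week ending Thursday (no Friday row), then a full Mon-Fri week:
daily = [(datetime.date(2014, 1, 20 + i), 100.0 + i) for i in range(4)]   # Mon-Thu
daily += [(datetime.date(2014, 1, 27 + i), 110.0 + i) for i in range(5)]  # Mon-Fri
weekly = collapse_weekly(daily)
```

The first collapsed row lands on Thursday (the week's last available day) while the second lands on Friday, matching the irregular-week scenario in the comment above.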
@karbarcca yes indeed, I've conflated REQUIRE with calls to `using`. It's more of an issue that you need to call `using DataFrames` in the package's main file to be able to write methods dispatched on DataFrames, etc.
```julia
julia> using Series, MarketData

julia> clw = collapse(cl, last, period=week);

julia> sum([dayofweek(clw[d].index) for d in 1:length(clw)] .== 4)
6
```
i.e., there are 6 rows in `clw` whose date element is a Thursday (the code is a bit compact).
Here, this might be better:
```julia
julia> clw[dayofweek(index(clw)) .== 4]
6-element Array{SeriesPair{Date{ISOCalendar},Float64},1}:
 1980-04-03 102.1500
 1980-07-03 117.4600
 1981-04-16 134.7000
 1981-07-02 128.6400
 1981-12-24 122.5400
 1981-12-31 122.5500
```
I should note that the most ideal TimeSeries type would be a Julia array whose index is not an Integer, but rather a time type. The chances that Base would do this are likely near zero (though I haven't posited the idea yet). For this to happen in Base, there would either need to be a backdoor to modify the row index type or a new TimeArray type.
Theoretically one could take the C code that Array is written in and modify it to accept a time type as an alternate index for rows. I'm fairly certain this is what `zoo` did in R.
I'm not sure how this would work in a package. Now that Datetime is getting integrated into Base, it might be worth an effort to explore this in a branch, likely calling it `timearray`.
From a data structures and performance point of view, I'm not sure that this is quite right. I actually suspect that something closer to a B+ tree might be ideal for your needs. They're easy/fast to keep sorted as you insert/delete rows, they're very easy to iterate through, they can be merged faster than arrays, and the key can be anything orderable. Many database systems use B tree variants for indexes.
I've thought about using B+ trees for persistent DataFrames, with the key just being row number, but obviously haven't implemented anything.
Yes, I need to do more than a wiki review of this topic. It sounds very interesting. Any texts that you would recommend?
I'm not up-to-date enough on this to recommend texts, sorry!
Thinking through something like an AbstractTimeSeries and its possible implementations might be a helpful process. Then you can retrofit your existing work into that framework and see how it feels, without blocking other implementations that may be better for various use-cases.
Also. After writing up some sort of AbstractTimeSeries spec, and analyzing which operations are common and need to be fast (and which operations are rare and can be slow), it might be worth asking on julia-dev for help deciding on implementation details. There are definitely people there who are much better at that sort of thing than any of us contributing to this issue...
I've pushed the current Series.jl repo to the `seriespair` branch of TimeSeries, for anyone interested in checking out the functionality with a branch checkout versus a `Pkg.clone`.
If you want to index arrays using objects of an arbitrary type, then I think what you want is a `NamedArray` or an `AssociativeArray`. See https://github.com/davidavdav/NamedArrays (instead of strings, "names" would be dates). But as @HarlanH said, I don't think this is the best design at all (except maybe in some specific cases).
I can't really keep up with this conversation (so feel free to ignore what I'm saying), but I would not use a `NamedArray` until you know that you need to hash indices to make things work. I can imagine that there are important special cases of time series data (specifically data that's regularly spaced in time) for which you could use dates as indices and make them super fast, because you'd only need to do some simple arithmetic to translate dates into indices. That's going to be 10x faster than calling a hash function, which does at least 50 CPU instructions per index.
I think @HarlanH's original point was dead on: only worry about desired behaviors in the first pass and then get advice from the broader community about implementation. It's very easy to get the implementation wrong if you think you're going to support behaviors you don't need.
I think I understand why `NamedArrays` is not an ideal solution here.
```julia
julia> using Datetime, NamedArrays

julia> n = NamedArray(rand(2,4));

julia> dates = [today(), today()+days(1)]
2-element Array{Date{ISOCalendar},1}:
 2014-01-27
 2014-01-28

julia> setnames!(n, dates, 1)
[2014-01-27=>1,2014-01-28=>2]
```
Is it because there is a mapping between `1` and an arbitrary object, in this case `2014-01-27`? I think this is expensive, but I need to think about whether it matters a lot or just a little.
My point was much simpler (and potentially way less important): hashing costs a lot more than indexing using something with a trivial transformation into numbers. For a constant-interval time series, you can probably compute indices using something like `index = (CurrentDate - StartDate) / Period`. That's going to be a lot simpler than working with hash functions, which do many more arithmetic operations to produce an index.
The only point I'm sure of is that implementation shouldn't be the main concern: start with behaviors.
I'd like to check off the question of whether TimeSeries and TimeModels should be one or two packages, settling it in favor of separate packages. Early R packages combined the two, but as the language matured that approach was abandoned in favor of bifurcation. @carljv has a good argument for separation at the beginning of this issue.
I'll check it off as settled in a few days, unless there are objections.
Also on the short list of check-offs is the name TimeSeries. I think I'm about the only one who was reticent about this name, and I no longer am. I'll let it percolate for a little while.
Just some notes about naming things and how the pieces fit together (subject to change, of course):
`TimeSeries` is the package.
`TimePair` is an immutable tuple-like type containing a date, a value, and a name.
`TimeVector` is a 1-dimensional array of `TimePair`s that enforces constraints (such as no duplicate dates, identical names across all `TimePair` objects, and others).
`TimeArray` is a type that supports multi-column arrays attached to a date element. `TimeVector`s can be combined into a `TimeArray`.
`TimeFrame` is a type that supports a multi-column DataFrame attached to a date element. Since this type depends upon other packages, it will be a sub-module of `TimeSeries` and must be explicitly imported (i.e., `using TimeSeries.frames`). <- not sure this will be good enough.
I'm going to close this as the checklist at the top is complete. It might be better to continue any ideas, thoughts, suggestions at the TimeSeries issues location. Thanks for all the input!
Great idea to centralize issues around package goals.
`NA` in new type => yes, it should be supported. `Date{ISOCalendar}` for now.