JuliaStats / Roadmap.jl

A centralized location for planning the direction of JuliaStats

TimeSeries and TimeModels #6

Closed milktrader closed 10 years ago

milktrader commented 10 years ago

Great idea to centralize issues around package goals.

milktrader commented 10 years ago

My current thinking:

1) Combine TimeSeries and TimeModels into a single package
2) Adopt the SeriesPair immutable type in the un-registered Series.jl package as the data structure type
3) Offer support for NAs later in the development cycle, and use NaN initially (how important are NAs?)
4) Keep TimeSeries name for package
5) SeriesPair supports an index of any type
6) Provide a separate package to convert TimeSeries to DataFrames/DataArrays data structures.

carljv commented 10 years ago

Hey,

I'm about to come into a bunch of free time and would be interested in working on this (probably starting in 1-2 weeks). I do want to review the source code for zoo/xts in R and Series in Pandas for a bit though. Here are my thoughts on the roadmap:

  1. My inclination would be to keep them separate. I feel like a lot of people do work with time series data without actually doing time series analysis (arima models, spectral decomps, etc.) I feel like adding any complex analytics to the package would sort of privilege whatever you chose over whatever else people might come along with later. Basic stuff like rolling functions, etc, is fine. But I feel like statistical models should live in a separate space.
  2. Right now as I understand, SeriesPair is a generic type of indexed array? Does the following general hierarchy make sense (Indexed Array) :> (Ordered Indexed Array -- i.e. there's a meaningful sort order) :> (Time Series Indexed Array -- the index is a timestamp)? Maybe not, I'm just spitballing.
  3. I'm a little concerned about this forking off too far from DataArray. (So I'm confused about 6 below). Certainly if you want the benefits of alignability, it seems to me that NAs are super important. And it seems restrictive to me to assume that NaN is going to cover it.
  4. Names are hard. I don't mind Series -- it seems general, but it's actually exactly what this is, and I don't know that it conflicts with anything else that might take that name. TimeSeries to me conflates the data and the statistics. Neither are ideal. I'll keep thinking.
  5. This makes sense, but I don't know if there are some invariants you'd need to enforce. Are there restrictions on what methods the index's type should have to support the operations you'd do on the index? E.g., do you need to be able to sort an array of them?
  6. Is conversion just a matter of killing the index (or making it a column?) I'm not 100% sure what you mean by removing dependency. Maybe it makes sense that the indexed data be an array or a DataArray, but I think it makes a lot of sense to have it be a DataArray, and take advantage of what's developed there. (I think users will also expect that functionality they're used to from DataArrays and DataFrames carry over). Not just NA, but also the querying operations (some changes to which are still in the theoretical stage, I believe).
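The hierarchy floated in point 2 could be written down roughly as follows. This is only a sketch in current Julia syntax: every type name here is hypothetical (none of them exist in Series.jl or TimeSeries.jl), and the values are arbitrary.

```julia
using Dates

# Hypothetical names, chosen only to mirror the three levels in point 2.
abstract type AbstractIndexedArray{I,V} end                                          # any index type
abstract type AbstractOrderedIndexedArray{I,V} <: AbstractIndexedArray{I,V} end      # index has a meaningful sort order
abstract type AbstractTimeIndexedArray{I,V} <: AbstractOrderedIndexedArray{I,V} end  # index is a timestamp

# One concrete type at the bottom of the chain.
struct DailySeries{V} <: AbstractTimeIndexedArray{Date,V}
    index::Vector{Date}
    vals::Vector{V}
end

# A method written for the "ordered" level applies to every subtype below it.
earliest(s::AbstractOrderedIndexedArray) = minimum(s.index)

s = DailySeries([Date(1968, 12, 24), Date(1968, 12, 23)], [2.0, 1.0])
earliest(s)
```

Whether the ordered level earns its keep depends on whether any non-time index type actually needs it, which is the open question in point 5.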

Like I said, I'd like to dig deeper into this starting in the next week or so, but these are my initial impressions.

cgroll commented 10 years ago

First, let me say that I think we need more than just one single implementation of time series data. In my opinion, at the very core we should have a package that suits a quite general set of needs, and that fits neatly into the framework provided by already existing packages. Maybe this place should simply be taken by your current design of TimeSeries, where time information is dealt with in the first column of a DataFrame. Personally, I do not like this design very much, as I don't think it is robust enough, and you can gain a lot of power if you take into account the characteristics of time series at a more fundamental level. Nevertheless, I think that this is the solution that most Julia users would like to see, since I am sure that the DataFrames package will play a fundamental role in any mainstream data analysis.

1) I think that a package comprising fundamental time series operations AND models simultaneously may soon lead to a bloated package that is hard to maintain. After all, each field has its own models. Nevertheless, with multiple implementations of time series data, you would still need some kind of common interface, such that you do not need to implement each model for each data type.

2) Yes, we definitely need a type specifically constructed for time series data, but maybe not in the basic package.

3) In the very general version: yes! This will definitely be a feature that most mainstream users will like to see.

4) This name should be reserved for the most general package that provides a clear connection to DataFrames.

5) Good idea

6) Of course, we will need conversion between different types of time series implementations.

milktrader commented 10 years ago

@carljv cool that you have some time to think about this. I'm most familiar with xts/zoo, and have played around with some Pandas data structures. xts/zoo is an R matrix whose index is a valid Date type. It's actually quite simple and it's fast. xts author @lemnica has recently taken a hiatus from R development and is working on secret drones or something like that. He (Jeff Ryan) would likely be happy to chime in but I think he's preoccupied as of late.

xts extends zoo, which was implemented to get away from the slow data.frames structure in R, and replace it with matrices indexed by date type. zoo author is European (Swiss?) professor Achim Zeileis and also very approachable.

My Series.jl package is a sort of awkward attempt to approach this implementation a bit differently. Here is the type:

immutable SeriesPair{T, V} <: AbstractSeriesPair
  index::T
  value::V
end

So essentially it represents a row of data. Methods are provided to work with an array of these instances. Sorting, working with time indexes, performing transformations upon (log returns, etc).
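For readers on a current Julia release, here is a minimal sketch of the same idea. It assumes today's syntax, where `struct` is immutable by default and replaces the `immutable` keyword. The `index` and `value` helpers are hypothetical stand-ins modelled on the calls used in this comment, not the Series.jl implementation, and the data values are arbitrary.

```julia
using Dates

abstract type AbstractSeriesPair end

# One row of data: an index (often a date) paired with a value.
struct SeriesPair{T,V} <: AbstractSeriesPair
    index::T
    value::V
end

# Hypothetical accessors over an array of pairs.
index(sa::Vector{<:SeriesPair}) = [p.index for p in sa]
value(sa::Vector{<:SeriesPair}) = [p.value for p in sa]

# Order pairs by their index so that sort() works on an array of them.
Base.isless(a::SeriesPair, b::SeriesPair) = isless(a.index, b.index)

prs = [SeriesPair(Date(2014, 1, 3), 3.0), SeriesPair(Date(2014, 1, 2), 1.0)]
sorted = sort(prs)
index(sorted)   # dates in ascending order
value(sorted)   # values reordered to match
```

Because the struct is immutable, assigning to `sorted[1].value` raises an error, which is the behaviour the REPL session in this comment demonstrates.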

To play with it, you can clone the MarketData package from JuliaQuant organization. It requires Series so that also needs to be cloned. The MarketTechnicals package supports DataFrames in METADATA, but the Series data structure in the latest master branch.

Pkg.clone("https://github.com/milktrader/Series.jl.git")
Pkg.clone("https://github.com/JuliaQuant/MarketData.jl.git")

Once you get those installed, you can play around with some 3-year (cl) or 65-year (Cl) SPX daily closing prices.

julia> using Series, MarketData

julia> byyear(Cl, 1968) |> x -> bymonth(x,12) |> x -> byday(x,24)
1-element Array{SeriesPair{Date{ISOCalendar},Float64},1}:
 1968-12-24  105.0400

julia> Cl[date(1968,12,24)]
1968-12-24  105.0400

julia> ans.value = 105.00
ERROR: type SeriesPair is immutable

#cannot change the closing price on Dec 24, 1968

julia> mean(value(Cl))
436.97211140781224

#value() simply operates on the value element of the SeriesPairs in the array

julia> maximum(index(Cl))
2013-12-31

julia> Hi - Lo;

julia> ans[12345:12348]
4-element Array{SeriesPair{Date{ISOCalendar},Float64},1}:
 1999-01-25  14.5200
 1999-01-26  19.2700
 1999-01-27  19.7900
 1999-01-28  23.2300

And there is more. I haven't updated the README yet (it's admittedly a mess), but I have left a trail in the pull request history.

@carljv I'll attempt to give you access to TimeSeries and TimeModels so you can push up some branches and play around if you like.

milktrader commented 10 years ago

As to the name of the package and whether to keep TimeSeries or change it, here is a closed issue for some additional background https://github.com/JuliaLang/METADATA.jl/issues/472

Essentially, Series is a no-go. It connotes something completely different to mathematicians, and we have quite a few in the community. In fact, a new package named PowerSeries has just recently been registered.

The two most reasonable options are TimeSeries or DataSeries.

Prepending with Time suggests that the index is a Date type, to the exclusion of more general indexable types such as Integers. I like that the pandas Series data structure has taken the more generalized approach. The R family of time-related packages hasn't done so. I'm undecided if this more generalized utility is nice and useful or just sorta nice to have.

And of course, a time-type data structure doesn't need to be hard-coded to time either. In fact, you may have your own time type that you'd prefer to index with. This opens the door to indexing with integers. I don't think it would be a stretch for someone interested in indexing with Integers to say to themselves, "hey, I think I'll use TimeSeries and simply substitute the time index with integers".

Using DataSeries as a name appears to solve some of these semantic issues. It does suggest a close affinity to the DataFrames/DataArrays data structure, which is where I start to get a little wary. I think the package should stand on its own and have as small a list of packages as possible in the REQUIRE file (as in zero, ideally)

milktrader commented 10 years ago

Thoughts about removing dependency to DataFrames/DataArrays.

The data structure for serialized data is different enough from the table-centered DataFrames/DataArrays that it should stand alone. This does lose the advantage of leveraging existing code, but I feel this distinction is important enough to forego that benefit.

There is also the important issue of a bloated REQUIRE file. What if I simply want to use TimeSeries but not DataFrames? With the dependency in REQUIRE in place, I'd be required to keep up to date not on just the TimeSeries package, but also DataFrames and DataArrays. And I would unnecessarily be bringing all that code into my project.

An example of how this affects packages downstream is the MarketTechnicals package, a library of technical analysis methods. Ideally, this package would only use a TimeSeries type for its methods. Sure, if you prefer to use a DataFrame there should certainly be at the very least a dataframe branch for the package, but if you are designing a TradeModels package that requires MarketTechnicals and have no interest in using a DataFrame, why coerce the requirement?

To convert between a Time/DataSeries data structure and the DataFrames/DataArrays structure, I think a separate package would be the best option. This gives users who want this criss-cross the option, but doesn't force it upon those that don't.

milktrader commented 10 years ago

Just thinking aloud here. How about structuring TimeSeries with two separate branches? One branch named dataframes that includes methods dispatched on the DataFrames/DataArrays data structure and another named dataseries that includes its own type and methods dispatched on that type?

Whichever branch floats to the top as being used the most gets the honor of being named master.

milktrader commented 10 years ago

@cgroll I share your concern that a DataFrame with a first column Date type is not robust enough for time series. It was an early stage solution. I'm not really convinced that most Julia users who work with time series prefer it either. It's just that there isn't an alternative yet. Once an alternative data structure is available, I suspect that using time series with DataFrames will be a fringe case.

HarlanH commented 10 years ago

FWIW, when we originally designed the Julia DataFrames, the goal was very specifically not to try to support time series data in the same structure, as Pandas does. So I like the general plan here.

Other thoughts: Are you thinking the same or different data structures for regular vs irregular time series? Are there operations that are frequently performed on time series data that might affect data structure choices, such as frequent inserts/deletes from the middle of the table? We don't support that efficiently in DataFrames, but maybe you should? Note that using a row-oriented structure means that adding columns is inherently slow, and that memory usage and computational performance will be worse than with a columnar structure. Adding rows to the end is faster, though. Do you want to include metadata in the structure for metric columns, such as units, or allowable aggregations? (counts aggregate with sum; prices aggregate with mean, presumably) What planning do you want to do for eventual memory-mapped, distributed, or immutable TS structures?

quinnj commented 10 years ago

Hey @milktrader, I may be wrong, but I believe your REQUIRE file is meant to keep only those dependencies for the current/master version of your package. Dependencies for older versions of the package can be handled through the METADATA versions folder, where you specify the SHA1 and specific dependencies for that release version.

milktrader commented 10 years ago

@HarlanH currently, I've only braved a single column of data associated with the SeriesPair array, similar to the Pandas Series approach. More than inserts into the structure, the main operation is alignments on rows and truncation (drop the first 9 rows used up for a 10-period ma, e.g.).

I think Datetime will sort out issues related to irregular rows. For example, suppose you have daily data. You'd like to collapse it to weekly data and have the last day of the week as the date element and the highest value during the week as the value element. Further suppose that you have one week that ends on a Thursday, unlike others that end on a Friday. The collapse method in Series.jl will gladly take Fridays when available and Thursdays if needed. I have set up tests for this but as I'm typing I'm not sure I've tested that scenario.
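The weekly collapse described here can be sketched in current Julia as below. `collapse_weekly` is a hypothetical helper written for illustration, not the `collapse` method in Series.jl; it assumes the input is already sorted by date and uses the standard `Dates` library rather than the Datetime package. The values are arbitrary.

```julia
using Dates

# Hypothetical helper: for each calendar week keep the last available date
# and the highest value seen during that week. Input must be sorted by date.
function collapse_weekly(dates::Vector{Date}, vals::Vector{Float64})
    out_dates = Date[]
    out_vals = Float64[]
    current_week = nothing
    for (d, v) in zip(dates, vals)
        wk = firstdayofweek(d)                 # Monday of the week containing d
        if wk == current_week
            out_dates[end] = d                 # a later day in the same week replaces the date
            out_vals[end] = max(out_vals[end], v)
        else
            push!(out_dates, d)                # first observation of a new week
            push!(out_vals, v)
            current_week = wk
        end
    end
    return out_dates, out_vals
end

# The second week has no Friday, so its date element falls on a Thursday.
dates = [Date(2014, 1, 6), Date(2014, 1, 8), Date(2014, 1, 10),
         Date(2014, 1, 13), Date(2014, 1, 16)]
vals = [1.0, 5.0, 2.0, 4.0, 3.0]
wdates, wvals = collapse_weekly(dates, vals)
```

With this input the first week ends on Friday 2014-01-10 and the second on Thursday 2014-01-16, so `dayofweek.(wdates)` gives 5 and 4.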

milktrader commented 10 years ago

@karbarcca yes indeed, I've conflated REQUIRE with calls to using. It's more of an issue that you need to call using DataFrames in the package's main file to be able to write methods dispatched on DataFrames, etc.

milktrader commented 10 years ago

julia> using Series, MarketData

julia> clw = collapse(cl, last, period=week);

julia> sum([dayofweek(clw[d].index) for d in 1:length(clw)] .== 4)
6

i.e., there are 6 rows in clw whose date element is a Thursday (the code is a bit compact)

milktrader commented 10 years ago

Here, this might be better

julia> clw[dayofweek(index(clw)).==4]
6-element Array{SeriesPair{Date{ISOCalendar},Float64},1}:
 1980-04-03  102.1500
 1980-07-03  117.4600
 1981-04-16  134.7000
 1981-07-02  128.6400
 1981-12-24  122.5400
 1981-12-31  122.5500

milktrader commented 10 years ago

I should note that the ideal TimeSeries type would be a Julian array whose index is not an Integer, but rather a time type. The chances that base would do this are likely near zero (though I haven't posited the idea yet). For this to happen in base, there would either need to be a backdoor to modify the row index type or a new TimeArray type.

Theoretically one could take the C code that Array is written in and modify it to accept time type as an alternate index to rows. I'm fairly certain this is what zoo did in R.

I'm not sure how this would work in a package. Now that Datetime is getting integrated into Base it might be worth an effort to explore this in a branch, calling it timearray, likely.

HarlanH commented 10 years ago

From a data structures and performance point of view, I'm not sure that this is quite right. I actually suspect that something closer to a B+ tree might be ideal for your needs. They're easy/fast to keep sorted as you insert/delete rows, they're very easy to iterate through, they can be merged faster than arrays, and the key can be anything orderable. Many database systems use B tree variants for indexes.

I've thought about using B+ trees for persistent DataFrames, with the key just being row number, but obviously haven't implemented anything.

milktrader commented 10 years ago

Yes, I need to do more than a wiki review of this topic. It sounds very interesting. Any texts that you would recommend?

HarlanH commented 10 years ago

I'm not up-to-date enough on this to recommend texts, sorry!

Thinking through something like an AbstractTimeSeries and its possible implementations might be a helpful process. Then you can retrofit your existing work into that framework and see how it feels, without blocking other implementations that may be better for various use-cases.
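As a rough illustration of what such a spec might look like in current Julia: the sketch below assumes a two-function interface, `timestamps` and `observations`. Those names, both concrete types, and `truncate_head` are hypothetical and exist in no package discussed here.

```julia
using Dates

# Hypothetical interface: an implementation supplies timestamps(ts) and observations(ts).
abstract type AbstractTimeSeries end

# Implementation 1: explicit, possibly irregular timestamps.
struct IrregularSeries <: AbstractTimeSeries
    dates::Vector{Date}
    vals::Vector{Float64}
end
timestamps(ts::IrregularSeries) = ts.dates
observations(ts::IrregularSeries) = ts.vals

# Implementation 2: a regular daily grid that stores only the start date.
struct DailyGridSeries <: AbstractTimeSeries
    start::Date
    vals::Vector{Float64}
end
timestamps(ts::DailyGridSeries) = [ts.start + Day(k - 1) for k in 1:length(ts.vals)]
observations(ts::DailyGridSeries) = ts.vals

# Generic code is written once against the interface, e.g. dropping the
# first n rows consumed by a moving-average warm-up.
function truncate_head(ts::AbstractTimeSeries, n::Integer)
    return timestamps(ts)[n+1:end], observations(ts)[n+1:end]
end

a = IrregularSeries([Date(2014, 1, 27), Date(2014, 1, 29), Date(2014, 1, 30)], [1.0, 2.0, 3.0])
b = DailyGridSeries(Date(2014, 1, 27), [1.0, 2.0, 3.0])
truncate_head(a, 1)
truncate_head(b, 2)
```

Existing work would then be retrofitted by defining the two accessor methods for its type, leaving other storage layouts free to do the same.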

HarlanH commented 10 years ago

Also. After writing up some sort of AbstractTimeSeries spec, and analyzing which operations are common and need to be fast (and which operations are rare and can be slow), it might be worth asking on julia-dev for help deciding on implementation details. There are definitely people there who are much better at that sort of thing than any of us contributing to this issue...

milktrader commented 10 years ago

I've pushed the current Series.jl repo to the seriespair branch of TimeSeries, for anyone interested in checking out the functionality with a branch checkout versus a Pkg.clone

nalimilan commented 10 years ago

If you want to index arrays using objects of an arbitrary type, then I think what you want is a NamedArray or an AssociativeArray. See https://github.com/davidavdav/NamedArrays (instead of strings, "names" would be dates). But as @HarlanH said, I don't think this is the best design at all (except maybe in some specific cases).

johnmyleswhite commented 10 years ago

I can't really keep up with this conversation (so feel free to ignore what I'm saying), but I would not use a NamedArray until you know that you need to hash indices to make things work. I can imagine that there are important special cases of time series data (specifically data that's regularly spaced in time) for which you could use dates as indices and make them super fast because you'd only need to do some simple arithmetic to translate dates into indices. That's going to be 10x faster than calling a hash function, which is doing at least 50 CPU instructions per index.

I think @HarlanH's original point was dead on: only worry about desired behaviors in the first pass and then get advice from the broader community about implementation. It's very easy to get the implementation wrong if you think you're going to support behaviors you don't need.

milktrader commented 10 years ago

I think I understand why NamedArrays is not an ideal solution here.

julia> using Datetime, NamedArrays

julia> n = NamedArray(rand(2,4));

julia> dates = [today(), today()+days(1)]
2-element Array{Date{ISOCalendar},1}:
 2014-01-27
 2014-01-28

julia> setnames!(n, dates, 1)
[2014-01-27=>1,2014-01-28=>2]

Is it because there is a mapping between 1 and an arbitrary object, in this case 2014-01-27? I think this is expensive, but need to think if it matters a lot or just a little.

johnmyleswhite commented 10 years ago

My point was much simpler (and potentially way less important): hashing costs a lot more than indexing using something with a trivial transformation into numbers. For constant interval time series, you can probably compute indices using something like index = (CurrentDate - StartDate) / PeriodDate. That's going to be a lot simpler than working with hash functions, which do a lot more arithmetic operations to produce an index.

The only point I'm sure of is that implementation shouldn't be the main concern: start with behaviors.
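To make the arithmetic concrete, here is a small sketch in current Julia using the standard `Dates` library. `date_to_index` is a hypothetical helper, the series is assumed to be regularly spaced in whole days, and the values are arbitrary. The `+ 1` is there only because Julia arrays are 1-based.

```julia
using Dates

# Hypothetical helper: position of `current` in a regularly spaced series
# that starts at `start` and advances by `step` days.
function date_to_index(current::Date, start::Date, step::Day)
    offset = Dates.value(current - start)     # whole days between the two dates
    stepdays = Dates.value(step)
    offset % stepdays == 0 || throw(ArgumentError("date is not on the grid"))
    return div(offset, stepdays) + 1
end

start = Date(2014, 1, 27)
vals = [10.0, 20.0, 30.0, 40.0]               # one arbitrary value per day

# Arithmetic lookup: a subtraction and a division, no hashing.
i = date_to_index(Date(2014, 1, 29), start, Day(1))

# The hash-based alternative for comparison: a Dict from date to position.
lookup = Dict(start + Day(k - 1) => k for k in 1:length(vals))
j = lookup[Date(2014, 1, 29)]
```

Both routes land on the same position; the difference is only in how much work each lookup does, and the arithmetic route stops working as soon as the spacing is irregular.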

milktrader commented 10 years ago

Good point. https://github.com/JuliaStats/Roadmap.jl/issues/7

milktrader commented 10 years ago

I'd like to check off the question of whether TimeSeries and TimeModels should be one or two packages, as settled: they should be separate. Early R packages combined the two, but as the language matured that was abandoned in favor of bifurcation. @carljv has a good argument in favor of this at the beginning of the issue.

I'll check it off as settled in a few days, unless we have some objections.

milktrader commented 10 years ago

Also on the short list of check offs is the name TimeSeries. I think I'm about the only one that was reticent about this name and I'm no longer. I'll let it percolate for a little while.

milktrader commented 10 years ago

Just some notes about naming things and how the pieces fit together (subject to change of course)

TimeSeries is the package.

TimePair is an immutable tuple-like type containing a date, value and name.

TimeVector is a 1-dimensional array of TimePairs that enforces constraints (such as no duplicate dates, identical names across all of its TimePair objects, and others).

TimeArray is a type that supports multicolumn arrays attached to a date element. TimeVectors can be combined into a TimeArray.

TimeFrame is a type that supports a multicolumn DataFrame attached to a date element. Since this type depends upon other packages, it will be a sub-module of TimeSeries and must be explicitly imported (i.e., using TimeSeries.frames). <- not sure this will be good enough.
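A minimal sketch of how the first two types might fit together, in current Julia syntax. This is an illustration under assumptions, not the TimeSeries.jl implementation: the field names, the `String` name field, and the choice to enforce the constraints in an inner constructor are all hypothetical, and only the two constraints named above are checked.

```julia
using Dates

# Immutable, tuple-like record: a date, a value and a name.
struct TimePair{V}
    date::Date
    value::V
    name::String
end

# A 1-dimensional collection of TimePairs that checks its constraints on construction.
struct TimeVector{V}
    pairs::Vector{TimePair{V}}
    function TimeVector{V}(pairs::Vector{TimePair{V}}) where {V}
        allunique(p.date for p in pairs) || throw(ArgumentError("duplicate dates"))
        length(unique(p.name for p in pairs)) <= 1 || throw(ArgumentError("names differ"))
        return new{V}(pairs)
    end
end

# Convenience constructor that infers V from the argument.
TimeVector(pairs::Vector{TimePair{V}}) where {V} = TimeVector{V}(pairs)

tv = TimeVector([TimePair(Date(2014, 1, 27), 1.0, "close"),
                 TimePair(Date(2014, 1, 28), 2.0, "close")])
```

Constructing a TimeVector from two pairs that share a date, or that carry different names, throws an `ArgumentError` instead of producing an invalid object.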

milktrader commented 10 years ago

I'm going to close this as the checklist at the top is complete. It might be better to continue any ideas, thoughts, suggestions at the TimeSeries issues location. Thanks for all the input!