JuliaStats / Roadmap.jl

A centralized location for planning the direction of JuliaStats

TimeArray behaviors #7

Closed milktrader closed 10 years ago

milktrader commented 10 years ago

A time-series type (let's call it TimeArray) should have the following features:

And the following behavior:

ta[[date(1980, 1, 1):date(1980, 1, 31)]]
ta[Tuesday]

ta["price_range"] = ta["high"] - ta["low"] # returns ta with new column named price_range
ta["log_returns"] = percentchange(ta["Close"], method=log) # returns ta with new column named log_returns

price_range = TimeArray(ta["high"] - ta["low"], colname="price_range")
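
For readers on current Julia, here is a minimal sketch of the two behaviors wished for above (selecting rows by date range and deriving a new column). It uses plain vectors and the Dates standard library rather than any TimeArray type. The variable names and sample prices are made up for illustration.

```julia
using Dates

# Hypothetical sample data: three trading days with high and low prices.
dates = [Date(1980, 1, 3), Date(1980, 1, 4), Date(1980, 2, 1)]
high  = [106.08, 107.08, 115.54]
low   = [103.26, 105.09, 112.13]

# Row selection by date range: keep observations that fall in January 1980.
in_january = (dates .>= Date(1980, 1, 1)) .& (dates .<= Date(1980, 1, 31))
january_dates = dates[in_january]

# A derived column: element-wise difference of two existing columns.
price_range = high .- low
```

A real time-series type would hide the boolean mask behind its indexing methods, but the underlying operation is the same.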

NOTE: this list is not static and I'll add to it.

nalimilan commented 10 years ago

Do you want to allow columns with different types, i.e. is TimeArray similar to DataFrame? If so, I'd say don't call it Array. (The design you describe sounds more similar to DataFrames to me.)

Also when you say

operations along rows should be fast
operations along columns should be fast

I'm not sure what it means, except "everything should be as fast as possible". Usually one decides whether operations on rows or on columns should be the fastest (memory order). Saying that both should be fast, OTOH, is quite vague.
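
To make the memory-order point concrete, here is a small sketch of how Julia lays out a dense matrix. The matrix `A` and the helper `colsum` are illustrative names only; the layout itself (column-major) is how Julia's built-in `Array` works.

```julia
# Julia stores dense arrays column by column (column-major order).
A = [1 2 3;
     4 5 6]

# Flattening follows memory order, so each column's elements are contiguous.
flat = vec(A)            # [1, 4, 2, 5, 3, 6]

# strides gives the memory step between neighbors along each dimension:
# 1 element down a column, size(A, 1) elements across a row.
steps = strides(A)       # (1, 2)

# Putting the row index in the inner loop walks memory sequentially.
function colsum(M)
    total = zero(eltype(M))
    for j in axes(M, 2), i in axes(M, 1)   # i varies fastest
        total += M[i, j]
    end
    return total
end
```

With this layout, reading one column touches adjacent memory, while reading one row (one date across all columns) jumps by `size(A, 1)` elements each step. That is the tradeoff being asked about.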

HarlanH commented 10 years ago

Yes, I agree with Milan here. What sort of row and column operations need to be fast? Aggregations? Inserts? Appends? Windowing operations?

And it does seem to me that you want non-numerical (e.g., nominal covariate) columns, in addition to numeric ones. That implies a structure that's closer to a DataFrame but with a key column that's a date or time and is enforced to always be sorted. (Perhaps implemented as chunks indexed by a B tree, but that's a detail.)
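
As a rough illustration of why an always-sorted key column is attractive, here is a sketch using binary search from Julia's Base on a plain sorted vector of dates. This is not the B-tree layout mentioned above; `key`, `price`, and the sample values are hypothetical.

```julia
using Dates

# Hypothetical sorted date key and a parallel value column.
key   = [Date(1980, 1, 3), Date(1980, 1, 4), Date(1980, 1, 7), Date(1980, 1, 8)]
price = [105.22, 106.52, 106.81, 108.95]
@assert issorted(key)   # the invariant that makes binary search valid

# Binary search for the first and last rows inside a date window.
lo = searchsortedfirst(key, Date(1980, 1, 4))
hi = searchsortedlast(key, Date(1980, 1, 7))
window = price[lo:hi]
```

Because the key is sorted, a date-window query costs two logarithmic searches plus a contiguous slice, instead of a scan over every row.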

milktrader commented 10 years ago

I've edited the task list to omit the goal of fast column operations.

I'll let others chime in, but I think that rows are where most computation happens. Many times it's a particular day's data or a nearby date's data (that's where lag and lead come in) that gets analyzed. For example, the day's range in prices is -(high, low).

Of course windowing is also of particular interest, and used in moving averages and moving standard deviations.
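
A minimal sketch of the kind of windowing operation meant here, assuming a trailing window over a plain vector. The function name `moving_mean` and the sample closes are illustrative and not taken from any package.

```julia
using Statistics

# Trailing moving average over a window of `n` observations.
# Returns length(x) - n + 1 values; the first one covers x[1:n].
function moving_mean(x::AbstractVector{<:Real}, n::Integer)
    1 <= n <= length(x) ||
        throw(ArgumentError("window must be between 1 and length(x)"))
    return [mean(view(x, i-n+1:i)) for i in n:length(x)]
end

closes = [105.22, 106.52, 106.81, 108.95]
ma2 = moving_mean(closes, 2)
```

A moving standard deviation has the same shape with `std` in place of `mean`. This version recomputes every window from scratch, so a production implementation would likely use a running update instead.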

As to type of array data, I think it should behave as we expect Julian arrays to behave -- all elements of the array are of the same type.

When I was starting up the dream list I was thinking about the "Why we wrote Julia" call to wanting to have it all! Realistically, some tradeoffs are expected.

nalimilan commented 10 years ago

I think you shouldn't be too restrictive about what types the columns should be. For a first implementation, it may be fine to require a single type, but people will inevitably need to use different types at some point.

Regarding rows/columns efficiency, different implementations may be useful if very different use cases arise. The general API should remain the same, though.

carljv commented 10 years ago

This discussion has gone through a lot of paces in the last week, and I'm not sure I'm totally caught up. As such, please disregard if this is all nonsense.

I feel like the descriptions in these discussions are reflecting two visions for the library. I think we need to decide whether TimeSeries is:

(1) A relatively lightweight type for doing classic time series analysis: data is all numeric, and you're doing ARIMA, GARCH, etc.; or

(2) A bigger, more ambitious stab at an Indexed DataFrame type (with the index being some conceptual notion of "time," indicating both the order of the observations and the distance between them). This would facilitate more powerful tools for data manipulation based on the information provided in the index. This seems to be what @HarlanH is describing.

In (1), there's some room for argument about a dependency on DataFrames. If this is just a type that mainly exists to play nicely with TimeModels functions, then DataFrames may have unnecessary baggage.

For (2) a dependency on DataFrames seems a foregone conclusion. And I think something like this definitely should exist in Julia. There are a lot of operations that won't make sense with non-numeric data (some imputations, rolling arithmetic, downsampling arithmetic). But there's a lot of functionality you can apply to a timestamp index that is data-type agnostic: LOCF, alignment of binary ops between time-series, merges/joins, queries/indexing, etc.
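
To illustrate the "data-type agnostic" point, here is a sketch of LOCF (last observation carried forward) that never looks at the element type. It uses `missing` from current Julia as a stand-in for the `NA` of the DataArrays era; the function name `locf` is illustrative.

```julia
# Last observation carried forward (LOCF): replace each `missing` with the
# most recent non-missing value before it. Leading missings stay missing.
function locf(x::AbstractVector)
    out = Vector{Union{Missing, eltype(x)}}(undef, length(x))
    last_seen = missing
    for (i, v) in enumerate(x)
        if !ismissing(v)
            last_seen = v
        end
        out[i] = last_seen
    end
    return out
end

filled_numbers = locf([1.0, missing, missing, 4.0, missing])
filled_labels  = locf([missing, "buy", missing])
```

The same function fills a numeric column and a column of labels, which is the sense in which operations like this depend only on row order and not on the data being numeric.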

And other behaviors should be decided differently between cases (1) and (2). For example your condition that rows can only be added for new timestamps makes a lot of sense in (1), but seems over-restrictive in (2).

It may make sense for these to be different endeavors. Ultimately I see (2) as a superset of (1) that would probably absorb it or deprecate it at some point. But if you want an up-and-running type for dealing with, e.g., financial data, and applying time series models, (1) may be the horse to ride at the moment. Just enough moving parts and a relatively simple API.

Again, as @johnmyleswhite and @HarlanH pointed out, I think a more detailed discussion and consensus on what we want the features and API to be is in order. Implementations will naturally fall out of that. For example, resampling and imputation aren't in your list right now, but are really useful features. I also find querying/indexing on time to be awkward in almost every language and library I use; so that's maybe something we could think about how to do better. But maybe that's overkill for the use cases you have in mind. I don't know.

milktrader commented 10 years ago

There was favorable input to include TimeData in METADATA, so this sort of begins to clear the air. The new package explicitly uses the DataFrames/DataArrays data structure in three types.

type Timedata <: AbstractTimedata
    vals::DataFrame
    dates::DataArray
    # inner constructor enforces lengths of DataFrame(vals) & DataArray(dates) are equal
end

type Timenum <: AbstractTimenum
    vals::DataFrame
    dates::DataArray
    # inner constructor as above but enforce values in DataFrame (vals) are subtype of abstract Number
end

type Timematr <: AbstractTimematr
    vals::DataFrame
    dates::DataArray
    # inner constructor as above but enforce that DataFrame (vals) cannot contain NAs
end

This opens the door for TimeSeries to remove its dependency on DataFrames/DataArrays (at least for the time being) and implement something along the lines of this:

type TimeArray{T,N}
  timestamp::Array{Date{ISOCalendar},1}
  values::Array{T,N}
  colnames::Array{String,1}

  # inner constructor to enforce length(timestamp) == length(values)
end

carljv commented 10 years ago

Agreed. Hopefully both packages will try to ensure that APIs are consistent between them for overlapping functionality.

milktrader commented 10 years ago

This README is changing as I add features to the prototype, but it meets some of the goals listed at the top. https://github.com/milktrader/TimeArrays.jl/blob/master/README.md

Specifically these features:

And this behavior:

To save time in referring to the linked README (which changes quite a bit anyway), here are the examples:

julia> using TimeArrays, MarketData

julia> ohlc = TimeArray(op, hi, lo, cl); # construct TimeArray from SeriesPair objects

julia> ohlc.colnames = ["Open", "High", "Low", "Close"]; # override the default "value" names

julia> ohlc[10]
1x4 Array{Float64,2} 1980-01-16 to 1980-01-16

             Open       High    Low     Close
1980-01-16 | 111.14     112.9   110.38  111.05

julia> ohlc[1:2]
2x4 Array{Float64,2} 1980-01-03 to 1980-01-04

             Open       High    Low     Close
1980-01-03 | 105.76     106.08  103.26  105.22
1980-01-04 | 105.22     107.08  105.09  106.52

julia> ohlc[[1,2,10]]
3x4 Array{Float64,2} 1980-01-03 to 1980-01-16

             Open       High    Low     Close
1980-01-03 | 105.76     106.08  103.26  105.22
1980-01-04 | 105.22     107.08  105.09  106.52
1980-01-16 | 111.14     112.9   110.38  111.05

julia> firstday, tenthday
(1980-01-03,1980-01-16)

julia> ohlc[firstday]
1x4 Array{Float64,2} 1980-01-03 to 1980-01-03

             Open       High    Low     Close
1980-01-03 | 105.76     106.08  103.26  105.22

julia> ohlc[firstday:days(5):tenthday]
2x4 Array{Float64,2} 1980-01-03 to 1980-01-08

             Open       High    Low     Close
1980-01-03 | 105.76     106.08  103.26  105.22
1980-01-08 | 106.81     109.29  106.29  108.95

julia> ohlc[[firstday, secondday, tenthday]]
3x4 Array{Float64,2} 1980-01-03 to 1980-01-16

             Open       High    Low     Close
1980-01-03 | 105.76     106.08  103.26  105.22
1980-01-04 | 105.22     107.08  105.09  106.52
1980-01-16 | 111.14     112.9   110.38  111.05

julia> ohlc["Open"]
505x1 Array{Float64,1} 1980-01-03 to 1981-12-31

             Open
1980-01-03 | 105.76
1980-01-04 | 105.22
1980-01-07 | 106.52
1980-01-08 | 106.81
...
1981-12-24 | 122.31
1981-12-28 | 122.54
1981-12-29 | 122.27
1981-12-30 | 121.67
1981-12-31 | 122.3

julia> ohlc["Open", "Close"]
505x2 Array{Float64,2} 1980-01-03 to 1981-12-31

             Open       Close
1980-01-03 | 105.76     105.22
1980-01-04 | 105.22     106.52
1980-01-07 | 106.52     106.81
1980-01-08 | 106.81     108.95
...
1981-12-24 | 122.31     122.54
1981-12-28 | 122.54     122.27
1981-12-29 | 122.27     121.67
1981-12-30 | 121.67     122.3
1981-12-31 | 122.3      122.55

milktrader commented 10 years ago

You can also get only Tuesdays

julia> tue = x -> x[dayofweek(x) .== 2];

julia> ohlc[tue(ohlc.timestamp)]
103x4 Array{Float64,2} 1980-01-08 to 1981-12-29

             Open       High    Low     Close
1980-01-08 | 106.81     109.29  106.29  108.95
1980-01-15 | 110.38     111.93  109.45  111.14
1980-01-22 | 112.1      113.1   110.92  111.51
1980-01-29 | 114.85     115.77  113.03  114.07
...
1981-12-01 | 126.35     127.3   124.84  126.1
1981-12-08 | 125.19     125.75  123.52  124.82
1981-12-15 | 122.78     123.78  121.83  122.99
1981-12-22 | 123.34     124.17  122.19  122.88
1981-12-29 | 122.27     122.9   121.12  121.67

julia> run(`cal 1 1980`)
    January 1980
Su Mo Tu We Th Fr Sa
       1  2  3  4  5
 6  7  8  9 10 11 12
13 14 15 16 17 18 19
20 21 22 23 24 25 26
27 28 29 30 31
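
For comparison, a sketch of the same Tuesday filter written against the Dates standard library that ships with current Julia (not the Datetime package used in this thread). The `timestamps` vector is a made-up stand-in for `ohlc.timestamp`.

```julia
using Dates

# Hypothetical timestamps: every calendar day in January 1980.
timestamps = collect(Date(1980, 1, 1):Day(1):Date(1980, 1, 31))

# Dates numbers weekdays 1 (Monday) through 7 (Sunday); Dates.Tuesday == 2.
tuesdays = filter(d -> dayofweek(d) == Dates.Tuesday, timestamps)
```

The result can be checked against the calendar printed above: the Tuesdays in January 1980 fall on the 1st, 8th, 15th, 22nd, and 29th.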

Bonus points awarded for an anonymous function that defines the day after Thanksgiving. Hmm, probably low volume and likely an up day. What are the odds that you'd make a profit if you bought the market two ticks below the close on the day before Thanksgiving and sold it two ticks below the close on the day after?

quinnj commented 10 years ago

That will hopefully be made even easier once enum support drops in Base. Days of the week will then be defined as an enum (DAYOFWEEK => Sunday, Monday, Tuesday,...) and you can define

getindex(A::TimeArray, d::DAYOFWEEK) = A[dayofweek(A) .== d]

and then just run

ohlc[Tuesday]

milktrader commented 10 years ago

@karbarcca that will be an awesome add! Did you notice my bonus question add above? I'll stub out that getindex method
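
A sketch of what such a `getindex` method could look like in current Julia. In today's Dates standard library the weekday names are plain integer constants rather than an enum, so this sketch introduces a small wrapper type to keep weekday indexing distinct from integer row indexing. `MiniSeries` and `WeekdayKey` are hypothetical names, not part of any package.

```julia
using Dates

# Hypothetical stand-in for a time series: timestamps plus one value column.
struct MiniSeries
    timestamp::Vector{Date}
    values::Vector{Float64}
end

# Wrapper so a weekday is not confused with an integer row index.
struct WeekdayKey
    number::Int   # 1 = Monday ... 7 = Sunday, as returned by dayofweek
end

# Indexing by weekday keeps only the rows that fall on that day.
function Base.getindex(s::MiniSeries, w::WeekdayKey)
    keep = dayofweek.(s.timestamp) .== w.number
    return MiniSeries(s.timestamp[keep], s.values[keep])
end

# One week starting Monday 1980-01-07, with values 1.0 through 7.0.
series = MiniSeries(collect(Date(1980, 1, 7):Day(1):Date(1980, 1, 13)),
                    collect(1.0:7.0))
tue = series[WeekdayKey(Dates.Tuesday)]
```

The boolean mask built inside `getindex` is the same logical-indexing step that the BitArray method mentioned in this thread would need to support.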

milktrader commented 10 years ago

This will also require a getindex method on BitArray but I needed that one anyway.

milktrader commented 10 years ago

Here is the inner constructor for TimeArray, which enforces invariants. I think I've got a reasonable suite of checks going here, at least enough to catch inadvertent mistakes, as opposed to intentional breakage.

  function TimeArray(timestamp::Vector{Date{ISOCalendar}}, values::Array{T,N}, colnames::Vector{ASCIIString})
    nrow, ncol = size(values, 1), size(values, 2)
    nrow != size(timestamp, 1) ? error("values must match length of timestamp"):
    ncol != size(colnames,1) ? error("column names must match width of array"):
    timestamp != unique(timestamp) ? error("there are duplicate dates"):
    ~(flipud(timestamp) == sort(timestamp) || timestamp == sort(timestamp)) ? error("dates are mangled"):
    flipud(timestamp) == sort(timestamp) ? 
    new(flipud(timestamp), flipud(values), colnames):
    new(timestamp, values, colnames)
  end
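
For readers on current Julia, the same set of checks might be sketched as follows. This is a re-expression under assumptions, not the package's code: it uses `Date` from the Dates standard library, `String` column names, and `reverse` in place of `flipud`. The type name `CheckedSeries` is hypothetical.

```julia
using Dates

struct CheckedSeries{T}
    timestamp::Vector{Date}
    values::Matrix{T}
    colnames::Vector{String}

    function CheckedSeries(timestamp::Vector{Date}, values::Matrix{T},
                           colnames::Vector{String}) where {T}
        size(values, 1) == length(timestamp) ||
            throw(ArgumentError("values must match length of timestamp"))
        size(values, 2) == length(colnames) ||
            throw(ArgumentError("column names must match width of array"))
        allunique(timestamp) ||
            throw(ArgumentError("there are duplicate dates"))
        if issorted(timestamp)
            return new{T}(timestamp, values, colnames)
        elseif issorted(timestamp; rev=true)
            # Descending input is accepted and flipped to ascending order.
            return new{T}(reverse(timestamp), reverse(values; dims=1), colnames)
        else
            throw(ArgumentError("dates are mangled"))
        end
    end
end
```

Guard clauses with `||` replace the chained ternaries, so each invariant fails with its own message before any object is constructed.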

quinnj commented 10 years ago

Not sure about the odds on making profit ;) but to get the day after thanksgiving, you can just do

recur(Date(2000,1,1),Date(2015,1,1)) do x
    month(x) == November &&
    dayofweek(x) == Friday &&
    dayofweekofmonth(x) == 4
end

That is, we want the 4th Friday in November from 2000-01-01 through 2015-01-01 (given that Thanksgiving is the 4th Thursday of November). Note that this format is the current implementation in the Base PR.
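
One edge case is worth noting: when November begins on a Friday (as in 2013), the Friday after Thanksgiving is the fifth Friday of the month, so a "4th Friday" filter lands a week early. A sketch that avoids this, assuming the Dates standard library of current Julia and using `filter` over a date range in place of `recur`, tests whether the previous day is the fourth Thursday. The helper names are illustrative.

```julia
using Dates

# US Thanksgiving is the fourth Thursday of November.
is_thanksgiving(d::Date) =
    month(d) == 11 && dayofweek(d) == Dates.Thursday && dayofweekofmonth(d) == 4

# The day after Thanksgiving is the date whose previous day is Thanksgiving.
is_day_after(d::Date) = is_thanksgiving(d - Day(1))

days_after = filter(is_day_after, Date(2000, 1, 1):Day(1):Date(2015, 1, 1))
```

For 2013 this yields November 29 rather than November 22, and every returned date is a Friday by construction.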

milktrader commented 10 years ago

I really like that API for sharpshooting a time slot. I am getting an error though. Is that because I'm using the current Datetime package and not the PR?

julia> recur(Date(2000,1,1),Date(2015,1,1)) do x
           month(x) == November &&
           dayofweek(x) == Friday &&
           dayofweekofmonth(x) == 4
       end
ERROR: type cannot be constructed

quinnj commented 10 years ago

Yeah, ranges aren't implemented yet in the Base PR (due to the rangeopocalypse). The new recur will just return an array of Dates anyway instead of a range object. In the current Datetime, you can do

t = [recur(date(2000,1,1):date(2015,1,1)) do x
  month(x) == November &&
  dayofweek(x) == Friday && 
  dayofweekinmonth(x) == 4
end]

milktrader commented 10 years ago

To achieve the goal of containing NAs, it might be best to look into DataArrays. I haven't completely grokked why the packages are separate, but a Google Groups reply by @johnmyleswhite on julia-users seems to shed some light:

And the introduction to DataFrames specifically mentions that DataArray is an efficient Julian array but with NAs. This suggests the following definition for TimeArray:

immutable TimeArray{T,N}
  timestamp::Vector{Date{ISOCalendar}}
  values::DataArray{T,N} 
  # batteries included for NAs and column names with DataArray

  # inner constructor to enforce obvious invariants (lengths of elements are equal, etc)
end

TimeVector is an alias for a TimeArray whose value is a DataVector (similar to Pandas Series) and TimeMatrix is an alias for TimeArray whose value is a DataMatrix.

This means throwing most of my hacked-together code in the trash, which is fine with me. At this point, I'm ready to abandon the SeriesPair idea. It adds a layer of complexity and is slower than the newest TimeArray brainstorm. I'm still interested in going down the original TimeArray path that the TimeArrays.jl package implements.

Thoughts?

I can start a branch in TimeSeries named DataArray (I'd like to reserve TimeArray for the time being since I still have a package by that name). I'll set up the skeleton and stub out the tests. I've been using the FactCheck package lately (it's a little rspec and a little testthat), and anyone can make intuitive sense of the workflow there. If not, we can open an issue to discuss it.

If someone else wants to take on this branch idea, please do! I can spend a little more time on the original TimeArray data structure in the meantime. I'll start by opening an issue on TimeSeries.

https://github.com/JuliaStats/TimeSeries.jl/issues/52

milktrader commented 10 years ago

@karbarcca aha, that's what I was missing. I really like that it returns an Array{Date{ISOCalendar},1}, which should make subsetting time series data intuitive.

quinnj commented 10 years ago

I think leveraging the DataArray package is an excellent approach. It should be a solid, efficient package ready to be leveraged for exactly this kind of thing.

Yeah, returning a range object was a little weird. I think we can still find an efficient solution for iteration too. I'm enjoying seeing the progress of time series code. I wish I had more time to dive in, but school's a little crazy right now. I'll try to poke around when I can.

milktrader commented 10 years ago

Okay, jump in any time you'd like @karbarcca. I'll give you commit access.

nalimilan commented 10 years ago

Yeah, that's a good idea, but I think you should remain more general. TimeArray should support standard Arrays too, for cases where you do not want NAs and need higher performance.

But more importantly, I still believe you should be able to associate variables of different types to a date, i.e. like a DataFrame.

milktrader commented 10 years ago

There might be room for both a TimeArray and a TimeDataArray, if the performance difference is indeed noticeable. The other type that leverages the database structure of DataFrames could be called TimeFrame.

These three types mimic what @cgroll is doing in the TimeData package with Timedata, Timenum and Timematr.

nalimilan commented 10 years ago

Yeah, but you probably don't care what the actual type of the underlying array is. I guess you could do everything you need with only the AbstractArray interface, so you can support both Arrays and DataArrays in the same structure.
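
A sketch of that idea in current Julia: the container is parameterized on the storage type, so anything that implements the AbstractArray interface can back it. `GenericSeries` and `nrows` are hypothetical names, and an array with `missing` entries stands in for a DataArray here.

```julia
using Dates

# A container parameterized on the storage type, so any AbstractArray works.
struct GenericSeries{T, N, A<:AbstractArray{T, N}}
    timestamp::Vector{Date}
    values::A

    function GenericSeries(timestamp::Vector{Date},
                           values::AbstractArray{T, N}) where {T, N}
        size(values, 1) == length(timestamp) ||
            throw(ArgumentError("values must match length of timestamp"))
        return new{T, N, typeof(values)}(timestamp, values)
    end
end

# Code written against the AbstractArray interface is storage-agnostic.
nrows(s::GenericSeries) = size(s.values, 1)

dates = [Date(1980, 1, 3), Date(1980, 1, 4)]
dense = GenericSeries(dates, [105.76 105.22; 105.22 106.52])
with_missing = GenericSeries(dates, [105.76, missing])
```

Because the storage type is a type parameter rather than an abstract field, the compiler still specializes on the concrete array type, so the dense case should not pay for the flexibility.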

milktrader commented 10 years ago

I'm closing this for now. Let's continue discussing these issues at the TimeSeries repo issues page. Thanks for all the input so far!