Closed milktrader closed 10 years ago
Do you want to allow columns with different types, i.e. is TimeArray similar to DataFrame? If so, I'd say don't call it Array. (The design you describe sounds more similar to DataFrames to me.)
Also when you say
> operations along rows should be fast
> operations along columns should be fast
I'm not sure what it means, except "everything should be as fast as possible". Usually one decides whether operations on rows or on columns should be the fastest (memory order). Saying that both should be fast, on the other hand, is quite vague.
Yes, I agree with Milan here. What sort of row and column operations need to be fast? Aggregations? Inserts? Appends? Windowing operations?
And it does seem to me that you want non-numerical (e.g., nominal covariate) columns, in addition to numeric ones. That implies a structure that's closer to a DataFrame but with a key column that's a date or time and is enforced to always be sorted. (Perhaps implemented as chunks indexed by a B tree, but that's a detail.)
(Sent by email in reply to Milan Bouchet-Valat's comment of Mon, Jan 27, 2014 at 12:18 PM, above.)
I've edited the task list to omit the goal of fast column operations.
I'll let others chime in, but I think that rows are where most computation happens. Many times it's a particular day's data or a nearby date's data (that's where lag and lead come in) that gets analyzed. The day's range in prices, for example, is -(high, low).
Of course windowing is also of particular interest, and used in moving averages and moving standard deviations.
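As a rough illustration of the row-oriented work described above (day range, lag, and a moving average), here is a minimal sketch in current Julia syntax. It is not code from any package in this thread: `moving_mean` is a hypothetical helper, and the three rows are copied from the sample output shown later in the discussion.

```julia
using Statistics

# Three rows (1980-01-03, 1980-01-04, 1980-01-08) from the sample data in this thread.
high   = [106.08, 107.08, 109.29]
low    = [103.26, 105.09, 106.29]
closes = [105.22, 106.52, 108.95]

# The day's range: an elementwise operation along each row.
price_range = high .- low

# Lag: pair each close with the previous row's close.
change = closes[2:end] .- closes[1:end-1]

# Hypothetical helper: trailing moving average over a window of n rows.
moving_mean(x::AbstractVector, n::Integer) =
    [mean(@view x[i-n+1:i]) for i in n:length(x)]

ma2 = moving_mean(closes, 2)
```

Each of these touches a row and its neighbours, which is the access pattern the comment argues should be the fast one.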
As to type of array data, I think it should behave as we expect Julian arrays to behave -- all elements of the array are of the same type.
When I was starting up the dream list I was thinking about the "Why we wrote Julia" call to wanting to have it all! Realistically, some tradeoffs are expected.
I think you shouldn't be too restrictive about what types the columns should be. For a first implementation, it may be fine to require a single type, but people will inevitably need to use different types at some point.
Regarding rows/columns efficiency, different implementations may be useful if very different use cases arise. The general API should remain the same, though.
This discussion has gone through a lot of paces in the last week, and I'm not sure I'm totally caught up. As such, please disregard if this is all nonsense.
I feel like the descriptions in these discussions are reflecting two visions for the library. I think we need to decide whether TimeSeries is:
(1) A relatively lightweight type for doing classic time series analysis type stuff: data is all numeric, and you're doing ARIMA, GARCH, etc.; or
(2) A bigger, more ambitious stab at an Indexed DataFrame type (with the index being some conceptual notion of "time," indicating both the order of the observations and distance between them). This would facilitate more powerful tools for data manipulation based on the information provided in the index. This seems to be what @HarlanH is describing.
In (1), there's some room for argument about a dependency on DataFrames. If this is just a type that mainly exists to play nicely with TimeModels functions, then DataFrames may have unnecessary baggage.
For (2) a dependency on DataFrames seems a foregone conclusion. And I think something like this definitely should exist in Julia. There are a lot of operations that won't make sense with non-numeric data (some imputations, rolling arithmetic, downsampling arithmetic). But there's a lot of functionality you can apply to a timestamp index that is data-type agnostic: LOCF, alignment of binary ops between time-series, merges/joins, queries/indexing, etc.
And other behaviors should be decided differently between cases (1) and (2). For example your condition that rows can only be added for new timestamps makes a lot of sense in (1), but seems over-restrictive in (2).
It may make sense for these to be different endeavors. Ultimately I see (2) as a superset of (1) that would probably absorb it or deprecate it at some point. But if you want an up-and-running type for dealing with, e.g., financial data, and applying time series models, (1) may be the horse to ride at the moment. Just enough moving parts and a relatively simple API.
Again, as @johnmyleswhite and @HarlanH pointed out, I think a more detailed discussion and consensus on what we want the features and API to be is in order. Implementations will naturally fall out of that. For example, resampling and imputation aren't in your list right now, but are really useful features. I also find querying/indexing on time to be awkward in almost every language and library I use, so that's maybe something we could think about how to do better. But maybe that's overkill for the use cases you have in mind. I don't know.
There was favorable input to include TimeData in METADATA, so this sort of begins to clear the air. The new package explicitly uses the DataFrames/DataArrays data structure in three types.
    type Timedata <: AbstractTimedata
        vals::DataFrame
        dates::DataArray
        # inner constructor enforces lengths of DataFrame(vals) & DataArray(dates) are equal
    end

    type Timenum <: AbstractTimenum
        vals::DataFrame
        dates::DataArray
        # inner constructor as above but enforce values in DataFrame (vals) are subtype of abstract Number
    end

    type Timematr <: AbstractTimematr
        vals::DataFrame
        dates::DataArray
        # inner constructor as above but enforce that DataFrame (vals) cannot contain NAs
    end
This opens the door for TimeSeries to remove its dependency on DataFrames/DataArrays (at least for the time being) and implement something along the lines of this:
    type TimeArray{T,N}
        timestamp::Array{Date{ISOCalendar},1}
        values::Array{T,N}
        colnames::Array{String,1}
        # inner constructor to enforce length(timestamp) == size(values, 1)
    end
Agreed. Hopefully both packages will try to ensure that APIs are consistent between them for overlapping functionality.
(Sent by email in reply to milktrader's comment of Monday, February 3, 2014 at 10:32 AM, above.)
This README is changing as I add features to the prototype, but it meets some of the goals listed at the top. https://github.com/milktrader/TimeArrays.jl/blob/master/README.md
Specifically these features:
And this behavior:
To save time in referring to the linked README (which changes quite a bit anyway), here are the examples:
julia> using TimeArrays, MarketData
julia> ohlc = TimeArray(op, hi, lo, cl); # construct TimeArray from SeriesPair objects
julia> ohlc.colnames = ["Open", "High", "Low", "Close"]; # over-ride the default "value" names
julia> ohlc[10]
1x4 Array{Float64,2} 1980-01-16 to 1980-01-16
Open High Low Close
1980-01-16 | 111.14 112.9 110.38 111.05
julia> ohlc[1:2]
2x4 Array{Float64,2} 1980-01-03 to 1980-01-04
Open High Low Close
1980-01-03 | 105.76 106.08 103.26 105.22
1980-01-04 | 105.22 107.08 105.09 106.52
julia> ohlc[[1,2,10]]
3x4 Array{Float64,2} 1980-01-03 to 1980-01-16
Open High Low Close
1980-01-03 | 105.76 106.08 103.26 105.22
1980-01-04 | 105.22 107.08 105.09 106.52
1980-01-16 | 111.14 112.9 110.38 111.05
julia> firstday, tenthday
(1980-01-03,1980-01-16)
julia> ohlc[firstday]
1x4 Array{Float64,2} 1980-01-03 to 1980-01-03
Open High Low Close
1980-01-03 | 105.76 106.08 103.26 105.22
julia> ohlc[firstday:days(5):tenthday]
2x4 Array{Float64,2} 1980-01-03 to 1980-01-08
Open High Low Close
1980-01-03 | 105.76 106.08 103.26 105.22
1980-01-08 | 106.81 109.29 106.29 108.95
julia> ohlc[[firstday, secondday, tenthday]]
3x4 Array{Float64,2} 1980-01-03 to 1980-01-16
Open High Low Close
1980-01-03 | 105.76 106.08 103.26 105.22
1980-01-04 | 105.22 107.08 105.09 106.52
1980-01-16 | 111.14 112.9 110.38 111.05
julia> ohlc["Open"]
505x1 Array{Float64,1} 1980-01-03 to 1981-12-31
Open
1980-01-03 | 105.76
1980-01-04 | 105.22
1980-01-07 | 106.52
1980-01-08 | 106.81
...
1981-12-24 | 122.31
1981-12-28 | 122.54
1981-12-29 | 122.27
1981-12-30 | 121.67
1981-12-31 | 122.3
julia> ohlc["Open", "Close"]
505x2 Array{Float64,2} 1980-01-03 to 1981-12-31
Open Close
1980-01-03 | 105.76 105.22
1980-01-04 | 105.22 106.52
1980-01-07 | 106.52 106.81
1980-01-08 | 106.81 108.95
...
1981-12-24 | 122.31 122.54
1981-12-28 | 122.54 122.27
1981-12-29 | 122.27 121.67
1981-12-30 | 121.67 122.3
1981-12-31 | 122.3 122.55
You can also get only Tuesdays:
julia> tue = x -> x[dayofweek(x) .== 2];
julia> ohlc[tue(ohlc.timestamp)]
103x4 Array{Float64,2} 1980-01-08 to 1981-12-29
Open High Low Close
1980-01-08 | 106.81 109.29 106.29 108.95
1980-01-15 | 110.38 111.93 109.45 111.14
1980-01-22 | 112.1 113.1 110.92 111.51
1980-01-29 | 114.85 115.77 113.03 114.07
...
1981-12-01 | 126.35 127.3 124.84 126.1
1981-12-08 | 125.19 125.75 123.52 124.82
1981-12-15 | 122.78 123.78 121.83 122.99
1981-12-22 | 123.34 124.17 122.19 122.88
1981-12-29 | 122.27 122.9 121.12 121.67
julia> run(`cal 1 1980`)
    January 1980
Su Mo Tu We Th Fr Sa
       1  2  3  4  5
 6  7  8  9 10 11 12
13 14 15 16 17 18 19
20 21 22 23 24 25 26
27 28 29 30 31
Bonus points awarded for an anonymous function that defines the day after Thanksgiving. Hmm, low volume probably and likely an up day. What are the odds that if you buy the market two ticks below the close on the day before Thanksgiving and sell it two ticks below the close the day after, you'd experience a profit?
That will hopefully be made even easier once enum support drops in Base. Days of the week will then be defined as an enum (DAYOFWEEK => Sunday, Monday, Tuesday,...) and you can define
getindex(A::TimeArray, d::DAYOFWEEK) = A[dayofweek(A) .== d]
and then just run
ohlc[Tuesday]
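The idea can be sketched in current Julia, where `@enum` and the `Dates` standard library exist. Everything named here is an assumption for illustration: `MiniTimeArray` is a hypothetical stand-in for the TimeArray type, and `WeekdayName` with lowercase members stands in for the proposed DAYOFWEEK enum (lowercase avoids clashing with the `Tuesday` constant that `Dates` exports).

```julia
using Dates

# Hypothetical minimal container; not the TimeArray type from this thread.
struct MiniTimeArray{T}
    timestamp::Vector{Date}
    values::Matrix{T}
    colnames::Vector{String}
end

# Hypothetical enum; numbering follows Dates.dayofweek (Monday == 1 ... Sunday == 7).
@enum WeekdayName monday=1 tuesday wednesday thursday friday saturday sunday

# Indexing by a weekday keeps only the rows whose timestamp falls on that day.
function Base.getindex(ta::MiniTimeArray, d::WeekdayName)
    keep = dayofweek.(ta.timestamp) .== Int(d)
    return MiniTimeArray(ta.timestamp[keep], ta.values[keep, :], ta.colnames)
end

dates = collect(Date(1980, 1, 7):Day(1):Date(1980, 1, 20))
ta = MiniTimeArray(dates, reshape(collect(1.0:length(dates)), :, 1), ["Close"])
tuesdays = ta[tuesday]   # rows for 1980-01-08 and 1980-01-15
```

Dispatching on the enum type keeps this method separate from integer, date, and column-name indexing.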
@karbarcca that will be an awesome add! Did you notice the bonus question I added above? I'll stub out that getindex method. This will also require a getindex method on BitArray, but I needed that one anyway.
Here is the inner constructor for TimeArray, which enforces invariants. I think I've got a reasonable suite of checks going here, at least enough to catch inadvertent mistakes, versus intentional breakage.
    function TimeArray(timestamp::Vector{Date{ISOCalendar}}, values::Array{T,N}, colnames::Vector{ASCIIString})
        nrow, ncol = size(values, 1), size(values, 2)
        nrow != size(timestamp, 1) ? error("values must match length of timestamp") :
        ncol != size(colnames, 1) ? error("column names must match width of array") :
        timestamp != unique(timestamp) ? error("there are duplicate dates") :
        !(flipud(timestamp) == sort(timestamp) || timestamp == sort(timestamp)) ? error("dates are mangled") :
        flipud(timestamp) == sort(timestamp) ?
            new(flipud(timestamp), flipud(values), colnames) :
            new(timestamp, values, colnames)
    end
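For readers on a current Julia, the same invariants can be sketched as below. This is an assumption-laden rewrite, not the package's code: `CheckedTimeArray` is a hypothetical name, `Date` comes from the `Dates` standard library rather than the Datetime package, and values are restricted to a `Matrix` to keep the row flip simple.

```julia
using Dates

# Hypothetical type mirroring the checks in the constructor above.
struct CheckedTimeArray{T}
    timestamp::Vector{Date}
    values::Matrix{T}
    colnames::Vector{String}

    function CheckedTimeArray(timestamp::Vector{Date}, values::Matrix{T},
                              colnames::Vector{String}) where {T}
        nrow, ncol = size(values)
        nrow == length(timestamp) || error("values must match length of timestamp")
        ncol == length(colnames)  || error("column names must match width of array")
        allunique(timestamp)      || error("there are duplicate dates")
        if issorted(timestamp)
            return new{T}(timestamp, values, colnames)
        elseif issorted(timestamp, rev=true)
            # Descending input: flip the rows so the stored order is ascending.
            return new{T}(reverse(timestamp), values[end:-1:1, :], colnames)
        else
            error("dates are mangled")
        end
    end
end

# Descending input is accepted and stored oldest-first.
ta = CheckedTimeArray([Date(1980, 1, 4), Date(1980, 1, 3)],
                      reshape([106.52, 105.22], 2, 1), ["Close"])
```

Using `issorted` and `allunique` avoids the extra sorted copies that comparing against `sort(timestamp)` allocates.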
Not sure about the odds on making profit ;) but to get the day after thanksgiving, you can just do
    recur(Date(2000,1,1),Date(2015,1,1)) do x
        month(x) == November &&
        dayofweek(x) == Friday &&
        dayofweekofmonth(x) == 4
    end
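That is, we want the 4th Friday in November from 2000-01-01 through 2015-01-01 (given that Thanksgiving is the 4th Thursday of November). One caveat: in years where November 1 is a Friday (2002 and 2013 in this range), the day after Thanksgiving is the 5th Friday, so this filter picks a date one week early. Note that this format is the current implementation in the Base PR.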
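A sketch that computes the date directly as "fourth Thursday plus one day" instead of counting Fridays, written against the `Dates` standard library in current Julia rather than the `recur` API discussed here. The helper names `thanksgiving` and `day_after_thanksgiving` are hypothetical.

```julia
using Dates

# Hypothetical helper: the fourth Thursday of November in year y.
function thanksgiving(y::Integer)
    nov1 = Date(y, 11, 1)
    # Days from November 1 to the first Thursday (Dates.Thursday == 4).
    offset = mod(Dates.Thursday - dayofweek(nov1), 7)
    return nov1 + Day(offset + 21)   # three weeks after the first Thursday
end

day_after_thanksgiving(y::Integer) = thanksgiving(y) + Day(1)

# Same span as the recur example: 2000-01-01 through 2015-01-01.
black_fridays = [day_after_thanksgiving(y) for y in 2000:2014]
```

In 2002 this yields 2002-11-29, which is the fifth Friday of that month.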
I really like that API for sharpshooting a time slot. I am getting an error though. Is that because I'm using the current Datetime package and not the PR?
julia> recur(Date(2000,1,1),Date(2015,1,1)) do x
           month(x) == November &&
           dayofweek(x) == Friday &&
           dayofweekofmonth(x) == 4
       end
ERROR: type cannot be constructed
Yeah, ranges aren't implemented yet in the Base PR (due to the rangeopocalypse). The new recur will just return an array of Dates anyway instead of a range object. In the current Datetime, you can do
    t = [recur(date(2000,1,1):date(2015,1,1)) do x
        month(x) == November &&
        dayofweek(x) == Friday &&
        dayofweekinmonth(x) == 4
    end]
To achieve the goal of containing NAs, it might be best to look into DataArrays. I haven't completely grokked why the packages are separate, but a Google Groups reply by @johnmyleswhite on julia-users seems to shed some light:
And the introduction to DataFrames specifically mentions that DataArray is an efficient Julian array, but with NAs. This suggests the following definition for TimeArray:
    immutable TimeArray{T,N}
        timestamp::Vector{Date{ISOCalendar}}
        values::DataArray{T,N}
        # batteries included for NAs and column names with DataArray
        # inner constructor to enforce obvious invariants (lengths of elements are equal, etc)
    end
TimeVector is an alias for a TimeArray whose value is a DataVector (similar to Pandas Series) and TimeMatrix is an alias for TimeArray whose value is a DataMatrix.
This means throwing most of my hacking code in the trash, which is fine with me. At this point, I'm ready to abandon the SeriesPair idea. It adds a layer of complexity and is slower than the newest TimeArray brainstorm. I'm still interested in going down the original TimeArray path that the TimeArrays.jl package implements.
Thoughts?
I can start a branch in TimeSeries named DataArray (I'd like to reserve TimeArray for the time being since I still have a package by that name). I'll set up the skeleton and stub out the tests. I'm using the FactCheck package lately (it's a little rspec and a little testthat), and anyone can make intuitive sense of the workflow there. If not, we can open an issue to discuss it.
If someone else wants to take on this branch idea, please do! I can spend a little more time on the original TimeArray data structure in the meantime. I'll start by opening an issue on TimeSeries.
@karbarcca aha, that's what I was missing. I really like that it returns an Array{Date{ISOCalendar},1}, which should make subsetting time series data intuitive.
I think leveraging the DataArray package is an excellent approach. It should be a solid, efficient package ready to be leveraged for exactly this kind of thing.
Yeah, returning a range object was a little weird. I think we can still find an efficient solution for iteration too. I'm enjoying seeing the progress of time series code. I wish I had more time to dive in, but school's a little crazy right now. I'll try to poke around when I can.
Okay, jump in any time you'd like @karbarcca. I'll give you commit access.
Yeah, that's a good idea, but I think you should remain more general. TimeArray should support standard Arrays too, for cases where you do not want NAs and want higher performance.
But more importantly, I still believe you should be able to associate variables of different types to a date, i.e. like a DataFrame.
There might be room for both a TimeArray and a TimeDataArray, if the performance difference is indeed noticeable. The other type that leverages the database structure of DataFrames could be called TimeFrame.
These three types mimic what @cgroll is doing in the TimeData package with Timedata, Timenum and Timematr.
Yeah, but you probably don't care what the actual type of the underlying array is. I guess you could do everything you need with only the AbstractArray interface, so you can support both Arrays and DataArrays in the same structure.
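One way to picture that suggestion, as a sketch in current Julia under stated assumptions: `GenericTimeArray` and `colmeans` are hypothetical names, and an array with `Union{Missing,Float64}` elements stands in for a DataArray with NAs, since DataArrays is not assumed to be installed.

```julia
using Dates, Statistics

# Hypothetical container parameterised on the storage type, so a plain Array
# and a missing-aware array share one structure.
struct GenericTimeArray{T,N,A<:AbstractArray{T,N}}
    timestamp::Vector{Date}
    values::A
end

# Written only against the AbstractArray interface; skipmissing is a no-op
# for storage that cannot hold missing values.
colmeans(ta::GenericTimeArray) = [mean(skipmissing(c)) for c in eachcol(ta.values)]

dates = [Date(1980, 1, 3), Date(1980, 1, 4)]
dense = GenericTimeArray(dates, [1.0 2.0; 3.0 4.0])
gappy = GenericTimeArray(dates, Union{Missing,Float64}[1.0 missing; 3.0 4.0])

colmeans(dense)   # plain Matrix{Float64} storage
colmeans(gappy)   # storage that can hold missing values
```

Because the storage type is a type parameter, code specialised for `Matrix{Float64}` pays nothing for the missing-value support.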
I'm closing this for now. Let's continue discussing these issues at the TimeSeries repo issues page. Thanks for all the input so far!
A time-series type (let's call it TimeArray) should have the following features:
- [ ] operations along columns should be fast
- [ ] any column that has missing data for a given date should represent it with NA => not being supported near term, but an explicit solution is to use DataArray for values in a TimeArray structure.
And the following behavior:
ta[[date(1980,1,1):date(1980,1,31)]]
ta[Tuesday]
ta["price_range"] = ta["high"] - ta["low"]
# returns ta with new column named price_range
ta["log_returns"] = percentchange(ta["Close"], method=log)
# returns ta with new column named log_returns
price_range = TimeArray(ta["high"] - ta["low"], colname="price_range")
NOTE: this list is not static and I'll add to it.
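For the log_returns item, the underlying arithmetic can be sketched on a plain vector in current Julia. This only illustrates what a log-based percent change computes and is not the `percentchange` implementation; the closing prices are the first four from the sample output in this thread.

```julia
# Closes for 1980-01-03, 1980-01-04, 1980-01-07 and 1980-01-08.
closes = [105.22, 106.52, 106.81, 108.95]

# Log return for each row relative to the previous row: log(p[t] / p[t-1]).
log_returns = diff(log.(closes))

# Simple (arithmetic) returns, for comparison.
simple_returns = diff(closes) ./ closes[1:end-1]
```

The result has one element fewer than the input, so a column built this way has no value for the first timestamp.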