Closed milktrader closed 9 years ago
Awesome. What do you think? What are your thoughts on interval selection of rows based upon the date? A big difference from TimeSeries is that we currently only support indexing by Interval(Date(), Date())
to select the dates within a range… and I think if we add support for StepRanges (Date():Date()
), I would like to throw an error if there's a day missing.
Cool. Plotting works:
using Gadfly # bunch of warnings...
plot(Apple, x = :row, y = :data, color = :col)
plot(Apple, x = :row, y = :data, ygroup = :col, Geom.subplot_grid(Geom.line))
Interesting about wanting to throw an error if dates are missing. Irregular time series are the norm in finance-type series
julia> ohlc[Date(2000,1,1):Date(2000,1,10)]
6x4 TimeSeries.TimeArray{Float64,2,DataType} 2000-01-03 to 2000-01-10
Open High Low Close
2000-01-03 | 104.88 112.5 101.69 111.94
2000-01-04 | 108.25 110.62 101.19 102.5
2000-01-05 | 103.75 110.56 103.0 104.0
2000-01-06 | 106.12 107.0 95.0 95.0
2000-01-07 | 96.5 101.0 95.5 99.5
2000-01-10 | 102.0 102.25 94.75 97.75
As you can see, the market was closed on 2000-01-08 and 2000-01-09. I suppose in a domain where you're observing a naturally occurring event, something happens in every evenly-spaced time segment.
Yup, exactly. That's why I asked :). It's very unfortunate that there's not a nicer built-in syntax for intervals. I want to start conservatively with the types of custom indexing that we support, and I think that silently dropping values from a range is a little too magical for general multi-disciplinary use. Here's why:
# I think the following should be invariants:
A[1:2] == A[collect(1:2)] == A[[1,2]] == [A[1]; A[2]]
A[today():tomorrow()] == A[collect(today():tomorrow())] == [A[today()]; A[tomorrow()]]
We'll see what others think once we register the package and we get more eyeballs.
Edit: I suppose we could return an empty array when indexing by values (even scalar values) that aren't there. That would make the above consistent.
The merge
method is painfully slow in TimeSeries. Here is the TimeArray demo. The CAT20
object is the 20-period moving average and is 19 rows shorter than the original CAT
object. The merge lines up the dates and results in an object whose length matches the shorter one (CAT20
), which is 13,071 rows.
julia> using AxisArrays, MarketData
julia> CAT20 = moving(CAT["Close"],mean,20);
julia> @elapsed merge(CAT20,CAT)
1.310317264
julia> @elapsed merge(CAT20,CAT)
0.566255459
julia> @elapsed merge(CAT20,CAT)
0.56215796
Next up, trying this with AxisArrays.
Oops,
julia> AxCAT = AxisArray(CAT.values,(CAT.timestamp, CAT.colnames));
julia> AxCAT20 = AxisArray(CAT20.values,(CAT20.timestamp, CAT20.colnames));
ERROR: there may not be more axes than dimensions of data
in call at /Users/Administrator/.julia/v0.4/AxisArrays/src/core.jl:93
in call at /Users/Administrator/.julia/v0.4/AxisArrays/src/core.jl:104
For now, I'll just leave out the column name and see about merging these two together ...
julia> AxCAT20 = AxisArray(CAT20.values,(CAT20.timestamp,));
Took me a couple tries to figure out the tuple of length one trick :smile:
Starting all over on this demo ...
merge
methodjulia> using MarketData
## AAPL has 8300+ rows and BA has 13000+ rows
julia> @elapsed merge(AAPL,BA)
0.608618657
julia> @elapsed merge(AAPL,BA)
0.465285163
julia> @elapsed merge(AAPL,BA)
0.471265437
merge1
method (about one hour old, btw)julia> using MarketData
# merge1 is currently only available in a branch named merge_faster
julia> @elapsed merge1(AAPL,BA)
0.121858314
julia> @elapsed merge1(AAPL,BA)
0.106935782
julia> @elapsed merge1(AAPL,BA)
0.116042072
julia> using MarketData, AxisArrays
julia> AxCAT = AxisArray(CAT.values,(CAT.timestamp,CAT.colnames)); #13000+ rows
julia> AxAAPL = AxisArray(AAPL.values,(AAPL.timestamp,AAPL.colnames)); #8300+ rows
julia> function m(a1::AxisArray, a2::AxisArray)
tstamp = intersect(a1.axes[1], a2.axes[1])
tt1 = findin(a1.axes[1], tstamp)
tt2 = findin(a2.axes[1], tstamp)
vals1 = a1[Axis{:row}(tt1)].data
vals2 = a2[Axis{:row}(tt2)].data
vals = hcat(vals1,vals2)
cnam1 = [a1.axes[2][a] * "_1" for a in 1:size(a1.axes[2],1)]
cnam2 = [a2.axes[2][a] * "_2" for a in 1:size(a2.axes[2],1)]
cnames = vcat(cnam1, cnam2)
AxisArray(vals,(tstamp, cnames))
end
m (generic function with 1 method)
julia> @elapsed m(AxCAT, AxAAPL)
1.132784239 # not sure why this compilation takes so long
julia> @elapsed m(AxCAT, AxAAPL)
0.092879273
julia> @elapsed m(AxCAT, AxAAPL)
0.094113886
julia> @elapsed m(AxCAT, AxAAPL)
0.093409008
Here is where merge1
lives https://github.com/JuliaStats/TimeSeries.jl/pull/171
Very cool. I think a merge function like that should definitely live in AxisArrays. I'm glad to see that a simple first-pass solution here is already quite performant! We'll have to do some thinking about what column (or axis) renaming means in a more general sense, but I like where this is headed.
Another option would be to return a DataArray with NAs for the missing (non-overlapping) axis values.
When I started to think about an AxisArray merge
method I discovered some things that allowed me to go back to the TimeArray merge
method, refactor and gain 4-5x speed improvement with 75% less code. I'll keep tabs on improvements made here and maybe get some even better performance.
With a small tweak, you could have your merge add a dimension:
function m(a1::AxisArray, a2::AxisArray)
tstamp = intersect(a1.axes[1], a2.axes[1])
tt1 = findin(a1.axes[1], tstamp)
tt2 = findin(a2.axes[1], tstamp)
vals1 = a1[Axis{:row}(tt1)].data
vals2 = a2[Axis{:row}(tt2)].data
vals = reshape(hcat(vals1,vals2), size(vals1)..., 2)
AxisArray(vals,(tstamp, a1.axes[2], ["stock1","stock2"]))
end
a = m(AxCAT, AxAAPL)
using Gadfly
plot(a[:,2:3,:], x = :row, y = :data, color = :col, ygroup = :page, Geom.subplot_grid(Geom.line))
Gadfly is pretty slow here, but it's flexible.
Here's a faster merge with sorted axes:
function overlaps(ordered1, ordered2)
i = j = 1
idx1 = Int[]
idx2 = Int[]
while i < length(ordered1) && j < length(ordered2)
if ordered1[i] > ordered2[j]
j += 1
elseif ordered1[j] > ordered2[i]
i += 1
else
push!(idx1, i)
push!(idx2, j)
i += 1
j += 1
end
end
(idx1, idx2)
end
function m(a1::AxisArray, a2::AxisArray)
idx1, idx2 = overlaps(a1.axes[1], a2.axes[1])
vals1 = a1[Axis{:row}(idx1)].data
vals2 = a2[Axis{:row}(idx2)].data
vals = reshape(hcat(vals1,vals2), size(vals1)..., 2)
AxisArray(vals,(a1.axes[1][idx1], a1.axes[2], ["stock1","stock2"]))
end
@elapsed m(AxCAT, AxAAPL) # 0.004938341
Bug fix in the code above:
function overlaps(ordered1, ordered2)
i = j = 1
idx1 = Int[]
idx2 = Int[]
while i < length(ordered1) && j < length(ordered2)
if ordered1[i] > ordered2[j]
j += 1
elseif ordered1[i] < ordered2[j]
i += 1
else
push!(idx1, i)
push!(idx2, j)
i += 1
j += 1
end
end
(idx1, idx2)
end
We probably want to merge on both axes by default. We'd need another version of overlaps
for Categorical axes.
Whoa, that is an impressive speedup! I would not have guessed that push!
ing into an empty array would be fast.
An OBO error in overlaps
while i < length(ordered1) + 1 && j < length(ordered2) + 1
I would've expected that the merged object would equal the size of AAPL, which is 8336, but the correct size of the merged object (from an inner join) is actually 8335. The NYSE was closed on Sept 27,1985 due to Hurricane Gloria, but apparently the NASDAQ stayed open, being an electronic exchange. So since there is no trade data for Caterpillar on that day, the resulting length of the merge should be 8335. Both datasets start and finish on the same day.
I found the missing day with the following ...
julia> apple, caterpillar = overlaps(AAPL.timestamp, CAT.timestamp);
# this result affects the resulting length of the merged object
julia> find([apple[i] + 1 != apple[i+1] for i in 1:length(apple)-1])
1-element Array{Int64,1}:
1210
# this result is just a curiosity
julia> find([caterpillar[i] + 1 != caterpillar[i+1] for i in 1:length(caterpillar)-1])
2-element Array{Int64,1}:
165
702
julia> AAPL.timestamp[1211]
1985-09-27
julia> CAT[Date(1985,9,27)]
# nothing output
Mostly jotting down this forensics paper trail because I'm going to forget how I did this.
@tshort thanks so much for that faster merge implementation. I've adapted it for TimeSeries. The performance is better than 12x the original merge
and about 5x faster than my just-designed merge1
.
Next up ... moving averages comparing the expensive moving
method in TimeSeries vs an AxisArrays interval
approach.
Some timings for moving averages
julia> using MarketData, AxisArrays
julia> C = CAT["Close"]; # Single column TimeArray with 13,000+ rows
julia> Cx = AxisArray(C.values, (C.timestamp,));
julia> function movings(a::AxisArray, f::Function, n::Int)
m = length(a)-n+1
res = zeros(m)
for i in 1:m
res[i] = f(a[Interval(a.axes[1][i],a.axes[1][i+n-1])])
end
AxisArray(res, (a.axes[1][n:end],))
end
julia> @elapsed movings(Cx,mean,20)
0.085827402
julia> @elapsed movings(Cx,mean,20)
0.039679677
julia> @elapsed movings(Cx,mean,20)
0.056376047
julia> @elapsed moving(C,mean,20)
0.012630849
julia> @elapsed moving(C,mean,20)
0.026267102
julia> @elapsed moving(C,mean,20)
0.011673069
Thanks for doing this--great use cases. I wonder if the slowdown for AxisArrays is from the ordering check (Matt mentioned that in the roadmap issue).
To make it easier for others to play along, I'm planning to add TimeSeries/timeseries.jl
to this repo. This will include some const
objects that will be easier to use than github issues cut-and-paste.
Once the experimentation is complete, it will be easy to remove this directory. Also, it will not load without an explicit using AxisArrays.TimeSeries
so those not interested in participating will not be compelled to do so.
This magic is named nested modules and I've always wanted to try it. Planning next few days without objections.
It may make more sense to just work on AxisArrays directly without the submodule. If you do it in a branch then others can play along without you needing to worry about breaking anything. I think your use-cases can be nicely expressed as tests (e.g., create a TimeSeries test file that contains some sample data and tests the operations you want). And then we can work on creating merge
and moving average implementations directly in the source tree.
Having concrete examples like this is a great way to drive the development!
Okay, I'll kill the PR and just leave it a branch and begin some method implementations in source. Just to clarify though, I didn't plan a git submodule approach (which I suspect many git users myself included don't fully understand) but rather a Julia nested module.
In any case, it's a better idea to work in a branch as you suggested.
Closing this issue and working in the timeseries
branch for now.