JuliaArrays / AxisArrays.jl

Performant arrays where each dimension can have a named axis with values
http://JuliaArrays.github.io/AxisArrays.jl/latest/
Other
200 stars 42 forks source link

TimeArray meets AxisArray #8

Closed milktrader closed 9 years ago

milktrader commented 9 years ago
julia> using MarketData

julia> Apple = AxisArray(ohlc[1:12].values, (ohlc[1:12].timestamp, colnames(ohlc)))
12x4 AxisArrays.AxisArray{Float64,2,Array{Float64,2},(:row,:col),(Array{Base.Dates.Date,1},Array{UTF8String,1})}:
 104.88  112.5   101.69  111.94
 108.25  110.62  101.19  102.5 
 103.75  110.56  103.0   104.0 
 106.12  107.0    95.0    95.0 
  96.5   101.0    95.5    99.5 
 102.0   102.25   94.75   97.75
  95.94   99.38   90.5    92.75
  95.0    95.5    86.5    87.19
  94.48   98.75   92.5    96.75
 100.0   102.25   99.38  100.44
 101.0   106.0   100.44  103.94
 105.62  108.75  103.38  106.56

julia> Apple.axes[1]
12-element Array{Base.Dates.Date,1}:
 2000-01-03
 2000-01-04
 2000-01-05
 2000-01-06
 2000-01-07
 2000-01-10
 2000-01-11
 2000-01-12
 2000-01-13
 2000-01-14
 2000-01-18
 2000-01-19

julia> Apple.axes[2]
4-element Array{UTF8String,1}:
 "Open" 
 "High" 
 "Low"  
 "Close"
mbauman commented 9 years ago

Awesome. What do you think? What are your thoughts on interval selection of rows based upon the date? A big difference from TimeSeries is that we currently only support indexing by Interval(Date(), Date()) to select the dates within a range… and I think if we add support for StepRanges (Date():Date()), I would like to throw an error if there's a day missing.

tshort commented 9 years ago

Cool. Plotting works:

using Gadfly      # bunch of warnings...
plot(Apple, x = :row, y = :data, color = :col)
plot(Apple, x = :row, y = :data, ygroup = :col, Geom.subplot_grid(Geom.line))
milktrader commented 9 years ago

Interesting about wanting to throw an error if dates are missing. Irregular time series are the norm in finance-type series

julia> ohlc[Date(2000,1,1):Date(2000,1,10)]
6x4 TimeSeries.TimeArray{Float64,2,DataType} 2000-01-03 to 2000-01-10

             Open      High      Low       Close     
2000-01-03 | 104.88    112.5     101.69    111.94    
2000-01-04 | 108.25    110.62    101.19    102.5     
2000-01-05 | 103.75    110.56    103.0     104.0     
2000-01-06 | 106.12    107.0     95.0      95.0      
2000-01-07 | 96.5      101.0     95.5      99.5      
2000-01-10 | 102.0     102.25    94.75     97.75     

As you can see, the market was closed on 2000-01-08 and 2000-01-09. I suppose in a domain where you're observing a naturally occurring event, something happens in every evenly-spaced time segment.

mbauman commented 9 years ago

Yup, exactly. That's why I asked :). It's very unfortunate that there's not a nicer built-in syntax for intervals. I want to start conservatively with the types of custom indexing that we support, and I think that silently dropping values from a range is a little too magical for general multi-disciplinary use. Here's why:

# I think the following should be invariants:
A[1:2] == A[collect(1:2)] == A[[1,2]] == [A[1]; A[2]]
A[today():tomorrow()] == A[collect(today():tomorrow())] == [A[today()]; A[tomorrow()]]

We'll see what others think once we register the package and we get more eyeballs.

Edit: I suppose we could return an empty array when indexing by values (even scalar values) that aren't there. That would make the above consistent.

milktrader commented 9 years ago

The merge method is painfully slow in TimeSeries. Here is the TimeArray demo. The CAT20 object is the 20-period moving average and is 19 rows shorter than the original CAT object. The merge lines up the dates and results in an object whose length matches the shorter one (CAT20), which is 13,071 rows.

julia> using AxisArrays, MarketData

julia> CAT20 = moving(CAT["Close"],mean,20);

julia> @elapsed merge(CAT20,CAT)
1.310317264

julia> @elapsed merge(CAT20,CAT)
0.566255459

julia> @elapsed merge(CAT20,CAT)
0.56215796

Next up, trying this with AxisArrays.

milktrader commented 9 years ago

Oops,

julia> AxCAT = AxisArray(CAT.values,(CAT.timestamp, CAT.colnames));

julia> AxCAT20 = AxisArray(CAT20.values,(CAT20.timestamp, CAT20.colnames));
ERROR: there may not be more axes than dimensions of data
 in call at /Users/Administrator/.julia/v0.4/AxisArrays/src/core.jl:93
 in call at /Users/Administrator/.julia/v0.4/AxisArrays/src/core.jl:104
milktrader commented 9 years ago

For now, I'll just leave out the column name and see about merging these two together ...

julia> AxCAT20 = AxisArray(CAT20.values,(CAT20.timestamp,));

Took me a couple tries to figure out the tuple of length one trick :smile:

milktrader commented 9 years ago

Starting all over on this demo ...

TimeSeries merge method

julia> using MarketData

## AAPL has 8300+ rows and BA has 13000+ rows
julia> @elapsed merge(AAPL,BA)
0.608618657

julia> @elapsed merge(AAPL,BA)
0.465285163

julia> @elapsed merge(AAPL,BA)
0.471265437

TimeSeries merge1 method (about one hour old, btw)

julia> using MarketData

# merge1 is currently only available in a branch named merge_faster
julia> @elapsed merge1(AAPL,BA)
0.121858314

julia> @elapsed merge1(AAPL,BA)
0.106935782

julia> @elapsed merge1(AAPL,BA)
0.116042072

Merging with AxisArrays

julia> using MarketData, AxisArrays

julia> AxCAT = AxisArray(CAT.values,(CAT.timestamp,CAT.colnames)); #13000+ rows

julia> AxAAPL = AxisArray(AAPL.values,(AAPL.timestamp,AAPL.colnames)); #8300+ rows

julia> function m(a1::AxisArray, a2::AxisArray)
           tstamp = intersect(a1.axes[1], a2.axes[1])
           tt1    = findin(a1.axes[1], tstamp)
           tt2    = findin(a2.axes[1], tstamp)
           vals1  = a1[Axis{:row}(tt1)].data
           vals2  = a2[Axis{:row}(tt2)].data
           vals   = hcat(vals1,vals2)
           cnam1  = [a1.axes[2][a] * "_1" for a in 1:size(a1.axes[2],1)]
           cnam2  = [a2.axes[2][a] * "_2" for a in 1:size(a2.axes[2],1)]
           cnames = vcat(cnam1, cnam2)
           AxisArray(vals,(tstamp, cnames))
       end
m (generic function with 1 method)

julia> @elapsed m(AxCAT, AxAAPL)
1.132784239    # not sure why this compilation takes so long

julia> @elapsed m(AxCAT, AxAAPL)
0.092879273

julia> @elapsed m(AxCAT, AxAAPL)
0.094113886

julia> @elapsed m(AxCAT, AxAAPL)
0.093409008
milktrader commented 9 years ago

Here is where merge1 lives https://github.com/JuliaStats/TimeSeries.jl/pull/171

mbauman commented 9 years ago

Very cool. I think a merge function like that should definitely live in AxisArrays. I'm glad to see that a simple first-pass solution here is already quite performant! We'll have to do some thinking about what column (or axis) renaming means in a more general sense, but I like where this is headed.

Another option would be to return a DataArray with NAs for the missing (non-overlapping) axis values.

milktrader commented 9 years ago

When I started to think about an AxisArray merge method I discovered some things that allowed me to go back to the TimeArray merge method, refactor and gain 4-5x speed improvement with 75% less code. I'll keep tabs on improvements made here and maybe get some even better performance.

tshort commented 9 years ago

With a small tweak, you could have your merge add a dimension:

function m(a1::AxisArray, a2::AxisArray)
    tstamp = intersect(a1.axes[1], a2.axes[1])
    tt1    = findin(a1.axes[1], tstamp)
    tt2    = findin(a2.axes[1], tstamp)
    vals1  = a1[Axis{:row}(tt1)].data
    vals2  = a2[Axis{:row}(tt2)].data
    vals   = reshape(hcat(vals1,vals2), size(vals1)..., 2)
    AxisArray(vals,(tstamp, a1.axes[2], ["stock1","stock2"]))
end
a = m(AxCAT, AxAAPL)
using Gadfly
plot(a[:,2:3,:], x = :row, y = :data, color = :col, ygroup = :page, Geom.subplot_grid(Geom.line))

Gadfly is pretty slow here, but it's flexible.

tshort commented 9 years ago

Here's a faster merge with sorted axes:

function overlaps(ordered1, ordered2)
    i = j = 1
    idx1 = Int[]
    idx2 = Int[]
    while i < length(ordered1) && j < length(ordered2)
        if ordered1[i] > ordered2[j]
            j += 1
        elseif ordered1[j] > ordered2[i]
            i += 1
        else
            push!(idx1, i)
            push!(idx2, j)
            i += 1
            j += 1
        end
    end
    (idx1, idx2)        
end

function m(a1::AxisArray, a2::AxisArray)
    idx1, idx2 = overlaps(a1.axes[1], a2.axes[1])
    vals1  = a1[Axis{:row}(idx1)].data
    vals2  = a2[Axis{:row}(idx2)].data
    vals   = reshape(hcat(vals1,vals2), size(vals1)..., 2)
    AxisArray(vals,(a1.axes[1][idx1], a1.axes[2], ["stock1","stock2"]))
end

@elapsed m(AxCAT, AxAAPL)    # 0.004938341
tshort commented 9 years ago

Bug fix in the code above:

function overlaps(ordered1, ordered2)
    i = j = 1
    idx1 = Int[]
    idx2 = Int[]
    while i < length(ordered1) && j < length(ordered2)
        if ordered1[i] > ordered2[j]
            j += 1
        elseif ordered1[i] < ordered2[j]
            i += 1
        else
            push!(idx1, i)
            push!(idx2, j)
            i += 1
            j += 1
        end
    end
    (idx1, idx2)        
end
tshort commented 9 years ago

We probably want to merge on both axes by default. We'd need another version of overlaps for Categorical axes.

milktrader commented 9 years ago

Whoa, that is an impressive speedup! I would not have guessed that push! ing into an empty array would be fast.

milktrader commented 9 years ago

An OBO error in overlaps

while i < length(ordered1) + 1 && j < length(ordered2) + 1

I would've expected that the merged object would equal the size of AAPL, which is 8336, but the correct size of the merged object (from an inner join) is actually 8335. The NYSE was closed on Sept 27,1985 due to Hurricane Gloria, but apparently the NASDAQ stayed open, being an electronic exchange. So since there is no trade data for Caterpillar on that day, the resulting length of the merge should be 8335. Both datasets start and finish on the same day.

I found the missing day with the following ...

julia> apple, caterpillar = overlaps(AAPL.timestamp, CAT.timestamp);

# this result affects the resulting length of the merged object
julia> find([apple[i] + 1 != apple[i+1] for i in 1:length(apple)-1])
1-element Array{Int64,1}:
 1210

# this result is just a curiosity 
julia> find([caterpillar[i] + 1 != caterpillar[i+1] for i in 1:length(caterpillar)-1])
2-element Array{Int64,1}:
 165
 702

julia> AAPL.timestamp[1211]
1985-09-27

julia> CAT[Date(1985,9,27)]

# nothing output

Mostly jotting down this forensics paper trail because I'm going to forget how I did this.

milktrader commented 9 years ago

@tshort thanks so much for that faster merge implementation. I've adapted it for TimeSeries. The performance is better than 12x the original merge and about 5x faster than my just-designed merge1.

Next up ... moving averages comparing the expensive moving method in TimeSeries vs an AxisArrays interval approach.

milktrader commented 9 years ago

Some timings for moving averages

julia> using MarketData, AxisArrays

julia> C = CAT["Close"];  # Single column TimeArray with 13,000+ rows

julia> Cx = AxisArray(C.values, (C.timestamp,));

julia> function movings(a::AxisArray, f::Function, n::Int)
           m = length(a)-n+1
           res = zeros(m)
           for i in 1:m
               res[i] = f(a[Interval(a.axes[1][i],a.axes[1][i+n-1])])
           end
           AxisArray(res, (a.axes[1][n:end],))
       end

julia> @elapsed movings(Cx,mean,20)
0.085827402

julia> @elapsed movings(Cx,mean,20)
0.039679677

julia> @elapsed movings(Cx,mean,20)
0.056376047

julia> @elapsed moving(C,mean,20)
0.012630849

julia> @elapsed moving(C,mean,20)
0.026267102

julia> @elapsed moving(C,mean,20)
0.011673069
tshort commented 9 years ago

Thanks for doing this--great use cases. I wonder if the slowdown for AxisArrays is from the ordering check (Matt mentioned that in the roadmap issue).

milktrader commented 9 years ago

To make it easier for others to play along, I'm planning to add TimeSeries/timeseries.jl to this repo. This will include some const objects that will be easier to use than github issues cut-and-paste.

Once the experimentation is complete, it will be easy to remove this directory. Also, it will not load without an explicit using AxisArrays.TimeSeries so those not interested in participating will not be compelled to do so.

This magic is named nested modules and I've always wanted to try it. Planning next few days without objections.

mbauman commented 9 years ago

It may make more sense to just work on AxisArrays directly without the submodule. If you do it in a branch then others can play along without you needing to worry about breaking anything. I think your use-cases can be nicely expressed as tests (e.g., create a TimeSeries test file that contains some sample data and tests the operations you want). And then we can work on creating merge and moving average implementations directly in the source tree.

Having concrete examples like this is a great way to drive the development!

milktrader commented 9 years ago

Okay, I'll kill the PR and just leave it a branch and begin some method implementations in source. Just to clarify though, I didn't plan a git submodule approach (which I suspect many git users myself included don't fully understand) but rather a Julia nested module.

In any case, it's a better idea to work in a branch as you suggested.

milktrader commented 9 years ago

Closing this issue and working in the timeseries branch for now.