JuliaStats / TimeSeries.jl

Time series toolkit for Julia
Other
353 stars 69 forks source link

Resampling time series #257

Closed femtotrader closed 8 years ago

femtotrader commented 8 years ago

Hello,

Resampling time series with Julia could be a nice feature. I noticed this question on SO http://stackoverflow.com/questions/20816813/resampling-a-dataframe-to-hourly-15min-and-5min-periods-in-julia

Maybe someone here can answer it ?

Kind regards

GordStephen commented 8 years ago

We're in the midst of changing its API (https://github.com/JuliaStats/TimeSeries.jl/pull/244), but the collapse method can be used for downsampling. It looks like that question also requires mapping grouped observations to multiple outputs, which we don't currently support - but it shouldn't be too hard to achieve by extending current functionality. I'll add a comment to that effect on SO.

femtotrader commented 8 years ago

Here is a Do-It-Yourself solution using the "Split-Apply-Combine" Strategy.

I assume you download sample data here https://drive.google.com/file/d/0B8iUtWjZOTqla3ZZTC1FS0pkZXc/view?usp=sharing

Unzip AUDUSD-2014-01.zip (a 11M zip file) and have now AUDUSD-2014-01.csv which is a 85M file

Processing such a file can be quite long, so you can process a shorter file.

julia> using DataFrames

julia> df = readtable("AUDUSD-2014-01.csv", header = false, names=[:Symb, :Date, :Bid, :Ask])  # Read CSV file

1947106x4 DataFrames.DataFrame
│ Row     │ Symb      │ Date                    │ Bid     │ Ask     │
┝━━━━━━━━━┿━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━┥
│ 1       │ "AUD/USD" │ "20140101 21:55:34.404" │ 0.88796 │ 0.88922 │
│ 2       │ "AUD/USD" │ "20140101 21:55:34.444" │ 0.88805 │ 0.88914 │
│ 3       │ "AUD/USD" │ "20140101 21:55:34.475" │ 0.88809 │ 0.8891  │
│ 4       │ "AUD/USD" │ "20140101 21:55:48.962" │ 0.88811 │ 0.88908 │
│ 5       │ "AUD/USD" │ "20140101 21:56:38.293" │ 0.88808 │ 0.88887 │
│ 6       │ "AUD/USD" │ "20140101 21:56:38.294" │ 0.88798 │ 0.88887 │
│ 7       │ "AUD/USD" │ "20140101 21:56:42.295" │ 0.88807 │ 0.88887 │
│ 8       │ "AUD/USD" │ "20140101 21:56:42.300" │ 0.88814 │ 0.88902 │
│ 9       │ "AUD/USD" │ "20140101 21:56:43.050" │ 0.88807 │ 0.88896 │
⋮
│ 1947097 │ "AUD/USD" │ "20140131 21:59:39.764" │ 0.87533 │ 0.87595 │
│ 1947098 │ "AUD/USD" │ "20140131 21:59:45.299" │ 0.87532 │ 0.87594 │
│ 1947099 │ "AUD/USD" │ "20140131 21:59:45.551" │ 0.87547 │ 0.87601 │
│ 1947100 │ "AUD/USD" │ "20140131 21:59:46.437" │ 0.87527 │ 0.87589 │
│ 1947101 │ "AUD/USD" │ "20140131 21:59:46.439" │ 0.87508 │ 0.87589 │
│ 1947102 │ "AUD/USD" │ "20140131 21:59:48.048" │ 0.87525 │ 0.87589 │
│ 1947103 │ "AUD/USD" │ "20140131 21:59:54.599" │ 0.87527 │ 0.87589 │
│ 1947104 │ "AUD/USD" │ "20140131 21:59:56.927" │ 0.87531 │ 0.87588 │
│ 1947105 │ "AUD/USD" │ "20140131 21:59:59.365" │ 0.87531 │ 0.87574 │
│ 1947106 │ "AUD/USD" │ "20140131 22:00:00.038" │ 0.87531 │ 0.87574 │

julia> df[:Date] = DateTime(df[:Date], "yyyymmdd HH:MM:SS.s");  # convert string to DateTime
julia> EPOCH = DateTime("1970-01-01");
julia> df[:UnixTimestamp] = map(td->Int(td), df[:Date]-EPOCH);  # convert DateTime to Unix timestamp

function roundMod(x::Int, y::Int)
    x - x % y
end

julia> Minutes = 1;
julia> df[:UnixTimestampRound] = map(x->roundMod(x, Minutes*60*1000), df[:UnixTimestamp])  # apply roundMod function along UnixTimestamp
julia> df[:DateRound] = map(ts->Base.Dates.Millisecond(ts) + EPOCH, df[:UnixTimestampRound])  # convert Unix Timestamp Round to DateTime
julia> dfMean = by(df, :DateRound, df -> mean(df[:Bid]))  # group by Unix Timestamp Round and calcule mean of Bid column
31019x2 DataFrames.DataFrame
│ Row   │ DateRound           │ x1       │
┝━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━┥
│ 1     │ 2014-01-01T21:55:00 │ 0.888052 │
│ 2     │ 2014-01-01T21:56:00 │ 0.88781  │
│ 3     │ 2014-01-01T21:57:00 │ 0.887792 │
│ 4     │ 2014-01-01T21:58:00 │ 0.887777 │
│ 5     │ 2014-01-01T21:59:00 │ 0.887855 │
│ 6     │ 2014-01-01T22:03:00 │ 0.888741 │
│ 7     │ 2014-01-01T22:04:00 │ 0.88876  │
│ 8     │ 2014-01-01T22:05:00 │ 0.888801 │
│ 9     │ 2014-01-01T22:06:00 │ 0.888891 │
⋮
│ 31010 │ 2014-01-31T21:51:00 │ 0.875116 │
│ 31011 │ 2014-01-31T21:52:00 │ 0.875452 │
│ 31012 │ 2014-01-31T21:53:00 │ 0.875187 │
│ 31013 │ 2014-01-31T21:54:00 │ 0.875183 │
│ 31014 │ 2014-01-31T21:55:00 │ 0.875123 │
│ 31015 │ 2014-01-31T21:56:00 │ 0.87514  │
│ 31016 │ 2014-01-31T21:57:00 │ 0.874941 │
│ 31017 │ 2014-01-31T21:58:00 │ 0.875151 │
│ 31018 │ 2014-01-31T21:59:00 │ 0.875318 │
│ 31019 │ 2014-01-31T22:00:00 │ 0.87531  │

Resampling using OHLC might also be possible... but I haven't been able to do it as

by(df, :DateRound, df -> min(df[:Bid])) 

raises

ERROR: MethodError: `min` has no method matching min(::DataArrays.DataArray{Float64,1})
Closest candidates are:
  min{T1<:Real,T2<:Real}(::AbstractArray{T1<:Real,N}, ::T2<:Real)
  min{T1<:Real,T2<:Real}(::AbstractArray{T1<:Real,N}, ::AbstractArray{T2<:Real,N})
  min(::Any, ::Any)
  ...
 in anonymous at none:1
 in map at /Users/scls/.julia/v0.4/DataFrames/src/groupeddataframe/grouping.jl:178
 in by at /Users/scls/.julia/v0.4/DataFrames/src/groupeddataframe/grouping.jl:303
GordStephen commented 8 years ago

by(df, :DateRound, df -> minimum(df[:Bid])) would probably work (min operates over multiple arguments as opposed to a single array argument)

femtotrader commented 8 years ago

so

by(df, :DateRound, df -> minimum(df[:Bid]))

to get "L" lowest value

and

by(df, :DateRound, df -> maximum(df[:Bid]))

to get "H" highest value

But how to get O (Open) and C (Close) ?

I try

by(df, :DateRound, df -> df[:Bid][1,:])

or

by(df, :DateRound, df -> df[:Bid][end,:])

but it doesn't work as I was expecting...

It raises

ERROR: ArgumentError: setindex!(::DataFrame, ...) only broadcasts scalars, not arrays
 in setindex! at /Users/scls/.julia/v0.4/DataFrames/src/dataframe/dataframe.jl:375
 in DataFrame at /Users/scls/.julia/v0.4/DataFrames/src/dataframe/dataframe.jl:104
 in wrap at /Users/scls/.julia/v0.4/DataFrames/src/groupeddataframe/grouping.jl:186
 in map at /Users/scls/.julia/v0.4/DataFrames/src/groupeddataframe/grouping.jl:178
 in by at /Users/scls/.julia/v0.4/DataFrames/src/groupeddataframe/grouping.jl:303
GordStephen commented 8 years ago

by(df, :DateRound, df -> first(df[:Bid])) and by(df, :DateRound, df -> last(df[:Bid])) should work - assuming open and close mean what I think they do?

femtotrader commented 8 years ago

Thanks... and assuming that data are chronological ordered!

GordStephen commented 8 years ago

Indeed, that's not guaranteed with DataFrames.

Ideally, using TimeSeries.jl one would just use collapse(data, hour, first, [first, maximum, minimum, last]). We don't currently support the vector of reduction functions, but I don't see why we couldn't in the future.

femtotrader commented 8 years ago

Sorry but I don't understand what "hour" is, I also don't understand why the 3rd argument is "first"

GordStephen commented 8 years ago

That's just the collapse function API - the second argument is a function that maps timestamps to a value that can be grouped on locally (in that example, the current hour), and the third argument is a function that reduces all of the timestamps of the grouped observations to a single output value (in that case, an array of all of the timestamps from a given hour are mapped to the first timestamp in the hour).

femtotrader commented 8 years ago

Here it's current minute in fact (not hour)

An other function on top of this will be great (like Pandas resample)

femtotrader commented 7 years ago

A much simpler API could be

Create a sample time series

idx = DateTime(2010,1,1):Minute(1):DateTime(2016,1,1)
N = length(idx)
y = rand(-1.0:0.01:1.0, N)
y = 100 + cumsum(y)
df = DataFrame(Date=idx, y=y)
ta = TimeArray(collect(idx),y,["y"])

Define timeframe

tf = Yearly(5, boundary=End)

or

tf = Yearly(5, boundary=Begin)

Downscaling using mean value

mean(resample(ta, tf))

Downscaling using OHLC value

ohlc(resample(ta, tf))

Downscaling using sum might also be required (to sum volume for example in OHLCV data)

sum(resample(ta, tf))

resample function need to return an object let's call it TimeArraySampler

femtotrader commented 7 years ago

Here is an example with TimeArrayResampler type, a very basic (TimeFrame type which should be improved) and ohlc , mean, sum functions

#using DataFrames
using TimeSeries
import Base: mean, sum

# Define types (TimeFrame, TimeArrayResampler) and functions (resample, ohlc, mean, sum)
type TimeFrame
    f_group::Function
end

type TimeArrayResampler
    ta::TimeArray
    tf::TimeFrame
end

function resample(ta, f_group)
    TimeArrayResampler(ta, f_group)
end

function ohlc(resampler::TimeArrayResampler)
    ta = resampler.ta
    f_group = resampler.tf.f_group
    ta_o = collapse(ta, f_group, first, first)
    ta_h = collapse(ta, f_group, first, maximum)
    ta_l = collapse(ta, f_group, first, minimum)
    ta_c = collapse(ta, f_group, first, last)
    a_ohlc = hcat(ta_o.values, ta_h.values, ta_l.values, ta_c.values)
    ta_ohlc = TimeArray(ta_o.timestamp, a_ohlc, ["Open", "High", "Low", "Close"])
end

function mean(resampler::TimeArrayResampler)
    collapse(resampler.ta, resampler.tf.f_group, first, mean)
end

function sum(resampler::TimeArrayResampler)
    collapse(resampler.ta, resampler.tf.f_group, first, sum)
end

# Define a sample timeseries (prices for example)
idx = DateTime(2010,1,1):Dates.Minute(1):DateTime(2011,1,1)
idx = idx[1:end-1]
N = length(idx)
y = rand(-1.0:0.01:1.0, N)
y = 1000 + cumsum(y)
#df = DataFrame(Date=idx, y=y)
ta = TimeArray(collect(idx), y, ["y"])
println("ta=")
println(ta)

# Define how datetime should be grouped (timeframe)
tf = TimeFrame(dt -> floor(dt, Dates.Minute(15)))

# resample using OHLC values
ta_ohlc = ohlc(resample(ta, tf))
println("ta_ohlc=")
println(ta_ohlc)

# resample using mean values
ta_mean = mean(resample(ta, tf))
println("ta_mean=")
println(ta_mean)

# Define an other sample timeseries (volume for example)
vol = rand(0:0.01:1.0, N)
ta_vol = TimeArray(collect(idx), vol, ["vol"])
println("ta_vol=")
println(ta_vol)

# resample using sum values
ta_vol_sum = sum(resample(ta_vol, tf))
println("ta_vol_sum=")
println(ta_vol_sum)
femtotrader commented 7 years ago

Will you accept a PR that depends on TimeFrames.jl and that define resample, mean, ohlc, sum functions and some tests

or should I create an other project TimeSeriesResampler ?

femtotrader commented 7 years ago

resample.jl

using TimeFrames: TimeFrame, dt_grouper
import Base: mean, sum

type TimeArrayResampler
    ta::TimeArray
    tf::TimeFrame
end

function resample(ta::TimeArray, s_tf::String)
    resample(ta, TimeFrame(s_tf))
end

function resample(ta::TimeArray, tf::TimeFrame)
    TimeArrayResampler(ta, tf)
end

function ohlc(resampler::TimeArrayResampler)
    ta = resampler.ta
    f_group = dt_grouper(resampler.tf)
    ta_o = collapse(ta, f_group, first, first)
    ta_h = collapse(ta, f_group, first, maximum)
    ta_l = collapse(ta, f_group, first, minimum)
    ta_c = collapse(ta, f_group, first, last)
    a_ohlc = hcat(ta_o.values, ta_h.values, ta_l.values, ta_c.values)
    ta_ohlc = TimeArray(ta_o.timestamp, a_ohlc, ["Open", "High", "Low", "Close"])
end

function mean(resampler::TimeArrayResampler)
    collapse(resampler.ta, dt_grouper(resampler.tf), first, mean)
end

function sum(resampler::TimeArrayResampler)
    collapse(resampler.ta, dt_grouper(resampler.tf), first, sum)
end

and tests

using FactCheck
FactCheck.setstyle(:compact)
FactCheck.onlystats(true)

using TimeSeries: TimeArray, resample, ohlc, mean, sum
using TimeFrames: TimeFrame, Minutely

function variation(a; n=1)
    a[1+n:end] - a[1:end-n]
end

# Define a sample timeseries (prices for example)
idx = DateTime(2010,1,1):Dates.Minute(1):DateTime(2011,1,1)
idx = idx[1:end-1]
N = length(idx)
srand(0)  # set the random seed in julia generator of random numbers

# Define how timestamp (ie Date or DateTime) should be grouped (using a TimeFrame)
a_tf = [
    TimeFrame(dt -> floor(dt, Dates.Minute(15))),  # using a lambda function
    TimeFrame(Minutely(15)),  # using a TimeFrame object (from TimeFrames.jl)
    TimeFrame("15T"),  # using a string TimeFrame shortcut to create a TimeFrame
    "15T",  # using a string TimeFrame shortcut 
]
for tf in a_tf

    facts("resample methods with $tf") do

        context("Define a sample timeseries (prices for example)") do
            y = rand(-1.0:0.01:1.0, N)
            y = 1000 + cumsum(y)
            #df = DataFrame(Date=idx, y=y)
            ta = TimeArray(collect(idx), y, ["y"])

            context("resample using OHLC values works") do
                ta_ohlc = ohlc(resample(ta, tf))
                @fact mean(variation(ta_ohlc.timestamp)) --> Dates.Minute(15)
            end

            context("resample using mean values works") do
                ta_mean = mean(resample(ta, tf))
                @fact mean(variation(ta_mean.timestamp)) --> Dates.Minute(15)
            end

        end

        context("resample using sum values works") do
            # Define an other sample timeseries (volume for example)
            vol = rand(0:0.01:1.0, N)
            ta_vol = TimeArray(collect(idx), vol, ["vol"])

            # resample using sum values
            ta_vol_sum = sum(resample(ta_vol, tf))
            @fact mean(variation(ta_vol_sum.timestamp)) --> Dates.Minute(15)
        end
    end

end