Closed femtotrader closed 8 years ago
We're in the midst of changing its API (https://github.com/JuliaStats/TimeSeries.jl/pull/244), but the collapse
method can be used for downsampling. It looks like that question also requires mapping grouped observations to multiple outputs, which we don't currently support - but it shouldn't be too hard to achieve by extending current functionality. I'll add a comment to that effect on SO.
Here is a Do-It-Yourself solution using the "Split-Apply-Combine" Strategy.
I assume you download sample data here https://drive.google.com/file/d/0B8iUtWjZOTqla3ZZTC1FS0pkZXc/view?usp=sharing
Unzip AUDUSD-2014-01.zip
(a 11M zip file) and have now AUDUSD-2014-01.csv
which is a 85M file
Processing such a file can be quite long, so you can process a shorter file.
julia> using DataFrames
julia> df = readtable("AUDUSD-2014-01.csv", header = false, names=[:Symb, :Date, :Bid, :Ask]) # Read CSV file
1947106x4 DataFrames.DataFrame
│ Row │ Symb │ Date │ Bid │ Ask │
┝━━━━━━━━━┿━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━┥
│ 1 │ "AUD/USD" │ "20140101 21:55:34.404" │ 0.88796 │ 0.88922 │
│ 2 │ "AUD/USD" │ "20140101 21:55:34.444" │ 0.88805 │ 0.88914 │
│ 3 │ "AUD/USD" │ "20140101 21:55:34.475" │ 0.88809 │ 0.8891 │
│ 4 │ "AUD/USD" │ "20140101 21:55:48.962" │ 0.88811 │ 0.88908 │
│ 5 │ "AUD/USD" │ "20140101 21:56:38.293" │ 0.88808 │ 0.88887 │
│ 6 │ "AUD/USD" │ "20140101 21:56:38.294" │ 0.88798 │ 0.88887 │
│ 7 │ "AUD/USD" │ "20140101 21:56:42.295" │ 0.88807 │ 0.88887 │
│ 8 │ "AUD/USD" │ "20140101 21:56:42.300" │ 0.88814 │ 0.88902 │
│ 9 │ "AUD/USD" │ "20140101 21:56:43.050" │ 0.88807 │ 0.88896 │
⋮
│ 1947097 │ "AUD/USD" │ "20140131 21:59:39.764" │ 0.87533 │ 0.87595 │
│ 1947098 │ "AUD/USD" │ "20140131 21:59:45.299" │ 0.87532 │ 0.87594 │
│ 1947099 │ "AUD/USD" │ "20140131 21:59:45.551" │ 0.87547 │ 0.87601 │
│ 1947100 │ "AUD/USD" │ "20140131 21:59:46.437" │ 0.87527 │ 0.87589 │
│ 1947101 │ "AUD/USD" │ "20140131 21:59:46.439" │ 0.87508 │ 0.87589 │
│ 1947102 │ "AUD/USD" │ "20140131 21:59:48.048" │ 0.87525 │ 0.87589 │
│ 1947103 │ "AUD/USD" │ "20140131 21:59:54.599" │ 0.87527 │ 0.87589 │
│ 1947104 │ "AUD/USD" │ "20140131 21:59:56.927" │ 0.87531 │ 0.87588 │
│ 1947105 │ "AUD/USD" │ "20140131 21:59:59.365" │ 0.87531 │ 0.87574 │
│ 1947106 │ "AUD/USD" │ "20140131 22:00:00.038" │ 0.87531 │ 0.87574 │
julia> df[:Date] = DateTime(df[:Date], "yyyymmdd HH:MM:SS.s"); # convert string to DateTime
julia> EPOCH = DateTime("1970-01-01");
julia> df[:UnixTimestamp] = map(td->Int(td), df[:Date]-EPOCH); # convert DateTime to Unix timestamp
function roundMod(x::Int, y::Int)
x - x % y
end
julia> Minutes = 1;
julia> df[:UnixTimestampRound] = map(x->roundMod(x, Minutes*60*1000), df[:UnixTimestamp]) # apply roundMod function along UnixTimestamp
julia> df[:DateRound] = map(ts->Base.Dates.Millisecond(ts) + EPOCH, df[:UnixTimestampRound]) # convert Unix Timestamp Round to DateTime
julia> dfMean = by(df, :DateRound, df -> mean(df[:Bid])) # group by Unix Timestamp Round and calcule mean of Bid column
31019x2 DataFrames.DataFrame
│ Row │ DateRound │ x1 │
┝━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━┥
│ 1 │ 2014-01-01T21:55:00 │ 0.888052 │
│ 2 │ 2014-01-01T21:56:00 │ 0.88781 │
│ 3 │ 2014-01-01T21:57:00 │ 0.887792 │
│ 4 │ 2014-01-01T21:58:00 │ 0.887777 │
│ 5 │ 2014-01-01T21:59:00 │ 0.887855 │
│ 6 │ 2014-01-01T22:03:00 │ 0.888741 │
│ 7 │ 2014-01-01T22:04:00 │ 0.88876 │
│ 8 │ 2014-01-01T22:05:00 │ 0.888801 │
│ 9 │ 2014-01-01T22:06:00 │ 0.888891 │
⋮
│ 31010 │ 2014-01-31T21:51:00 │ 0.875116 │
│ 31011 │ 2014-01-31T21:52:00 │ 0.875452 │
│ 31012 │ 2014-01-31T21:53:00 │ 0.875187 │
│ 31013 │ 2014-01-31T21:54:00 │ 0.875183 │
│ 31014 │ 2014-01-31T21:55:00 │ 0.875123 │
│ 31015 │ 2014-01-31T21:56:00 │ 0.87514 │
│ 31016 │ 2014-01-31T21:57:00 │ 0.874941 │
│ 31017 │ 2014-01-31T21:58:00 │ 0.875151 │
│ 31018 │ 2014-01-31T21:59:00 │ 0.875318 │
│ 31019 │ 2014-01-31T22:00:00 │ 0.87531 │
Resampling using OHLC might also be possible... but I haven't been able to do it as
by(df, :DateRound, df -> min(df[:Bid]))
raises
ERROR: MethodError: `min` has no method matching min(::DataArrays.DataArray{Float64,1})
Closest candidates are:
min{T1<:Real,T2<:Real}(::AbstractArray{T1<:Real,N}, ::T2<:Real)
min{T1<:Real,T2<:Real}(::AbstractArray{T1<:Real,N}, ::AbstractArray{T2<:Real,N})
min(::Any, ::Any)
...
in anonymous at none:1
in map at /Users/scls/.julia/v0.4/DataFrames/src/groupeddataframe/grouping.jl:178
in by at /Users/scls/.julia/v0.4/DataFrames/src/groupeddataframe/grouping.jl:303
by(df, :DateRound, df -> minimum(df[:Bid]))
would probably work (min
operates over multiple arguments as opposed to a single array argument)
so
by(df, :DateRound, df -> minimum(df[:Bid]))
to get "L" lowest value
and
by(df, :DateRound, df -> maximum(df[:Bid]))
to get "H" highest value
But how to get O (Open) and C (Close) ?
I try
by(df, :DateRound, df -> df[:Bid][1,:])
or
by(df, :DateRound, df -> df[:Bid][end,:])
but it doesn't work as I was expecting...
It raises
ERROR: ArgumentError: setindex!(::DataFrame, ...) only broadcasts scalars, not arrays
in setindex! at /Users/scls/.julia/v0.4/DataFrames/src/dataframe/dataframe.jl:375
in DataFrame at /Users/scls/.julia/v0.4/DataFrames/src/dataframe/dataframe.jl:104
in wrap at /Users/scls/.julia/v0.4/DataFrames/src/groupeddataframe/grouping.jl:186
in map at /Users/scls/.julia/v0.4/DataFrames/src/groupeddataframe/grouping.jl:178
in by at /Users/scls/.julia/v0.4/DataFrames/src/groupeddataframe/grouping.jl:303
by(df, :DateRound, df -> first(df[:Bid]))
and by(df, :DateRound, df -> last(df[:Bid]))
should work - assuming open and close mean what I think they do?
Thanks... and assuming that data are chronological ordered!
Indeed, that's not guaranteed with DataFrames.
Ideally, using TimeSeries.jl one would just use collapse(data, hour, first, [first, maximum, minimum, last])
. We don't currently support the vector of reduction functions, but I don't see why we couldn't in the future.
Sorry but I don't understand what "hour" is, I also don't understand why the 3rd argument is "first"
That's just the collapse
function API - the second argument is a function that maps timestamps to a value that can be grouped on locally (in that example, the current hour), and the third argument is a function that reduces all of the timestamps of the grouped observations to a single output value (in that case, an array of all of the timestamps from a given hour are mapped to the first timestamp in the hour).
Here it's current minute in fact (not hour)
An other function on top of this will be great (like Pandas resample)
A much simpler API could be
Create a sample time series
idx = DateTime(2010,1,1):Minute(1):DateTime(2016,1,1)
N = length(idx)
y = rand(-1.0:0.01:1.0, N)
y = 100 + cumsum(y)
df = DataFrame(Date=idx, y=y)
ta = TimeArray(collect(idx),y,["y"])
Define timeframe
tf = Yearly(5, boundary=End)
or
tf = Yearly(5, boundary=Begin)
Downscaling using mean value
mean(resample(ta, tf))
Downscaling using OHLC value
ohlc(resample(ta, tf))
Downscaling using sum might also be required (to sum volume for example in OHLCV data)
sum(resample(ta, tf))
resample
function need to return an object let's call it TimeArraySampler
Here is an example with TimeArrayResampler
type, a very basic (TimeFrame
type which should be improved) and ohlc
, mean
, sum
functions
#using DataFrames
using TimeSeries
import Base: mean, sum
# Define types (TimeFrame, TimeArrayResampler) and functions (resample, ohlc, mean, sum)
type TimeFrame
f_group::Function
end
type TimeArrayResampler
ta::TimeArray
tf::TimeFrame
end
function resample(ta, f_group)
TimeArrayResampler(ta, f_group)
end
function ohlc(resampler::TimeArrayResampler)
ta = resampler.ta
f_group = resampler.tf.f_group
ta_o = collapse(ta, f_group, first, first)
ta_h = collapse(ta, f_group, first, maximum)
ta_l = collapse(ta, f_group, first, minimum)
ta_c = collapse(ta, f_group, first, last)
a_ohlc = hcat(ta_o.values, ta_h.values, ta_l.values, ta_c.values)
ta_ohlc = TimeArray(ta_o.timestamp, a_ohlc, ["Open", "High", "Low", "Close"])
end
function mean(resampler::TimeArrayResampler)
collapse(resampler.ta, resampler.tf.f_group, first, mean)
end
function sum(resampler::TimeArrayResampler)
collapse(resampler.ta, resampler.tf.f_group, first, sum)
end
# Define a sample timeseries (prices for example)
idx = DateTime(2010,1,1):Dates.Minute(1):DateTime(2011,1,1)
idx = idx[1:end-1]
N = length(idx)
y = rand(-1.0:0.01:1.0, N)
y = 1000 + cumsum(y)
#df = DataFrame(Date=idx, y=y)
ta = TimeArray(collect(idx), y, ["y"])
println("ta=")
println(ta)
# Define how datetime should be grouped (timeframe)
tf = TimeFrame(dt -> floor(dt, Dates.Minute(15)))
# resample using OHLC values
ta_ohlc = ohlc(resample(ta, tf))
println("ta_ohlc=")
println(ta_ohlc)
# resample using mean values
ta_mean = mean(resample(ta, tf))
println("ta_mean=")
println(ta_mean)
# Define an other sample timeseries (volume for example)
vol = rand(0:0.01:1.0, N)
ta_vol = TimeArray(collect(idx), vol, ["vol"])
println("ta_vol=")
println(ta_vol)
# resample using sum values
ta_vol_sum = sum(resample(ta_vol, tf))
println("ta_vol_sum=")
println(ta_vol_sum)
Will you accept a PR that depends on TimeFrames.jl and that define resample
, mean
, ohlc
, sum
functions and some tests
or should I create an other project TimeSeriesResampler ?
resample.jl
using TimeFrames: TimeFrame, dt_grouper
import Base: mean, sum
type TimeArrayResampler
ta::TimeArray
tf::TimeFrame
end
function resample(ta::TimeArray, s_tf::String)
resample(ta, TimeFrame(s_tf))
end
function resample(ta::TimeArray, tf::TimeFrame)
TimeArrayResampler(ta, tf)
end
function ohlc(resampler::TimeArrayResampler)
ta = resampler.ta
f_group = dt_grouper(resampler.tf)
ta_o = collapse(ta, f_group, first, first)
ta_h = collapse(ta, f_group, first, maximum)
ta_l = collapse(ta, f_group, first, minimum)
ta_c = collapse(ta, f_group, first, last)
a_ohlc = hcat(ta_o.values, ta_h.values, ta_l.values, ta_c.values)
ta_ohlc = TimeArray(ta_o.timestamp, a_ohlc, ["Open", "High", "Low", "Close"])
end
function mean(resampler::TimeArrayResampler)
collapse(resampler.ta, dt_grouper(resampler.tf), first, mean)
end
function sum(resampler::TimeArrayResampler)
collapse(resampler.ta, dt_grouper(resampler.tf), first, sum)
end
and tests
using FactCheck
FactCheck.setstyle(:compact)
FactCheck.onlystats(true)
using TimeSeries: TimeArray, resample, ohlc, mean, sum
using TimeFrames: TimeFrame, Minutely
function variation(a; n=1)
a[1+n:end] - a[1:end-n]
end
# Define a sample timeseries (prices for example)
idx = DateTime(2010,1,1):Dates.Minute(1):DateTime(2011,1,1)
idx = idx[1:end-1]
N = length(idx)
srand(0) # set the random seed in julia generator of random numbers
# Define how timestamp (ie Date or DateTime) should be grouped (using a TimeFrame)
a_tf = [
TimeFrame(dt -> floor(dt, Dates.Minute(15))), # using a lambda function
TimeFrame(Minutely(15)), # using a TimeFrame object (from TimeFrames.jl)
TimeFrame("15T"), # using a string TimeFrame shortcut to create a TimeFrame
"15T", # using a string TimeFrame shortcut
]
for tf in a_tf
facts("resample methods with $tf") do
context("Define a sample timeseries (prices for example)") do
y = rand(-1.0:0.01:1.0, N)
y = 1000 + cumsum(y)
#df = DataFrame(Date=idx, y=y)
ta = TimeArray(collect(idx), y, ["y"])
context("resample using OHLC values works") do
ta_ohlc = ohlc(resample(ta, tf))
@fact mean(variation(ta_ohlc.timestamp)) --> Dates.Minute(15)
end
context("resample using mean values works") do
ta_mean = mean(resample(ta, tf))
@fact mean(variation(ta_mean.timestamp)) --> Dates.Minute(15)
end
end
context("resample using sum values works") do
# Define an other sample timeseries (volume for example)
vol = rand(0:0.01:1.0, N)
ta_vol = TimeArray(collect(idx), vol, ["vol"])
# resample using sum values
ta_vol_sum = sum(resample(ta_vol, tf))
@fact mean(variation(ta_vol_sum.timestamp)) --> Dates.Minute(15)
end
end
end
Hello,
Resampling time series with Julia could be a nice feature. I noticed this question on SO http://stackoverflow.com/questions/20816813/resampling-a-dataframe-to-hourly-15min-and-5min-periods-in-julia
Maybe someone here can answer it ?
Kind regards