JuliaStats / TimeSeries.jl

Time series toolkit for Julia
Other
353 stars 69 forks source link

Construct opt #361

Closed ancapdev closed 6 years ago

ancapdev commented 6 years ago

I've been trying to use TimeSeries.jl on some high frequency datasets and ran into some performance bottlenecks in TimeArray construction. In particular the uniqueness test is heavy.

I've made a couple of changes here that help my use cases:

Rough performance improvements for something on the scale of my use case, 10 million entries, 1 column:

If these changes look reasonable I have a few more I'd like to add. To extend this I'd change some of the algorithms to construct TimeArray objects unchecked. I also noticed a few places that don't handle Float32 properly, and I'd like to rectify that.

codecov-io commented 6 years ago

Codecov Report

Merging #361 into master will decrease coverage by 0.02%. The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #361      +/-   ##
==========================================
- Coverage   86.22%   86.19%   -0.03%     
==========================================
  Files          10       10              
  Lines         479      478       -1     
==========================================
- Hits          413      412       -1     
  Misses         66       66
Impacted Files Coverage Δ
src/timearray.jl 98.48% <100%> (-0.02%) :arrow_down:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update d462cd5...0e2383b. Read the comment docs.

femtotrader commented 6 years ago

@iblis17 could you review this PR ? Thanks

iblislin commented 6 years ago

Could you add some test cases?

iblislin commented 6 years ago

Seem it can be faster, could you try this patch out?

--- a/src/timearray.jl
+++ b/src/timearray.jl
@@ -23,19 +23,23 @@ struct TimeArray{T, N, D <: TimeType, A <: AbstractArray{T, N}} <: AbstractTimeS
             timestamp::AbstractVector{D},
             values::A,
             colnames::Vector{String},
-            meta::Any;
-            unchecked = false) where {T, N, D <: TimeType, A <: AbstractArray{T, N}}
+            meta::Any,
+            unchecked::Bool) where {T, N, D <: TimeType, A <: AbstractArray{T, N}}
         nrow, ncol = size(values, 1, 2)
@@ -46,12 +50,12 @@ end
 TimeArray(d::AbstractVector{D}, v::AbstractArray{T, N},
           c::Vector{S}=fill("", size(v,2)),
           m::Any=nothing;
-          args...) where {T, N, D <: TimeType, S <: AbstractString} =
-    TimeArray{T, N, D, typeof(v)}(d, v, map(String, c), m; args...)
+          unchecked=false) where {T, N, D <: TimeType, S <: AbstractString} =
+    TimeArray{T, N, D, typeof(v)}(d, v, map(String, c), m, unchecked)
 TimeArray(d::D, v::AbstractArray{T, N}, c::Vector{S}=fill("", size(v, 2)),
           m::Any=nothing;
-          args...) where {T, N, D <: TimeType, S <: AbstractString} =
-    TimeArray{T, N, D, typeof(v)}([d], v, map(String, c), m; args...)
+          unchecked=false) where {T, N, D <: TimeType, S <: AbstractString} =
+    TimeArray{T, N, D, typeof(v)}([d], v, map(String, c), m, unchecked)
ancapdev commented 6 years ago

Tests, sure, you mean for the new unchecked path? Other than that there's no new functionality.

Regards the defaulted regular argument vs keyword argument, I can benchmark, but I expect it to make very marginal performance difference given the context of everything else going on (e.g., replace_dupes()), and I to my mind the API is much worse. Unnamed boolean arguments don't tell you much at the call site, and if you were to later extend the API it could get messy. Happy to do whatever is desired, but this doesn't seem the best trade off.

iblislin commented 6 years ago

Unnamed boolean arguments don't tell you much at the call site, and if you were to later extend the API it could get messy. Happy to do whatever is desired, but this doesn't seem the best trade off.

I expect that user usually call the outer constructor instead of inner constructor, so the positional argument of inner constructor is fine for me. User still has TimeArray(.., unchecked = true).

But, yeah, as you said, it's only very marginal performance difference. If you do not want to change it, I'm ok.

ancapdev commented 6 years ago

Cool, if we can get this change in, I'll be rebasing and preparing a PR for optimizing the merge operation next, a more interesting change ;)

iblislin commented 6 years ago

Cool, if we can get this change in

sorry, you mean this PR or my patch?

ancapdev commented 6 years ago

This PR