brandtg / stl-java

A Java implementation of STL
Apache License 2.0

Make performance comparable to R / Fortran #8

Open brandtg opened 8 years ago

brandtg commented 8 years ago

On a time series of 89,345 points with seasonality 8,736 (hourly over ~10 yr), it takes 159.25s to compute STL, whereas in R it takes 1.37s.

This may be due to a number of things:

We have configured 10 inner loop passes (vs. R's 2) to achieve a good result, and we are using a relatively high bandwidth (~0.60 to 0.70); this is likely the cause of the orders-of-magnitude difference.

We need to figure out how to choose the correct bandwidth at each invocation of Loess, as suggested by the paper. Once that is identified, we can analyze the differences between the Java and Fortran implementations.
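For reference, the STL paper ties the low-pass and trend smoothing spans to the seasonal period rather than to a fixed fraction of the series. A sketch of those default formulas (class and method names are mine, not from this repo):

```java
public class StlSpans {
    // smallest odd integer >= x (STL requires odd smoothing spans)
    static int nextOdd(double x) {
        int v = (int) Math.ceil(x);
        return (v % 2 == 0) ? v + 1 : v;
    }

    // Cleveland et al. (1990) defaults: n_l = smallest odd >= period,
    // n_t = smallest odd >= 1.5 * period / (1 - 1.5 / n_s)
    static int lowPassSpan(int period) {
        return nextOdd(period);
    }

    static int trendSpan(int period, int seasonalSpan) {
        return nextOdd(1.5 * period / (1.0 - 1.5 / seasonalSpan));
    }

    public static void main(String[] args) {
        System.out.println(lowPassSpan(12));   // monthly data: 13
        System.out.println(trendSpan(12, 7));  // 23
    }
}
```

The key property is that these spans depend only on the period and the seasonal span, not on the series length, which is exactly what a fixed fractional bandwidth fails to capture.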

hntd187 commented 8 years ago

@brandtg Well, we're probably not going to beat a good Fortran impl on anything numeric; that's just the breaks. We could work to be smarter and more efficient in the array usage. After all, what is the data in this case but a very large, perhaps sparse, vector? We could in theory use some vector math libraries and represent the subseries as a sparse matrix or something. I'm thinking out loud here, but that might give us some noticeable performance. Perhaps also a native lib for Loess? The R implementation isn't fast because of R; it's fast because of Fortran.

hntd187 commented 8 years ago

@brandtg Perhaps consider https://github.com/fommil/matrix-toolkits-java and https://github.com/bartverm/loess as something we could adapt?

brandtg commented 8 years ago

@hntd187 Nice I will check those libraries out, and thanks for the suggestions. I agree that we probably won't hit Fortran speeds, but my intuition is that we could probably get an order of magnitude improvement by figuring out how to use Loess better, and reducing memory copies. I profiled the process on this input a bit, and found that a lot of time was spent in loess during low pass filter and trend smoothing.

hntd187 commented 8 years ago

@brandtg I was actually thinking about this a bit today, and it might not be a horrible idea to throw together a JNI C impl of Loess real quick and see if that cuts down the execution time. I could probably hack together a rough version in a day or so. Of course, native libs come with all the problems of the world, but the performance is there.

brandtg commented 8 years ago

@hntd187 Sounds good to me. It might also be interesting to run something like f2j on the original STL fortran implementation, and see if we can wrap the result in a similar way to the way R does.

hntd187 commented 8 years ago

@brandtg I've almost got a C impl done through the JNI. The JNI is terrible, I'll just let you know right now, but I might have this done today. Then again, I also got a copy of Tom Clancy's The Division, so there is that.

I also think this whole performance question raises the conversation about where we're hoping to go with this eventually. It's ultimately your choice, but if we decide to refactor a lot of the code with performance in mind, we should think about whether we're turning this into a more complete library, creating a super tight impl of just STL, or laying the basis for time series decomposition in general and expanding with other methods.

brandtg commented 8 years ago

@hntd187 Yeah, good point - I'm not really sure what the eventual goals are here. The original goal was just to have something comparable to R to run in MR or Spark jobs, and avoid having to wrap R.

For example, it takes R 1.34 seconds to generate the following result on 10 years of hourly data with yearly seasonality:

[image: r-plot]

Whereas it takes stl-java 1.31 minutes to generate the following result on the same data:

[image: hourly-stl-result]

At this point, I believe we've reached the good-enough threshold for the original intended use case, since the slowdown only becomes pronounced on very large series, and a marginal minute in an MR or Spark context is relatively small. The output isn't exactly the same, but things look decent at a high level.

My hope for this issue is that we can find a way to close that order-of-magnitude gap, with minimal additional code complexity and the same portability (so pure JVM preferred).

brandtg commented 8 years ago

Also, FWIW, running on smaller data sets, e.g. a couple months of hourly data with a weekly cycle, does not have this performance problem, so it may not be worth chasing further until there's some concrete use case in which it's a problem.

hntd187 commented 8 years ago

@brandtg Check this out, I wrote this today; maybe you wanna try to benchmark a bit with it and see how it does? Fair warning: I am terrible at C, so if it leaks horrible amounts of memory, I apologize ahead of time. https://github.com/hntd187/Loess-C

hntd187 commented 8 years ago

@brandtg Also, I ran it a few times (hardly a rigorous benchmark), and it tended to be about 2-5x faster than the commons implementation. I'm sure the C can be made faster; there's also no error handling yet.

brandtg commented 8 years ago

@hntd187 Nice, we should give this a shot and see if we can eliminate the order-of-magnitude performance discrepancy. Maybe add another option to StlConfig that can switch between the native implementation and LoessInterpolator for testing.
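A sketch of what that switch could look like (the enum and accessor names here are hypothetical, not the project's actual StlConfig API):

```java
public class StlConfig {
    // hypothetical backend switch; defaults to the pure-Java commons-math smoother
    public enum LoessBackend { COMMONS_MATH, NATIVE }

    private LoessBackend loessBackend = LoessBackend.COMMONS_MATH;

    public LoessBackend getLoessBackend() {
        return loessBackend;
    }

    public void setLoessBackend(LoessBackend backend) {
        this.loessBackend = backend;
    }
}
```

Defaulting to the commons-math backend keeps the library pure-JVM out of the box, with the native path as an opt-in for users who need the speed.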

I think I've identified that LoessInterpolator is the bottleneck. Here are the top 5 CPU hogs from profiling:

CPU SAMPLES BEGIN (total = 2014) Sun Mar 20 20:14:02 2016
rank   self  accum   count trace method
   1 52.09% 52.09%    1049 300117 org.apache.commons.math3.analysis.interpolation.LoessInterpolator.smooth
   2 39.23% 91.31%     790 300121 org.apache.commons.math3.analysis.interpolation.LoessInterpolator.smooth
   3  3.28% 94.59%      66 300168 java.io.FileOutputStream.writeBytes
   4  0.45% 95.03%       9 300125 org.apache.commons.math3.analysis.interpolation.LoessInterpolator.smooth
   5  0.40% 95.43%       8 300126 org.apache.commons.math3.analysis.interpolation.LoessInterpolator.smooth

The top two correspond to these stack traces

TRACE 300117:
        org.apache.commons.math3.analysis.interpolation.LoessInterpolator.smooth(LoessInterpolator.java:301)
        org.apache.commons.math3.analysis.interpolation.LoessInterpolator.smooth(LoessInterpolator.java:395)
        com.github.brandtg.stl.StlDecomposition.loessSmooth(StlDecomposition.java:436)
        com.github.brandtg.stl.StlDecomposition.lowPassFilter(StlDecomposition.java:407)
TRACE 300121:
        org.apache.commons.math3.analysis.interpolation.LoessInterpolator.smooth(LoessInterpolator.java:301)
        org.apache.commons.math3.analysis.interpolation.LoessInterpolator.smooth(LoessInterpolator.java:395)
        com.github.brandtg.stl.StlDecomposition.loessSmooth(StlDecomposition.java:436)
        com.github.brandtg.stl.StlDecomposition.decompose(StlDecomposition.java:165)

These correspond to the low pass filter and trend smoothing, respectively. Both consider the entire series, as opposed to the smoothing of cycle subseries. Given that neither dominates the other, I'm guessing it's the library.

The loop the profiler identifies as hot is directly proportional to the bandwidth. Also, for better or worse, LoessInterpolator#smooth performs several O(n) invariant checks and array copies.

It also takes bandwidth as a fraction of the number of points, as opposed to the Fortran implementation's integer span, so I think we need to be more careful about that calculation as the input scales.

We probably just need a smarter way to compute the bandwidth parameter given the input. But I think there's value in pursuing your native Loess implementation, to let the user choose between pure-Java safety and speed.
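One way to keep the effective window constant as the input scales is to hold the integer span fixed and derive the fractional bandwidth from it just before each smooth call. A minimal sketch (the helper name is hypothetical):

```java
public class Bandwidth {
    /**
     * Convert a fixed smoothing span q (the Fortran parameterization) into the
     * fractional bandwidth that commons-math's LoessInterpolator expects,
     * clamping to 1.0 for series shorter than the span.
     */
    static double bandwidthForSpan(int q, int n) {
        if (q % 2 == 0) q += 1;   // STL requires odd spans
        return Math.min(1.0, (double) q / n);
    }

    public static void main(String[] args) {
        // the same span yields a smaller fraction as the series grows,
        // instead of a fixed ~0.6-0.7 fraction ballooning the window
        System.out.println(bandwidthForSpan(8737, 89345));
    }
}
```

With a fixed fraction like 0.65, the window on the 89k-point series covers ~58k points per fit, which is where the quadratic-feeling blowup comes from; a fixed span keeps each fit's cost flat.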

At any rate, since there's no noticeable speed difference on reasonable (by human standards) inputs, e.g. years of monthly data, or months of hourly data, the commons math LoessInterpolator is probably reasonable to keep around at least.

brandtg commented 8 years ago

I was able to reduce the number of inner loop passes from 10 to 2 (R's choice) after commit 8bbc947a621eb51fcdea8ce124a6591934cbebf9. The cycle subseries were not being properly padded, so we had to run many more passes to remove the weirdness introduced there. That was my bad.

This cuts from ~1.31m to ~27s:

>> time java -Dinner.loop=2 -cp /Users/greg_brandt/.m2/repository/org/apache/commons/commons-math3/3.2/commons-math3-3.2.jar:/Users/greg_brandt/.m2/repository/com/fasterxml/jackson/core/jackson-core/2.5.1/jackson-core-2.5.1.jar:/Users/greg_brandt/.m2/repository/com/fasterxml/jackson/core/jackson-databind/2.5.1/jackson-databind-2.5.1.jar:/Users/greg_brandt/.m2/repository/com/fasterxml/jackson/core/jackson-annotations/2.5.0/jackson-annotations-2.5.0.jar:/Users/greg_brandt/.m2/repository/org/testng/testng/6.8.7/testng-6.8.7.jar:/Users/greg_brandt/.m2/repository/junit/junit/4.10/junit-4.10.jar:/Users/greg_brandt/.m2/repository/org/hamcrest/hamcrest-core/1.1/hamcrest-core-1.1.jar:/Users/greg_brandt/.m2/repository/org/beanshell/bsh/2.0b4/bsh-2.0b4.jar:/Users/greg_brandt/.m2/repository/com/beust/jcommander/1.27/jcommander-1.27.jar:/Users/greg_brandt/.m2/repository/org/yaml/snakeyaml/1.12/snakeyaml-1.12.jar:/Users/greg_brandt/.m2/repository/org/jfree/jfreechart/1.0.19/jfreechart-1.0.19.jar:/Users/greg_brandt/.m2/repository/org/jfree/jcommon/1.0.23/jcommon-1.0.23.jar:/Users/greg_brandt/.m2/repository/joda-time/joda-time/2.8.1/joda-time-2.8.1.jar:target/* com.github.brandtg.stl.StlDecomposition 8736 < ~/Downloads/hourly.csv > /tmp/stl.csv

real    0m27.178s
user    0m28.546s
sys 0m0.402s

And the result looks more in line with R's:

[image: r-vs-java]

Now all the CPU is in lowPassFilter:

TRACE 300106:
        org.apache.commons.math3.analysis.interpolation.LoessInterpolator.smooth(LoessInterpolator.java:301)
        org.apache.commons.math3.analysis.interpolation.LoessInterpolator.smooth(LoessInterpolator.java:395)
        com.github.brandtg.stl.StlDecomposition.loessSmooth(StlDecomposition.java:465)
        com.github.brandtg.stl.StlDecomposition.lowPassFilter(StlDecomposition.java:436)
...
CPU SAMPLES BEGIN (total = 988) Mon Mar 21 12:53:49 2016
rank   self  accum   count trace method
   1 88.36% 88.36%     873 300106 org.apache.commons.math3.analysis.interpolation.LoessInterpolator.smooth
   2  7.59% 95.95%      75 300150 java.io.FileOutputStream.writeBytes
   3  0.30% 96.26%       3 300090 com.github.brandtg.stl.StlDecomposition.main
   4  0.20% 96.46%       2 300108 org.apache.commons.math3.analysis.interpolation.LoessInterpolator.smooth
   5  0.10% 96.56%       1 300089 java.nio.Buffer.position

I would bet that we can pick smarter bandwidths for lowPassFilter on large series and cut down from O(10s) to O(1s) speeds.

hntd187 commented 8 years ago

@brandtg are you able to provide any of this data so I can profile against the same data as you?

brandtg commented 8 years ago

Here's the data in CSV

hourly.txt

jcrotinger commented 8 years ago

FYI, I still see about two orders of magnitude difference: this runs in about 27 seconds with stl-java, whereas the Python implementation (the same Fortran at its core as the R implementation) runs in 370 ms.

jcrotinger commented 8 years ago

@brandtg @hntd187 Don't know why I didn't notice this earlier, but one big difference between what is done here and the R/Fortran version is that there is no option to skip points in Loess. By default, I believe those only fit every 10th point and do linear interpolation of the Loess curve in between. Another reason to look at implementing univariate Loess specifically for this purpose.
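To make the skip-points idea concrete, here is a small pure-Java sketch of the interpolation half of that trick: run the expensive local fit only at every k-th index and linearly interpolate in between (R's stl exposes the skip as the s.jump/t.jump/l.jump parameters; the class and method names below are made up for illustration):

```java
import java.util.function.IntToDoubleFunction;

public class SkipSmooth {
    /**
     * Evaluate an expensive per-point local fit only at every k-th index and
     * linearly interpolate the points in between, so only ~n/k fits are done.
     */
    static double[] smoothWithSkip(int n, int k, IntToDoubleFunction localFit) {
        double[] out = new double[n];
        out[0] = localFit.applyAsDouble(0);
        int prev = 0;
        for (int i = k; i < n; i += k) {
            out[i] = localFit.applyAsDouble(i);
            interpolate(out, prev, i);
            prev = i;
        }
        if (prev != n - 1) {              // fit the last point and fill the tail
            out[n - 1] = localFit.applyAsDouble(n - 1);
            interpolate(out, prev, n - 1);
        }
        return out;
    }

    // fill out[lo+1 .. hi-1] by linear interpolation between out[lo] and out[hi]
    private static void interpolate(double[] out, int lo, int hi) {
        for (int j = lo + 1; j < hi; j++) {
            double t = (double) (j - lo) / (hi - lo);
            out[j] = (1 - t) * out[lo] + t * out[hi];
        }
    }

    public static void main(String[] args) {
        // with an identity "fit", skipping plus interpolation reproduces the line
        double[] r = smoothWithSkip(10, 3, i -> (double) i);
        System.out.println(java.util.Arrays.toString(r));
    }
}
```

With k=10 this alone cuts the number of Loess fits by 10x, which would account for a large chunk of the gap against the Fortran.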

brandtg commented 8 years ago

@jcrotinger I see, the skip points thing may indeed be the reason. Do you think that it would be worthwhile to submit a patch to the commons-math LoessInterpolator? Or would we benefit from implementing univariate Loess ourselves?

@hntd187 has a C-based Loess that we could start with (mentioned in this thread), or IMO it would be good to just port it to Java and see if skip points gets us to the same order of magnitude. I'd prefer to keep things in pure Java if the performance isn't too bad.

jcrotinger commented 8 years ago

@brandtg Because the way STL uses LOESS is so special-purpose, I'm thinking a port of the original version is likely the best way to go. The use of "bandwidth" in the interface isn't really consistent with the ideas behind STL: whether I'm looking at 10 months or 10 years of hourly data, I want to use the same Loess smoothing parameters. Also, bandwidth makes it a pain to check that the resulting span is odd and obeys the other limits discussed in the paper. I've looked at @hntd187's C version, but it tries to look too much like the Java, which isn't a plus from my perspective. And my app is in Java; if I have to bring in a native dependency, I might as well just use the Fortran. :)

hntd187 commented 8 years ago

@jcrotinger Well, the C impl is basically copied from the Java impl, so there is really nothing even remotely smart about its implementation. I literally just wanted something to compare at a base level. There are a ton of things that could make it faster, so don't read too much into the impl.

brandtg commented 8 years ago

@jcrotinger Got it, a port of the original version makes sense.