ServiceNow / stl-decomp-4j

Java implementation of Seasonal-Trend-Loess time-series decomposition algorithm.
Apache License 2.0

Post-STL seasonal smoothing edge artifacts #6

Open jcrotinger opened 7 years ago

jcrotinger commented 7 years ago

When doing post-STL smoothing of the seasonal component there can be artifacts at the edge. Here is a fit for some synthetic data.

[Image: screen shot 2017-07-14 at 5 30 46 pm]

The smoother is really too wide here, and the artifact isn't intolerable when a narrower smoother is used, but it is still not optimal.

jcrotinger commented 7 years ago

Hafen's thesis discusses blending the second order LOESS smoothing with a flat LOESS smoothing of half the smoothing width near the edges. I think stlplus can do this for all the LOESS smoothers used in STL. If I get a chance, I'll at least get this in for the post-STL smoothing, which is where I've tended to use the second degree LOESS.
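
A minimal sketch of the blending idea, assuming both smooths have already been computed. The linear ramp and the `edgeWidth` parameter are illustrative assumptions, not the scheme from Hafen's thesis or stlplus, and `EdgeBlendSketch` is a hypothetical name that is not part of stl-decomp-4j.

```java
/** Sketch: blend a degree-2 smooth with a flat (degree-0) smooth near the series edges. */
final class EdgeBlendSketch {

    /**
     * @param quadratic smooth from a second-degree LOESS (assumed precomputed)
     * @param flat      smooth from a flat LOESS of half the width (assumed precomputed)
     * @param edgeWidth number of points from each end over which to blend (hypothetical parameter)
     * @return the blended smooth, equal to {@code quadratic} away from the edges
     */
    static double[] blend(double[] quadratic, double[] flat, int edgeWidth) {
        if (quadratic.length != flat.length) {
            throw new IllegalArgumentException("smooths must have the same length");
        }
        if (edgeWidth < 1) {
            throw new IllegalArgumentException("edgeWidth must be at least 1");
        }
        int n = quadratic.length;
        double[] out = new double[n];
        for (int i = 0; i < n; i++) {
            int distToEdge = Math.min(i, n - 1 - i);
            // Assumed linear ramp: the flat smooth gets weight 1 at the endpoint,
            // falling to 0 once we are edgeWidth points into the series.
            double flatWeight = Math.max(0.0, 1.0 - (double) distToEdge / edgeWidth);
            out[i] = flatWeight * flat[i] + (1.0 - flatWeight) * quadratic[i];
        }
        return out;
    }
}
```

With this ramp the endpoints take the flat smooth entirely, which is less prone to the overshoot a local quadratic can show where the window is one-sided, while interior points are untouched.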

littlehappi commented 6 years ago

Hi @jcrotinger, this is great work. I have some questions after reading both the old stl-java version (brandtg/stl-java) and this one. Does stl-decomp-4j achieve the same performance as the R version? I ran into the same problem described in https://github.com/brandtg/stl-java/issues/9. I wonder whether this new version solves those problems, and if so, how. Many thanks for your time.

jcrotinger commented 6 years ago

Hi @littlehappi,

Thanks!

stl-decomp-4j is a port of the original Fortran that underlies the stl package in R (and the pyloess package for Python, for that matter), with some nods to modern software development techniques, along with an extension to support local quadratic interpolation.

It is Java, so it is not as fast as the Fortran version, but it scales similarly. There are some performance tests in examples/StlPerfTest using both the CO2 data and the 'hourly' data that was the subject of discussion on the stl-java issues list. The stl-java version had more serious performance problems because it was attempting to use a generic LOESS routine that was not specialized for regularly spaced data. (It had a more serious problem of not being correct, of course, and has since been deprecated.)

littlehappi commented 6 years ago

Hi @jcrotinger, thanks for your answer. I have another question. I have tried calling the R stl interface from a Java project to decompose multiple time series. However, the R engine is single-threaded, so the series must wait in a queue for R's stl to decompose them, which results in low throughput even when the Java side is multithreaded. As you mentioned, stl-decomp-4j is a port of the Fortran that underlies the stl package in R. When calling stl-decomp-4j to process multiple time series in a Java project, must the series also wait in a queue to be decomposed, as with R?

jcrotinger commented 6 years ago

@littlehappi The Java code is thread friendly - there is no non-constant shared static state between different SeasonalTrendLoess instances. So as long as each thread has its own instance, independent decompositions can be done in parallel. (The Fortran code is also technically thread friendly, in that all work arrays are passed around on the stack. I've not tried to call R from Java, but I'm guessing that it is the R engine that is the threading issue there, not the underlying stl.)
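
A minimal sketch of the one-instance-per-task pattern described above. To keep it self-contained, `MeanRemover` is a hypothetical stand-in for a per-series decomposer and is not the stl-decomp-4j API. The assumption is that a real decomposer instance would be created inside each task in the same way.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelDecompositionSketch {

    /** Hypothetical stand-in for a decomposer: all state is per-instance, nothing is static. */
    static final class MeanRemover {
        private final double[] data;

        MeanRemover(double[] data) {
            this.data = data.clone();
        }

        /** Returns the series with its mean subtracted (a placeholder for a real decomposition). */
        double[] decompose() {
            double sum = 0.0;
            for (double v : data) {
                sum += v;
            }
            double mean = sum / data.length;
            double[] residual = new double[data.length];
            for (int i = 0; i < data.length; i++) {
                residual[i] = data[i] - mean;
            }
            return residual;
        }
    }

    /** Decomposes each series on a thread pool, creating a fresh instance per task. */
    static List<double[]> decomposeAll(List<double[]> seriesList, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<double[]>> futures = new ArrayList<>();
            for (double[] series : seriesList) {
                // Each task owns its instance, so no state is shared between threads.
                Callable<double[]> task = () -> new MeanRemover(series).decompose();
                futures.add(pool.submit(task));
            }
            List<double[]> results = new ArrayList<>();
            for (Future<double[]> future : futures) {
                results.add(future.get()); // results come back in submission order
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for results", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("decomposition task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Because nothing is shared between tasks, no locking is needed, and the series do not queue behind a single engine.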

littlehappi commented 6 years ago

@jcrotinger Hi! I have read the source code, but I am unclear about how stl-decomp-4j calls the ported Fortran. I assumed that the loess smoothing method determines the fitting performance, so I checked it. This method is also written in Java, except for a section of commented-out code that mentions Fortran, like this: [Image: 2018-01-05 6 42 23] Maybe I misunderstood something here?

jcrotinger commented 6 years ago

Hi @littlehappi,

The stl-decomp-4j package is a Java implementation of the algorithm in the original Fortran (RATFOR) code. It does not call Fortran at all. It is a port from Fortran to Java of the same approach, using the same underlying loess approach (equally spaced points with no missing data), etc. It gets the same answers, within roundoff, and has the same scaling in time and memory complexity. It does do dynamic memory allocation, which wasn't available in the original Fortran, and I have some TODOs about possibly pre-allocating these arrays and using them in order to create less garbage, but it hasn't been a high priority.

One could write a JNI wrapper to call the original Fortran, and I think there is a github project with a start on such a wrapper. Calling the native Fortran would be faster, but also less safe. I needed a Java implementation for an environment that does not allow calling unmanaged code, looked at stl-java and found it to have problems that could not be simply fixed, so I ported the original code to Java (and later enhanced it to support local quadratic interpolation) and wanted to give it back to the community since others were looking for the same sort of functionality.

The comments you referenced are copies of a section of the original Fortran code that I used when debugging my port. Fortran is indexed from 1, and the loop logic is somewhat messy, so I wanted to make sure that I understood it and wrote it correctly in Java. Thus the example calculations off to the right of the ratfor code. :)
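
As a hedged illustration of why 1-based loop logic is easy to get wrong in a port, here is a small sketch of a clamped window computed both ways. It is not the actual STL window logic: `IndexPortSketch`, both method names, and the clamping rule are assumptions made for the example.

```java
/** Sketch: the same clamped window expressed with 1-based and 0-based indices. */
final class IndexPortSketch {

    /** Fortran-style: position x in 1..n; returns inclusive {left, right} in 1..n. Assumes len <= n. */
    static int[] windowOneBased(int x, int len, int n) {
        int left = Math.max(1, x - len / 2);
        int right = left + len - 1;
        if (right > n) {      // slide the window back inside the array
            right = n;
            left = n - len + 1;
        }
        return new int[] {left, right};
    }

    /** Java-style: position i in 0..n-1; returns inclusive {left, right} in 0..n-1. Assumes len <= n. */
    static int[] windowZeroBased(int i, int len, int n) {
        int left = Math.max(0, i - len / 2);
        int right = left + len - 1;
        if (right > n - 1) {  // the upper bound and the shifted left edge both move by one
            right = n - 1;
            left = n - len;
        }
        return new int[] {left, right};
    }
}
```

Checking that `windowZeroBased(i, len, n)` equals `windowOneBased(i + 1, len, n)` shifted down by one for every `i` is the kind of side-by-side verification that worked examples next to the original code make easy.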

littlehappi commented 6 years ago

Hi @jcrotinger, I have run some time series smoothing experiments with stl-decomp-4j over the past few days. It seems that this version performs a bit worse than the R version. Sometimes a sudden drop appears in the decomposition result of the Java version, like this: [Image: 2018-01-18 9 14 44] Do you have any advice? Would quadratic loess regression or more inner/outer loops help? By the way, I wonder whether the Java stl can produce exactly the same result as the R version. If so, how? Thanks.

jcrotinger commented 6 years ago

@littlehappi Can you add an example so I can try to reproduce the result? What does the raw data look like? This decomposition result does not look like there is any particular seasonality or trend to extract. It also looks odd because normally the seasonal component oscillates around 0. Also, can you post it as a new issue please? This has strayed from the subject of my original issue above. :)

The R version actually calls the native Fortran. The Java version will not be as fast. But using the Fortran version requires running unmanaged code, which isn't always possible. That's why I wrote it. :)

There is a comparison between Fortran and stl-decomp-4j for the CO2 data - see /examples/StlPerfTest/StlJavaFortranComparison.ipynb.

williewheeler commented 6 years ago

I did create a Java wrapper around the original Fortran code:

https://github.com/ExpediaDotCom/stl-jni

Have been using it for a couple years now (open sourced it about a year ago) and it works fine. So that option is available. But I'm excited to see a pure Java version so I'll give yours a try.

jcrotinger commented 6 years ago

@williewheeler I saw that - I have applications that can't use unmanaged code, which is why I went this route, but it definitely isn't as fast. It would be nice to have the Builder be able to build an unmanaged version if one is available - might look at that if I have time at some point.