bfast2 / bfast

Breaks For Additive Season and Trend
https://bfast2.github.io
GNU General Public License v2.0

Parameter tuning of the bfastlite function #113

Closed nikosGeography closed 1 month ago

nikosGeography commented 2 months ago

I am using the bfastlite() function to run a time-series analysis. From the author's paper (table 2), I quote:

Needs parameter tuning to optimise performance, does not differentiate between breaks in seasonality and trend

So far, I was fine-tuning the model manually, that is, I was changing the parameters one by one, which is time-consuming. Does someone have a better solution regarding the fine-tuning of the model?

To see which parameters of the model achieve the best results, I was checking the dates in the detected breakpoints (visual inspection). I am not sure if that method (visual inspection) is appropriate.

I apologize if this question sounds a bit vague, so let me expand a little. After running bfastlite() with the default parameters (i.e., bp = bfastlite(datats)), we get a result. Is there a way to measure (something like RMSE or R-squared) how well the algorithm modelled the time series? What I basically mean is: is there an index equivalent to, say, the RMSE one checks when running a linear regression? For example, what if the parameter breaks with BIC instead of LWZ detects the breakpoints more accurately (judging by visual inspection of the detected breakpoints)? Apart from visual inspection, shouldn't there be some other way to measure the performance of the model?

Based on the above, is there a more efficient way to optimize the parameters of the model (something like a hypothetical tune_bfastlite() function)? What do I mean by optimizing the parameters? I think an example explains it better. When tuning a random forest model, one can perform a full grid search to find the optimal parameters (mtry, number of trees, etc.) by trying all possible combinations and checking the RMSE (or MSE, R-squared) of each. Is this what the authors of the paper meant when they said "Needs parameter tuning to optimise performance"?

library(bfast)

data(simts) # stl object containing simulated NDVI time series
plot(simts)
datats <- ts(rowSums(simts$time.series)) # sum of all the components (season, abrupt, remainder)
tsp(datats) <- tsp(simts$time.series) # assign correct time series attributes
plot(datats)

# Detect breaks with the default parameters
bp <- bfastlite(datats)
plot(bp)

# optimized model ??????
bp_opt <- bfastlite()

R 4.4.1, bfast 1.6.1, Windows 11.

rsbivand commented 2 months ago

The same question was posed on https://stat.ethz.ch/pipermail/r-sig-geo/2024-September/029476.html - please check that thread too and report the resolution both here and there.

nikosGeography commented 2 months ago

Yes, I asked the question there as well in case someone else knew the answer or could point me in the right direction.

I'll update both posts as soon as I have an answer.

GreatEmerald commented 2 months ago

For the paper, I developed my own validation code. It's freely available; you can find it in the corresponding repository: https://github.com/GreatEmerald/cglops-change-detection/tree/master/src/bfast-cal Specifically, in 03-batchprocessing.r, the function TestParams takes a set of parameter combinations as input, runs the model with each set, and aggregates the results. That said, it requires you to have validation data to test against.
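To illustrate the idea (this is not the TestParams code itself, just a minimal hedged sketch): with reference break dates in hand, you can count a detected break as a true positive when it falls within some tolerance of a reference break (±1 year here, an arbitrary choice), and summarise accuracy as precision and recall.

```r
# Sketch of breakpoint validation against reference data.
# `detected` and `reference` are break dates in decimal years.
validate_breaks <- function(detected, reference, tol = 1) {
  # A reference break is hit if any detected break lies within `tol` of it.
  tp <- sum(sapply(reference, function(r) any(abs(detected - r) <= tol)))
  precision <- if (length(detected) > 0) tp / length(detected) else NA
  recall <- tp / length(reference)
  c(precision = precision, recall = recall)
}

validate_breaks(detected = c(2005.3, 2011.0), reference = c(2005.5, 2016.2))
# -> precision 0.5, recall 0.5 (2005.3 matches 2005.5; 2011.0 is a false alarm)
```

You can then run this for each parameter combination and pick the one with the best precision/recall trade-off.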

BFAST Lite does provide goodness-of-fit statistics, including R², which you can get by running summary(bp$breakpoints). However, R² is not necessarily very useful for parameter tuning: the more parameters you add to the model, the better the fit, but the model becomes overfitted and predicts far too many breaks. The BIC and LWZ criteria are better suited to this, as they penalise added model parameters. If you don't have any reference data, you could use min(summary(bp$breakpoints)$RSS["BIC",]) or min(summary(bp$breakpoints)$RSS["LWZ",]) to compare the best-fitting models across parameter runs. Nevertheless, for real applications where you do have reference data, the first option I mentioned above is better.
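As a rough sketch of what that comparison could look like without reference data: loop over a small parameter grid, score each run by its minimum LWZ, and keep the best. This assumes bfastlite() forwards extra arguments such as h to strucchangeRcpp::breakpoints; the grid values below are arbitrary examples, not recommendations.

```r
library(bfast)

data(simts) # simulated NDVI series shipped with bfast
datats <- ts(rowSums(simts$time.series))
tsp(datats) <- tsp(simts$time.series)

# Candidate parameter combinations; extend with whatever you want to tune.
grid <- expand.grid(order = 1:3, h = c(0.1, 0.15, 0.25))

# Score each combination by the minimum LWZ over candidate break counts.
grid$score <- apply(grid, 1, function(p) {
  bp <- bfastlite(datats, order = p[["order"]], h = p[["h"]])
  min(summary(bp$breakpoints)$RSS["LWZ", ])
})

grid[which.min(grid$score), ] # best-scoring parameter set
```

Keep in mind that information criteria computed on the same series are only a proxy; reference data remains the more reliable yardstick.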