Closed mirt001 closed 3 years ago
Thank you for reporting this. I'll look into it
made some minor edits to the initial comment that do not affect the gist of the issue. In the R version I was wrongly assuming that calendars are 0 based, so I was miscalculating the day of year by 1.
after some more tests, in which it happened that there was a break in the last day of the monitoring period, I noticed that the breaks property is a 1 based index. Is this correct? By this I mean that the first date in the monitoring period must have an index of 1, in order for the breaks property to be used correctly as an index. The implication being that my above script is wrong by 1 date. Unfortunately it moves the detected break even further, not closer to the R version. Am I doing something wrong with the indexing? I am editing the comment to correct this bug in my script, but please check the edits to better understand what I mean.
Good evening,
I subscribed to this issue since I'm dealing with the same problem for some of my currents tests, both in Python and OpenCL.
I just was reading the paper (https://arxiv.org/pdf/1807.01751.pdf), and take some time seeing the R files implementation (https://www.rdocumentation.org/packages/strucchange/versions/1.5-2/source - monitoring.R), focusing on breaks calculation.
From the paper, I could understand that (among other things) the calculation of the breaks is dependent on the calculation of bounds values. In the paper the bounds are computed with the following expression: with: being t the monitoring period.
After inspecting the R files I found the "OLS-MOSUM" method the most identical to what is done in Python. In the method, the bounds are calculated as:
The formula is identical to what is done in Python since 'lam' (critval) is previously multiplied by the sqrt of 2. However, I think that in Python the division value inside the log is calculated differently from the R code. 't', as stated in the paper should be the monitoring period and 'n' the last value of the history, but in Python, you do this:
I think you divide by the last mapped_indice of all time serie. What do you think about:
Can you give me some feedback about this practice?
Thank you Dmitry!
Hi @Carolina710. Python is 0-indexed, in comparison to R, which is 1-indexed. Furthermore, in Python, you can index with -1 to get the last value in the array, hence array[-1] and array[n-1] are equivalent. Se here for more information
I am currently looking into the issue
Hi, thanks for your reply.
I understand that logic, I just thought that "n" or "self.n" (depending on your version) was the number of data images from the period of history, right? Like: "self.n = self._compute_end_history (dates)". I also thought that the mapped indices were calculated for all time series, including the history and the monitoring periods, with a total of "N" data images.
So, in this scenario, you take the last mapped index of all time series, that is also the last data from the monitoring period [-1] or [N-1], not the last one in history [n-1], I think ...? And my question was, shouldn't it be the last mapped index in history? Like, "histsize" in R.
Thank you very much!
Hi @mirt001 and @Carolina710 This particular bug should be fixed now. However, I need to look more thoroughly into the boundary computation soon (including the issue @Carolina710 mentioned).
Thanks everyone!
I think that I have misunderstood you answer earlier @Carolina710. You are right, it should be divided by the last index in the history period. I have pushed it to the develop branch. Nice catch!
Thanks @mortvest and also thanks @Carolina710 I just checked the results using the latest develop branch, on the above Python test, and while the results are closer, they are still different in the sense that the break is detected at a different date. The new detected date for python is 2018 146 , or 2018-05-26, which is one date too early, when compared to the R version, which detects the break on 2018 162, or 2018-05-26. Are you getting different results?
I will test this more to see if Python is always 1 date away, or if there are more inconsistencies.
The returned breakpoint is 0-indexed, so you don't need to subtract 1 from it. I get 2018 163 this way. The small error is due to the way the dates are converted, which is outside of the scope of this implementation. The output breakpoint index is the same as in R version.
I am using the following Python and R bfast versions at the date of the creation of this issue.
To minimize complexity, I created these 2 self-contained python and R scripts, that are almost direct translations:
2018 122
2018 162
I believe that both runs are run with the same parameters, and they should give the same break date. Am I missing some parameter?