Open heib6xinyu opened 5 months ago
Hi @heib6xinyu! Thank you very much for providing such a nice example alongside all the code you used to generate it. However, I think there is indeed a misunderstanding about what the function does. Everything I describe below can also be found in our docs here and especially here, which contains an example (actually using data similar to yours!) of the rolling mechanism, all written out.
But in short: as the name suggests, the function "rolls" a window over your data. It always tries to make the window as large as possible (*), until it either reaches the end of the data or the maximum timeshift parameter. Every window smaller than the minimum timeshift window is removed. It will not create all possible windows between the min and max timeshift. So the behaviour you are seeing is actually expected.
(*) If you ask why this is the case: feature extraction and any further ML after that works best and makes the most sense if all windows have the same size. So in principle it would be best to set the minimum timeshift parameter to the maximum timeshift parameter. As this might be a bit "wasteful" (we would throw away a lot of data), we give users the option to choose both parameters independently.
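The rolling rule described in this reply can be sketched in plain Python. This is only an illustrative emulation, not tsfresh code: `roll_windows` is a hypothetical helper, and it assumes that a window is dropped when it has fewer than `min_timeshift + 1` rows, which matches the example in the report (the smallest surviving window has 2 rows for min_timeshift 1).

```python
def roll_windows(timesteps, min_timeshift, max_timeshift):
    """Emulate the rolling rule for a single id (hypothetical helper).

    For every end timestep, take the largest window that ends there
    (at most max_timeshift + 1 rows), then drop windows that have
    fewer than min_timeshift + 1 rows (assumption, see lead-in).
    """
    windows = {}
    for end_pos, end in enumerate(timesteps):
        start_pos = max(0, end_pos - max_timeshift)
        window = timesteps[start_pos:end_pos + 1]
        if len(window) > min_timeshift:
            windows[end] = window
    return windows


# Product "a" with timesteps 0..5, min_timeshift 1, max_timeshift 3
rolled = roll_windows(list(range(6)), min_timeshift=1, max_timeshift=3)
for end, window in rolled.items():
    print(("a", end), window)

# Setting both parameters to the same value keeps only full-size windows
uniform = roll_windows(list(range(6)), min_timeshift=3, max_timeshift=3)
```

Under these assumptions, each end timestep yields exactly one window, so the smaller windows only appear at the start of the series, and `uniform` contains only windows of 4 rows.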
The problem: The windows produced by roll_time_series do not look the way I expect (or I understand the function wrong). In short, it looks like only the windows of max_timeshift are properly formed. The issue is complicated to detect, and the process of discovering it is hard to explain, but I will try my best.

I may be misunderstanding the functionality of roll_time_series. But from its description, I expect the function to create continuous rolling windows of every size between min_timeshift and max_timeshift. For example, say I have products a to g over a time period 0 to 4, with some features related to them, and I set min_timeshift to 1 and max_timeshift to 3. Then after I run this data through roll_time_series, I should get a frame that looks like this for product a:

Window of shift 1:

id      timestep  features
(a, 1)  0         f1 f2...
(a, 1)  1         f1 f2...
(a, 2)  1         f1 f2...
(a, 2)  2         f1 f2...
(a, 3)  2         f1 f2...
(a, 3)  3         f1 f2...
(a, 4)  3         f1 f2...
(a, 4)  4         f1 f2...
(a, 5)  4         f1 f2...
(a, 5)  5         f1 f2...

Window of shift 2:

id      timestep  features
(a, 2)  0         f1 f2...
(a, 2)  1         f1 f2...
(a, 2)  2         f1 f2...
(a, 3)  1         f1 f2...
(a, 3)  2         f1 f2...
(a, 3)  3         f1 f2...
(a, 4)  2         f1 f2...
(a, 4)  3         f1 f2...
(a, 4)  4         f1 f2...
(a, 5)  3         f1 f2...
(a, 5)  4         f1 f2...
(a, 5)  5         f1 f2...

etc.

However, what I found the function actually produces does not match this expectation. It is as follows:

Window of shift 1:

(a, 1)  0         f1 f2...
(a, 1)  1         f1 f2...

Window of shift 2:

(a, 2)  0         f1 f2...
(a, 2)  1         f1 f2...
(a, 2)  2         f1 f2...

Window of shift 3:

(a, 3)  0         f1 f2...
(a, 3)  1         f1 f2...
(a, 3)  2         f1 f2...
(a, 3)  3         f1 f2...
(a, 4)  1         f1 f2...
(a, 4)  2         f1 f2...
(a, 4)  3         f1 f2...
(a, 4)  4         f1 f2...
(a, 5)  2         f1 f2...
(a, 5)  3         f1 f2...
(a, 5)  4         f1 f2...
(a, 5)  5         f1 f2...
I guess the issue could be that the data from the window of shift 2 overwrites most of the data from the window of shift 1, since the ids are the same (product id, end timestep). The only untouched data from the window of shift 1 is the following:

(a, 1)  0         f1 f2...
(a, 1)  1         f1 f2...

But I can't tell for sure, unless this is exactly what the function is intended to do. In that case I am also confused about why it makes standalone windows smaller than max_timeshift. What is the purpose of those?

I cannot provide my dataset, but you can create dummy data as I described, and run the following scripts to see what I mean.
The above code puts the rolling frames of different sizes (in my example, the window of shift 1 has size 2, the window of shift 2 has size 3, and so on) into a dictionary. The form of this data dictionary is as follows:

data = {window_size: {id_str: [rolling_frames of id_str that have the size of window_size]}}

Then, for each window size in data.keys(), you can run the following, replacing the number 3 in data[3] with the window size (e.g. 2, 3, 4).
This shows, for the products with window sizes 2, 3 and 4, how many rolling windows each one has. The length dictionary has the number of rolling frames as key and, as value, the product ids that have that many rolling frames. You will find that, except for data[4] (the window size caused by max_timeshift = 3), every call to length.keys() returns dict_keys([1]), meaning that every product id of window size 2 and 3 consists of only 1 rolling window.

For example, if I run data[3]['a'], I get only this result:

[(a, 2)  0  f1 f2...
 (a, 2)  1  f1 f2...
 (a, 2)  2  f1 f2...]

But for data[4]['a'], I get the following result:

[(a, 3)  0  f1 f2...
 (a, 3)  1  f1 f2...
 (a, 3)  2  f1 f2...
 (a, 3)  3  f1 f2...,
 (a, 4)  1  f1 f2...
 (a, 4)  2  f1 f2...
 (a, 4)  3  f1 f2...
 (a, 4)  4  f1 f2...,
 (a, 5)  2  f1 f2...
 (a, 5)  3  f1 f2...
 (a, 5)  4  f1 f2...
 (a, 5)  5  f1 f2...]
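The counting step described above can be illustrated with a small, self-contained sketch. This is not the script from the report: the `rolled` dictionary is a hand-written stand-in for the observed output for product a (only timesteps, no features), and `count_windows` is a hypothetical helper that builds the length dictionary as described.

```python
from collections import defaultdict

# Hypothetical stand-in for the observed rolled frame of product "a":
# {(id, end timestep): [timesteps contained in that window]}
rolled = {
    ("a", 1): [0, 1],
    ("a", 2): [0, 1, 2],
    ("a", 3): [0, 1, 2, 3],
    ("a", 4): [1, 2, 3, 4],
    ("a", 5): [2, 3, 4, 5],
}

# data = {window_size: {id_str: [windows of that size]}}
data = defaultdict(lambda: defaultdict(list))
for (id_str, end), window in rolled.items():
    data[len(window)][id_str].append(((id_str, end), window))


def count_windows(per_id):
    """Map number of windows -> list of ids that have that many windows."""
    length = defaultdict(list)
    for id_str, frames in per_id.items():
        length[len(frames)].append(id_str)
    return dict(length)


for size in sorted(data):
    print(size, count_windows(data[size]))
```

With this stand-in data, sizes 2 and 3 each hold a single window for product a, while size 4 holds three, which mirrors the observation in the report.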
Anything else we need to know?:
Environment: