[Data Preprocessing] Questions about the data preprocessing procedure.

customtiy13 commented 1 year ago

I would greatly appreciate it if you could elaborate on how to process the dataset.

In the Datasets section, it says that all datasets are processed as a sliding window view, and the format is composed of 4 numpy.ndarray objects.

Could you explain what "x, y, x_offset, y_offset" mean or, better yet, release the preprocessing code?

Thank you very much for your time and attention to my inquiries.

Echo-Ji commented 1 year ago

Hi, thank you for your attention. Here's a refined explanation:

x represents historical data samples, with each sample having the shape (#lookback_window, #nodes, #flow_types).
y denotes the labels, which are composed of future data samples. Each y sample has the shape (#predict_horizon, #nodes, #flow_types).
x_offset indicates the offsets related to the lookback window of x. For instance, if the current time is 10:00, the offset of 8:00 is -2 when working on an hourly basis but -4 when dealing with 30-minute intervals. It's important to note that we consider the most recent time index as having a 0 offset.
y_offset represents the offsets for the prediction horizon of y, which is typically set to 1 in our configuration.

For instance, if you are using data from the previous 12 time steps to forecast the next one, the offsets for x and y should be [-11, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1, 0] and [1], respectively.
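To make the shapes above concrete, here is a minimal sketch of a sliding-window split in NumPy. It is not the repository's preprocessing code; the function name `make_windows` and the toy array are assumptions made purely for illustration.

```python
import numpy as np

def make_windows(data, x_offsets, y_offsets):
    """Hypothetical helper: slice data of shape (T, nodes, flow_types) into samples.

    Returns x with shape (samples, len(x_offsets), nodes, flow_types)
    and y with shape (samples, len(y_offsets), nodes, flow_types).
    """
    x_offsets = np.asarray(x_offsets)
    y_offsets = np.asarray(y_offsets)
    # First time index that has a complete lookback window.
    min_t = -int(x_offsets.min())
    # Exclusive upper bound so that every label index stays inside the data.
    max_t = data.shape[0] - int(y_offsets.max())
    x = np.stack([data[t + x_offsets] for t in range(min_t, max_t)])
    y = np.stack([data[t + y_offsets] for t in range(min_t, max_t)])
    return x, y

# Toy data: 20 time steps, 3 nodes, 2 flow types.
data = np.arange(20 * 3 * 2, dtype=np.float32).reshape(20, 3, 2)
x_offsets = np.arange(-11, 1)   # [-11, ..., 0]: the previous 12 steps
y_offsets = np.array([1])       # predict the next step
x, y = make_windows(data, x_offsets, y_offsets)
print(x.shape, y.shape)
```

Each sample is anchored at a "current" index t (offset 0); the last row of an x sample is the data at t, and the y sample is the data at t + 1.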

I hope this explanation addresses your question. If you have further questions, please do not hesitate to ask me. :smile:

jnjnjmjm commented 7 months ago

Hi, I have some questions about the x_offsets sequence, and I would appreciate it if you could spend some time answering them. I am debugging the code on the NYCBike_1 dataset and get an x_offsets sequence like [-73, -72, -71, -70, -69, -49, -48, -47, -46, -45, -25, -24, -23, -22, -21, -3, -2, -1, 0]. According to the description in the paper, I think the first 15 timesteps are data from the past 3 days. In that case, the last 4 timesteps cover the past 4 hours, rather than the previous 2 hours described in the paper. This confused me. Is this part of the paper accurate, or have I misunderstood the meaning of the data? Thank you for your time.

Echo-Ji commented 7 months ago

You are right. The other three datasets use the previous 2 hours of data, but the NYCBike_1 dataset uses the previous 4 hours.
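For readers who want to reproduce the sequence quoted in the question, here is a rough sketch that rebuilds it. The function `build_x_offsets` and its parameters are hypothetical names, and the rule that each daily window is centered on the target step (offset +1) is only inferred from the quoted numbers, not taken from the repository.

```python
def build_x_offsets(steps_per_day, n_days, half_width, n_recent, horizon=1):
    """Hypothetical helper: daily windows from previous days plus recent steps.

    Assumption: each daily window is centered on the target step
    (offset `horizon`) shifted back by a whole number of days.
    """
    offsets = []
    for d in range(n_days, 0, -1):
        center = horizon - d * steps_per_day
        offsets.extend(range(center - half_width, center + half_width + 1))
    # Most recent steps, ending at the current time (offset 0).
    offsets.extend(range(-(n_recent - 1), 1))
    return offsets

# NYCBike_1 is hourly: 24 steps per day, 5 steps per daily window, 4 recent steps.
nycbike1 = build_x_offsets(steps_per_day=24, n_days=3, half_width=2, n_recent=4)
print(nycbike1)
```

With these settings the result has 15 daily offsets followed by 4 recent ones, matching the 19-element sequence in the question.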

zhangruiouc commented 6 months ago

Hello, while looking at the NYCBike2 configuration file, NYCBike2.yaml, I found that input_length is 8 + 9 × 3 = 35. If the input uses the traffic flow from the two hours before the current moment plus the traffic flow around the current time of day in the previous three days, then with the dataset's 30-minute sampling rate, shouldn't input_length be 2h/30min + 9 × 3 = 31? I hope to get your answer. Thank you very much!

Echo-Ji commented 1 week ago

Hi, thanks for your attention and sorry for the late reply.

The current-day window covers the previous 4 hours, not 2 hours. At a 30-minute sampling rate, that is 4h/30min = 8 steps.
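As a quick sanity check of the arithmetic in this exchange, the small sketch below computes the input length from the recent window and the daily windows. The function name and parameter names are assumptions for illustration and do not correspond to keys in the config file.

```python
def input_length(recent_hours, interval_minutes, steps_per_daily_window, n_days):
    """Hypothetical helper: recent steps plus the steps taken from previous days."""
    recent_steps = (recent_hours * 60) // interval_minutes
    return recent_steps + steps_per_daily_window * n_days

# NYCBike2, 30-minute interval, 9 steps per daily window, 3 previous days.
with_4_hours = input_length(4, 30, 9, 3)   # 8 + 27
with_2_hours = input_length(2, 30, 9, 3)   # 4 + 27
# NYCBike_1, hourly interval, 5 steps per daily window, 3 previous days.
nycbike1_len = input_length(4, 60, 5, 3)   # 4 + 15
print(with_4_hours, with_2_hours, nycbike1_len)
```

The 4-hour reading gives 35, the 2-hour reading gives 31, and the same formula gives the 19 offsets reported earlier in the thread for NYCBike_1.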

Echo-Ji / ST-SSL

[Data Preprocessing] Questions about the data preprocessing procedure. #2