ant-research / Pyraformer

Q: so for App flow dataset, the only feature is time? #25

Open mw66 opened 1 year ago

mw66 commented 1 year ago

https://github.com/ant-research/Pyraformer/blob/master/preprocess_flow.py#L19-L22

These lines extract time, weekday, hour, and month,

and the result is used here:

https://github.com/ant-research/Pyraformer/blob/master/preprocess_flow.py#L54-L57

I'm just wondering:

1) Why not use, for example, zone (converted to an integer) as an extra feature? How does the model perform in that case?

2) Or, if the training data contains only the single time feature (without weekday, hour, month), will the model still perform well?

Sorry for the silly questions; I'd like to hear your insight.

Thanks.

Zhazhan commented 1 year ago

Hi,

  1. The 'zone' and 'app_name' information is actually used; see https://github.com/ant-research/Pyraformer/blob/master/preprocess_flow.py#L13 and https://github.com/ant-research/Pyraformer/blob/master/preprocess_flow.py#L57. Each 'app_name' in each 'zone' corresponds to one time series, so we convert the 'app_name' and 'zone' information into a single integer, the 'seq_id'.
  2. It is also possible to make predictions based solely on the historical time series. We introduced these covariates in our implementation to follow previous work.
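
As a rough illustration of point 1, the sketch below maps each distinct ('app_name', 'zone') pair to one integer id and collects the values of each series under that id. It uses plain Python on made-up records; the names `records`, `seq_id_of`, and `series` are hypothetical and are not taken from preprocess_flow.py, which may assign the ids differently.

```python
# Hypothetical toy records: (app_name, zone, value). Values are made up.
records = [
    ("a", "z1", 1.0),
    ("a", "z1", 2.0),
    ("b", "z1", 3.0),
    ("b", "z2", 4.0),
    ("a", "z2", 5.0),
    ("b", "z2", 6.0),
]

# One integer per distinct (app_name, zone) pair, assigned in sorted order.
pairs = sorted({(app, zone) for app, zone, _ in records})
seq_id_of = {pair: idx for idx, pair in enumerate(pairs)}

# Collect the values of each time series under its seq_id.
series = {}
for app, zone, value in records:
    series.setdefault(seq_id_of[(app, zone)], []).append(value)
```

With this encoding the model sees a single categorical covariate per series instead of two separate string columns.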

mw66 commented 1 year ago

OK, so app_name and zone are there, but what about the previous values of the raw input sequence (inside the window)?

Let's check the raw input sequence data, in: https://github.com/ant-research/Pyraformer/blob/master/preprocess_flow.py#L17-L26

        single_df = grouped_data[i][1].drop(labels=['app_name', 'zone'], axis=1).sort_values(by="time", ascending=True)
        times = pd.to_datetime(single_df.time)
        single_df['weekday'] = times.dt.dayofweek / 6
        single_df['hour'] = times.dt.hour / 23
        single_df['month'] = times.dt.month / 12
        temp_data = single_df.values[:, 1:]    # L22, 'time' column is dropped here
        if (temp_data[:, 0] == 0).sum() / len(temp_data) > 0.2:
            continue

        all_data.append(temp_data)

We can see that temp_data[:, 0] is the raw input sequence: 'app_name' and 'zone' are dropped on L17 and 'time' is dropped on L22, so temp_data[:, 0] is the 'value' column from the original CSV file.
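
The column bookkeeping above can be checked on a toy frame. This is only a sketch: it assumes the CSV columns are ordered app_name, zone, time, value (the thread does not show the file), and all values are made up.

```python
import pandas as pd

# Toy frame; the column order (app_name, zone, time, value) is an assumption.
single_df = pd.DataFrame({
    "app_name": ["a", "a", "a"],
    "zone": ["z1", "z1", "z1"],
    "time": ["2021-03-01 00:00:00", "2021-03-01 01:00:00", "2021-03-01 02:00:00"],
    "value": [10.0, 0.0, 30.0],
})

# Same steps as the quoted excerpt.
single_df = single_df.drop(labels=["app_name", "zone"], axis=1).sort_values(by="time", ascending=True)
times = pd.to_datetime(single_df.time)
single_df["weekday"] = times.dt.dayofweek / 6
single_df["hour"] = times.dt.hour / 23
single_df["month"] = times.dt.month / 12

# Columns that survive the [:, 1:] slice, i.e. everything except 'time'.
remaining_columns = list(single_df.columns[1:])
temp_data = single_df.values[:, 1:]

# Fraction of zero readings in column 0, as used by the 0.2 filter.
zero_ratio = (temp_data[:, 0] == 0).sum() / len(temp_data)
```

Under that column-order assumption, `remaining_columns` starts with 'value', so column 0 of `temp_data` is the raw sequence and the zero-ratio filter is applied to it.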

Then, in https://github.com/ant-research/Pyraformer/blob/master/preprocess_flow.py#L55, the line

  single_data[:, 0] = seq_data.copy()

stores the real raw input sequence data in column 0 of single_data.

But in https://github.com/ant-research/Pyraformer/blob/master/data_loader.py#L513-L518:

        cov = all_data[:, :, 1:]   # the real raw input sequence data 'value' (all_data[:, :, 0]) dropped?

        split_start = len(label[0]) - self.pred_length + 1
        data, label = split(split_start, label, cov, self.pred_length)

        return data, label

It looks like column 0 is dropped from the training data.

So my question is: are the previous values of the raw input sequence not used at all in training?
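
To make the question concrete, here is a small NumPy sketch of one possible reading. The quoted excerpt does not show how `label` is built or what `split` does, so both are assumptions here: `label` is taken to be channel 0 of `all_data`, and `split_sketch` is a hypothetical stand-in that places the past values back next to the covariates in each window. If the real code behaves like this, the slice `all_data[:, :, 1:]` removes the value only from `cov`, not from the model input.

```python
import numpy as np

# Toy array shaped (n_series, seq_len, n_channels); channel 0 stands in for
# 'value', the other channels for covariates. All numbers are made up.
n_series, seq_len, pred_length = 1, 6, 2
all_data = np.zeros((n_series, seq_len, 3))
all_data[:, :, 0] = np.arange(1, seq_len + 1)  # value: 1..6
all_data[:, :, 1] = 0.5                        # e.g. an hour covariate
all_data[:, :, 2] = 7.0                        # e.g. a seq_id covariate

label = all_data[:, :, 0]  # ASSUMPTION: not shown in the quoted excerpt
cov = all_data[:, :, 1:]   # as quoted: covariates only


def split_sketch(split_start, label, cov, pred_length):
    """Hypothetical stand-in for split(): one window per predicted step."""
    windows, targets = [], []
    for b in range(len(label)):
        for i in range(pred_length):
            past = label[b, i:split_start + i].copy()[:, None]
            target = past[-1, 0]
            past[-1, 0] = -1  # hide the value that is to be predicted
            window = np.concatenate([past, cov[b, i:split_start + i]], axis=1)
            windows.append(window)
            targets.append(target)
    return np.stack(windows), np.array(targets)


split_start = len(label[0]) - pred_length + 1
data, targets = split_sketch(split_start, label, cov, pred_length)
```

In this sketch, column 0 of every window in `data` holds the earlier values of the sequence, with only the last position masked. Whether the repository's `split` does the same would need to be confirmed in data_loader.py.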