carrtesy / THOC-Pytorch


Why are there abnormal data in the training dataset? #1

Closed tribeband closed 11 months ago

tribeband commented 11 months ago

It is said that THOC does not require labels during training, but I found that anomaly labels are still included in the dataset. See the following code for details:

import os

import numpy as np
import pandas as pd


def load_NeurIPS_TS_MUL(home_dir="."):
    base_dir = "data/NeurIPS-TS"
    normal = pd.read_csv(os.path.join(home_dir, base_dir, "nts_mul_normal.csv"))
    abnormal = pd.read_csv(os.path.join(home_dir, base_dir, "nts_mul_abnormal.csv"))

    # The last column holds the label; all other columns are features.
    train_X, train_y = normal.values[:, :-1], normal.values[:, -1]
    test_X, test_y = abnormal.values[:, :-1], abnormal.values[:, -1]

    train_X, test_X = train_X.astype(np.float32), test_X.astype(np.float32)
    train_y, test_y = train_y.astype(int), test_y.astype(int)

    return train_X, train_y, test_X, test_y

carrtesy commented 11 months ago

Good question!

As you mentioned, THOC and other TSAD baselines usually follow an unsupervised setting: no labels are used during training, and the whole training set is treated as normal samples.

The label column I included is all zeros (https://github.com/carrtesy/THOC-Pytorch/blob/master/data/NeurIPS-TS/nts_mul_normal.csv). It is there only for convenience in implementation: I wanted to write code that generalizes to semi-supervised settings, where a small number of labeled anomalies may be available.
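One way to confirm this locally is to check that the label column contains only zeros. The following is a minimal sketch, assuming the label sits in the last column as in the loader quoted above; `label_column_is_all_zero` is a hypothetical helper and is not part of the repository, and the toy array only stands in for `normal.values`.

```python
import numpy as np


def label_column_is_all_zero(values: np.ndarray) -> bool:
    """Return True if the last column (assumed to hold labels) has no nonzero entry."""
    labels = values[:, -1].astype(int)
    return not labels.any()


# Toy stand-in for normal.values: two feature columns plus one label column.
toy_normal = np.array([[0.1, 0.2, 0.0],
                       [0.3, 0.4, 0.0]])
print(label_column_is_all_zero(toy_normal))
```

With the real file, the array returned by `pd.read_csv(...).values` would be passed in place of the toy array.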

In the dataloader, only train X is used for THOC training; train y is not used. See: https://github.com/carrtesy/THOC-Pytorch/blob/b5863d9a4fc86c2cf1152771472e31f61fd30f9d/exp.py#L66C3-L66C3
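The pattern can be illustrated with a minimal sketch of a windowed dataset that accepts labels but never reads them when producing training samples. The class name `WindowedSeries` and the `window_size` parameter are hypothetical and are not taken from exp.py; the random array is a placeholder for the real training data.

```python
import numpy as np


class WindowedSeries:
    """Sliding windows over X; y is stored but never returned for training."""

    def __init__(self, X, y=None, window_size=4):
        self.X = np.asarray(X, dtype=np.float32)
        self.y = y  # kept only for evaluation or semi-supervised extensions
        self.window_size = window_size

    def __len__(self):
        # Number of full windows that fit in the series.
        return max(len(self.X) - self.window_size + 1, 0)

    def __getitem__(self, idx):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        # Only the features are returned; the labels are not touched.
        return self.X[idx:idx + self.window_size]


train_X = np.random.rand(10, 3)    # placeholder features: 10 steps, 3 channels
train_y = np.zeros(10, dtype=int)  # all-zero labels, as in the normal split
dataset = WindowedSeries(train_X, train_y, window_size=4)
for window in dataset:
    pass  # a model would be fit on `window` alone here
```

Keeping `y` on the object costs nothing in the unsupervised case and leaves room for a semi-supervised variant that reads it.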

Best, Carrtesy

tribeband commented 11 months ago

Aha, got it. Beautiful idea.