NetManAIOps / DejaVu

Code and datasets for FSE'22 paper "Actionable and Interpretable Fault Localization for Recurring Failures in Online Service Systems"
MIT License

Reproducibility Issue #4

Open lizeyan opened 2 years ago

lizeyan commented 2 years ago

Here I list my environment, the command I ran, and the output of one experiment.

Environment

Machine

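Hardware Overview:

  Model Name: MacBook Pro
  Model Identifier:    MacBookPro14,3
  Processor Name:    Quad-Core Intel Core i7
  Processor Speed:    2.9 GHz
  Number of Processors:    1
  Total Number of Cores:  4
  L2 Cache (per Core):    256 KB
  L3 Cache: 8 MB
  Hyper-Threading Technology:    Enabled
  Memory:   16 GB
  System Firmware Version:   451.140.1.0.0
  OS Loader Version:   540.120.3~19
  SMC Version (system):    2.45f5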

Docker Info

$ docker info
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc., v0.8.2)
  compose: Docker Compose (Docker Inc., v2.6.1)
  extension: Manages Docker extensions (Docker Inc., v0.2.7)
  sbom: View the packaged-based Software Bill Of Materials (SBOM) for an image (Anchore Inc., 0.6.0)
  scan: Docker Scan (Docker Inc., v0.17.0)

Server:
 Containers: 1
  Running: 0
  Paused: 0
  Stopped: 1
 Images: 2
 Server Version: 20.10.17
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 10c12954828e7c7c9b6e0ea9b0c02b01407d3ae1
 runc version: v1.1.2-0-ga916309
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 5.10.104-linuxkit
 Operating System: Docker Desktop
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 7.774GiB
 Name: docker-desktop
 ID: WKZ5:6KJZ:3K6S:I7WY:LTLR:3TNP:D23G:N3C7:6QPG:KG44:WEEK:CVMR
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 HTTP Proxy: http.docker.internal:3128
 HTTPS Proxy: http.docker.internal:3128
 No Proxy: hubproxy.docker.internal
 Username: lizytalk
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  hubproxy.docker.internal:5000
  127.0.0.0/8
 Live Restore Enabled: false

Docker Image Info

REPOSITORY        TAG       IMAGE ID       CREATED        SIZE
lizytalk/dejavu   latest    32d6db301926   2 months ago   17.3GB

Command

docker run -it --rm -v $(realpath .):/workspace lizytalk/dejavu bash -c 'source .envrc && python exp/run_GAT_node_classification.py -H=4 -L=8 -fe=GRU -bal=True --data_dir=./data/A1 --max_epoch=20'

Note that --max_epoch=20 is used here only to validate the program quickly.

Output


=============
== PyTorch ==
=============

NVIDIA Release 21.11 (build 29224839)
PyTorch Version 1.11.0a0+b6df043

Container image Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Copyright (c) 2014-2021 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies    (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU                      (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006      Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015      Google Inc.
Copyright (c) 2015      Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.

NVIDIA Deep Learning Profiler (dlprof) Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

WARNING: The NVIDIA Driver was not detected.  GPU functionality will not be available.
   Use the NVIDIA Container Toolkit to start this container with GPU support; see
   https://docs.nvidia.com/datacenter/cloud-native/ .

NOTE: MOFED driver for multi-node communication was not detected.
      Multi-node communication performance may be reduced.

NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
   insufficient for PyTorch.  NVIDIA recommends the use of the following flags:
   docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...

Using backend: pytorch
2022-09-03 03:02:35.245 | INFO     | failure_dependency_graph.FDG_config:process_args:62 - torch.cuda.is_available()=False
2022-09-03 03:02:35.332 | INFO     | DejaVu.workflow:_train_exp_CFL:34 -
================================================Config=============================================
{'FI_feature_dim': 3,
 'GAT_layers': 8,
 'GAT_num_heads': 4,
 'GAT_residual': True,
 'GAT_shared_feature_mapper': False,
 'augmentation': False,
 'balance_train_set': True,
 'batch_size': 16,
 'cache_dir': PosixPath('/tmp/SSF/.cache'),
 'checkpoint_metric': 'val_loss',
 'cuda': False,
 'data_dir': PosixPath('data/A1'),
 'dataset_split_ratio': (0.4, 0.2, 0.4),
 'display_epoch_freq': 10,
 'display_second_freq': 5,
 'drop_FDG_edges_fraction': 0.0,
 'dropout': False,
 'early_stopping_epoch_patience': 500,
 'es': True,
 'faults_path': None,
 'feature_projector_type': 'GRU',
 'flush_dataset_cache': True,
 'gradient_clip_val': 1.0,
 'graph_config_path': None,
 'init_lr': 0.01,
 'max_epoch': 20,
 'metrics_path': None,
 'output_base_path': PosixPath('/SSF/output'),
 'output_dir': PosixPath('/SSF/output/run_GAT_node_classification.py.2022-09-03T03:02:35.245139'),
 'p': 0.25,
 'q': 0.25,
 'random_walk_length': 8,
 'rec_loss_weight': 1.0,
 'test_batch_size': 128,
 'test_epoch_freq': 100,
 'test_second_freq': 30.0,
 'train_set_repeat': 1,
 'train_set_sampling': 1.0,
 'ts_feature_mode': 'full',
 'use_anomaly_direction_constraint': False,
 'valid_epoch_freq': 10,
 'weight_decay': 0.01,
 'window_size': (10, 10)}
===================================================================================================

2022-09-03 03:02:36.842 | INFO     | DejaVu.workflow:_train_exp_CFL:39 - reproducibility info: {'command_line': 'python exp/run_GAT_node_classification.py -H=4 -L=8 -fe=GRU -bal=True --data_dir=./data/A1 --max_epoch=20', 'time': 'Sat Sep  3 03:02:35 2022', 'git_root': '/workspace', 'git_url': 'https://github.com/NetManAIOps/DejaVu/tree/00d36dd07eed266840840769ecbc4abf0322319a', 'git_has_uncommitted_changes': False}
2022-09-03 03:02:37.800 | INFO     | failure_dependency_graph.failure_dependency_graph:_load_FDG:206 - Loading FDG from data/A1/FDG.pkl
2022-09-03 03:02:40.971 | INFO     | failure_dependency_graph.model_interface:__init__:47 - dataset_cache_dir=/tmp/SSF/.cache/faults=data_A1_faults.csv.graph=data_A1_graph.yml.metrics=data_A1_metrics.norm.pkl.use_anomaly_direction_constraint=False
2022-09-03 03:02:41.019 | INFO     | failure_dependency_graph.model_interface:split_failures_by_type:209 - fault ids with multiple root causes: []
2022-09-03 03:02:41.019 | INFO     | failure_dependency_graph.model_interface:split_failures_by_type:234 - fault_type=('Docker CPU',)
train_length=7     train_ids=[5, 23, 12, 37, 0, 58, 30]
validation_length=4     validation_ids=[75, 29, 3, 15]
test_length=8     test_ids=[65, 59, 18, 50, 44, 41, 8, 52]
(7   recurring faults)
2022-09-03 03:02:41.020 | INFO     | failure_dependency_graph.model_interface:split_failures_by_type:234 - fault_type=('Docker',)
train_length=12    train_ids=[35, 25, 56, 71, 53, 17, 13, 9, 6, 1, 7, 26]
validation_length=6     validation_ids=[2, 21, 63, 68, 70, 54]
test_length=12    test_ids=[24, 31, 28, 77, 48, 49, 16, 67, 76, 62, 60, 69]
(12  recurring faults)
2022-09-03 03:02:41.021 | INFO     | failure_dependency_graph.model_interface:split_failures_by_type:234 - fault_type=('DB Session',)
train_length=2     train_ids=[36, 47]
validation_length=2     validation_ids=[74, 45]
test_length=3     test_ids=[19, 4, 57]
(1   recurring faults)
2022-09-03 03:02:41.021 | INFO     | failure_dependency_graph.model_interface:split_failures_by_type:234 - fault_type=('DB State',)
train_length=2     train_ids=[46, 11]
validation_length=1     validation_ids=[27]
test_length=2     test_ids=[34, 10]
(2   recurring faults)
2022-09-03 03:02:41.022 | INFO     | failure_dependency_graph.model_interface:split_failures_by_type:234 - fault_type=('OS Network',)
train_length=6     train_ids=[22, 39, 64, 61, 20, 38]
validation_length=4     validation_ids=[51, 42, 55, 32]
test_length=7     test_ids=[40, 14, 33, 66, 43, 73, 72]
(4   recurring faults)
2022-09-03 03:02:41.023 | INFO     | failure_dependency_graph.model_interface:split_failures_by_type:247 - repeat [5, 23, 12, 37, 0, 58, 30] for 1 times
2022-09-03 03:02:41.024 | INFO     | failure_dependency_graph.model_interface:split_failures_by_type:247 - repeat [35, 25, 56, 71, 53, 17, 13, 9, 6, 1, 7, 26] for 1 times
2022-09-03 03:02:41.024 | INFO     | failure_dependency_graph.model_interface:split_failures_by_type:247 - repeat [36, 47] for 6 times
2022-09-03 03:02:41.024 | INFO     | failure_dependency_graph.model_interface:split_failures_by_type:247 - repeat [46, 11] for 6 times
2022-09-03 03:02:41.025 | INFO     | failure_dependency_graph.model_interface:split_failures_by_type:247 - repeat [22, 39, 64, 61, 20, 38] for 2 times
2022-09-03 03:02:41.025 | INFO     | failure_dependency_graph.model_interface:split_failures_by_type:265 - len(train_list)=55 len(set(train_list))=29 len(validation_list)=17 len(test_list)=32
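
The "repeat ... for N times" lines above appear because balance_train_set is True. The logged repeat counts (1, 1, 6, 6, 2) are consistent with dividing the largest per-type training set size (12) by each type's own size and rounding down. That rule is inferred from this log only, not taken from the DejaVu source, and the function name below is hypothetical. A minimal sketch that reproduces the logged sizes under that assumption:

```python
from typing import Dict, List


def balance_by_repetition(train_ids_by_type: Dict[str, List[int]]) -> List[int]:
    """Oversample each fault type by repeating its failure IDs.

    Assumed rule (inferred from the log, not from the DejaVu code):
    repeat = largest_class_size // class_size.
    """
    largest = max(len(ids) for ids in train_ids_by_type.values())
    train_list: List[int] = []
    for ids in train_ids_by_type.values():
        repeat = largest // len(ids)
        train_list.extend(ids * repeat)
    return train_list


# Training IDs copied from the log above.
train_ids_by_type = {
    "Docker CPU": [5, 23, 12, 37, 0, 58, 30],
    "Docker": [35, 25, 56, 71, 53, 17, 13, 9, 6, 1, 7, 26],
    "DB Session": [36, 47],
    "DB State": [46, 11],
    "OS Network": [22, 39, 64, 61, 20, 38],
}

train_list = balance_by_repetition(train_ids_by_type)
print(len(train_list), len(set(train_list)))  # 55 29
```

The two printed numbers match len(train_list)=55 and len(set(train_list))=29 in the log, so a mismatch here on another machine would point at a different split rather than at the balancing step.
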
2022-09-03 03:02:41.081 | INFO     | DejaVu.workflow:_train_exp_CFL:51 -
==========================Model Summary================================================================================================================
Layer (type:depth-idx)                   Param #
=================================================================
GAT                                      --
├─FIFeatureExtractor: 1-1                --
│    └─ModuleList: 2-1                   --
│    │    └─GRUFeatureModule: 3-1        --
│    │    │    └─GRU: 4-1                81
│    │    │    └─Sequential: 4-2         --
│    │    │    │    └─Reshape: 5-1       --
│    │    │    │    └─Conv1d: 5-2        100
│    │    │    │    └─GELU: 5-3          --
│    │    │    │    └─Flatten: 5-4       --
│    │    │    │    └─Linear: 5-5        543
│    │    │    │    └─Reshape: 5-6       --
│    │    └─GRUFeatureModule: 3-2        --
│    │    │    └─GRU: 4-3                288
│    │    │    └─Sequential: 4-4         --
│    │    │    │    └─Reshape: 5-7       --
│    │    │    │    └─Conv1d: 5-8        100
│    │    │    │    └─GELU: 5-9          --
│    │    │    │    └─Flatten: 5-10      --
│    │    │    │    └─Linear: 5-11       543
│    │    │    │    └─Reshape: 5-12      --
│    │    └─GRUFeatureModule: 3-3        --
│    │    │    └─GRU: 4-5                81
│    │    │    └─Sequential: 4-6         --
│    │    │    │    └─Reshape: 5-13      --
│    │    │    │    └─Conv1d: 5-14       100
│    │    │    │    └─GELU: 5-15         --
│    │    │    │    └─Flatten: 5-16      --
│    │    │    │    └─Linear: 5-17       543
│    │    │    │    └─Reshape: 5-18      --
│    │    └─GRUFeatureModule: 3-4        --
│    │    │    └─GRU: 4-7                117
│    │    │    └─Sequential: 4-8         --
│    │    │    │    └─Reshape: 5-19      --
│    │    │    │    └─Conv1d: 5-20       100
│    │    │    │    └─GELU: 5-21         --
│    │    │    │    └─Flatten: 5-22      --
│    │    │    │    └─Linear: 5-23       543
│    │    │    │    └─Reshape: 5-24      --
│    │    └─GRUFeatureModule: 3-5        --
│    │    │    └─GRU: 4-9                90
│    │    │    └─Sequential: 4-10        --
│    │    │    │    └─Reshape: 5-25      --
│    │    │    │    └─Conv1d: 5-26       100
│    │    │    │    └─GELU: 5-27         --
│    │    │    │    └─Flatten: 5-28      --
│    │    │    │    └─Linear: 5-29       543
│    │    │    │    └─Reshape: 5-30      --
│    │    └─GRUFeatureModule: 3-6        --
│    │    │    └─GRU: 4-11               153
│    │    │    └─Sequential: 4-12        --
│    │    │    │    └─Reshape: 5-31      --
│    │    │    │    └─Conv1d: 5-32       100
│    │    │    │    └─GELU: 5-33         --
│    │    │    │    └─Flatten: 5-34      --
│    │    │    │    └─Linear: 5-35       543
│    │    │    │    └─Reshape: 5-36      --
│    │    └─GRUFeatureModule: 3-7        --
│    │    │    └─GRU: 4-13               81
│    │    │    └─Sequential: 4-14        --
│    │    │    │    └─Reshape: 5-37      --
│    │    │    │    └─Conv1d: 5-38       100
│    │    │    │    └─GELU: 5-39         --
│    │    │    │    └─Flatten: 5-40      --
│    │    │    │    └─Linear: 5-41       543
│    │    │    │    └─Reshape: 5-42      --
│    │    └─GRUFeatureModule: 3-8        --
│    │    │    └─GRU: 4-15               54
│    │    │    └─Sequential: 4-16        --
│    │    │    │    └─Reshape: 5-43      --
│    │    │    │    └─Conv1d: 5-44       100
│    │    │    │    └─GELU: 5-45         --
│    │    │    │    └─Flatten: 5-46      --
│    │    │    │    └─Linear: 5-47       543
│    │    │    │    └─Reshape: 5-48      --
│    │    └─GRUFeatureModule: 3-9        --
│    │    │    └─GRU: 4-17               63
│    │    │    └─Sequential: 4-18        --
│    │    │    │    └─Reshape: 5-49      --
│    │    │    │    └─Conv1d: 5-50       100
│    │    │    │    └─GELU: 5-51         --
│    │    │    │    └─Flatten: 5-52      --
│    │    │    │    └─Linear: 5-53       543
│    │    │    │    └─Reshape: 5-54      --
│    │    └─GRUFeatureModule: 3-10       --
│    │    │    └─GRU: 4-19               54
│    │    │    └─Sequential: 4-20        --
│    │    │    │    └─Reshape: 5-55      --
│    │    │    │    └─Conv1d: 5-56       100
│    │    │    │    └─GELU: 5-57         --
│    │    │    │    └─Flatten: 5-58      --
│    │    │    │    └─Linear: 5-59       543
│    │    │    │    └─Reshape: 5-60      --
│    │    └─GRUFeatureModule: 3-11       --
│    │    │    └─GRU: 4-21               54
│    │    │    └─Sequential: 4-22        --
│    │    │    │    └─Reshape: 5-61      --
│    │    │    │    └─Conv1d: 5-62       100
│    │    │    │    └─GELU: 5-63         --
│    │    │    │    └─Flatten: 5-64      --
│    │    │    │    └─Linear: 5-65       543
│    │    │    │    └─Reshape: 5-66      --
│    │    └─GRUFeatureModule: 3-12       --
│    │    │    └─GRU: 4-23               81
│    │    │    └─Sequential: 4-24        --
│    │    │    │    └─Reshape: 5-67      --
│    │    │    │    └─Conv1d: 5-68       100
│    │    │    │    └─GELU: 5-69         --
│    │    │    │    └─Flatten: 5-70      --
│    │    │    │    └─Linear: 5-71       543
│    │    │    │    └─Reshape: 5-72      --
│    │    └─GRUFeatureModule: 3-13       --
│    │    │    └─GRU: 4-25               54
│    │    │    └─Sequential: 4-26        --
│    │    │    │    └─Reshape: 5-73      --
│    │    │    │    └─Conv1d: 5-74       100
│    │    │    │    └─GELU: 5-75         --
│    │    │    │    └─Flatten: 5-76      --
│    │    │    │    └─Linear: 5-77       543
│    │    │    │    └─Reshape: 5-78      --
│    │    └─GRUFeatureModule: 3-14       --
│    │    │    └─GRU: 4-27               63
│    │    │    └─Sequential: 4-28        --
│    │    │    │    └─Reshape: 5-79      --
│    │    │    │    └─Conv1d: 5-80       100
│    │    │    │    └─GELU: 5-81         --
│    │    │    │    └─Flatten: 5-82      --
│    │    │    │    └─Linear: 5-83       543
│    │    │    │    └─Reshape: 5-84      --
│    │    └─GRUFeatureModule: 3-15       --
│    │    │    └─GRU: 4-29               243
│    │    │    └─Sequential: 4-30        --
│    │    │    │    └─Reshape: 5-85      --
│    │    │    │    └─Conv1d: 5-86       100
│    │    │    │    └─GELU: 5-87         --
│    │    │    │    └─Flatten: 5-88      --
│    │    │    │    └─Linear: 5-89       543
│    │    │    │    └─Reshape: 5-90      --
│    │    └─GRUFeatureModule: 3-16       --
│    │    │    └─GRU: 4-31               243
│    │    │    └─Sequential: 4-32        --
│    │    │    │    └─Reshape: 5-91      --
│    │    │    │    └─Conv1d: 5-92       100
│    │    │    │    └─GELU: 5-93         --
│    │    │    │    └─Flatten: 5-94      --
│    │    │    │    └─Linear: 5-95       543
│    │    │    │    └─Reshape: 5-96      --
│    │    └─GRUFeatureModule: 3-17       --
│    │    │    └─GRU: 4-33               153
│    │    │    └─Sequential: 4-34        --
│    │    │    │    └─Reshape: 5-97      --
│    │    │    │    └─Conv1d: 5-98       100
│    │    │    │    └─GELU: 5-99         --
│    │    │    │    └─Flatten: 5-100     --
│    │    │    │    └─Linear: 5-101      543
│    │    │    │    └─Reshape: 5-102     --
│    │    └─GRUFeatureModule: 3-18       --
│    │    │    └─GRU: 4-35               144
│    │    │    └─Sequential: 4-36        --
│    │    │    │    └─Reshape: 5-103     --
│    │    │    │    └─Conv1d: 5-104      100
│    │    │    │    └─GELU: 5-105        --
│    │    │    │    └─Flatten: 5-106     --
│    │    │    │    └─Linear: 5-107      543
│    │    │    │    └─Reshape: 5-108     --
│    │    └─GRUFeatureModule: 3-19       --
│    │    │    └─GRU: 4-37               81
│    │    │    └─Sequential: 4-38        --
│    │    │    │    └─Reshape: 5-109     --
│    │    │    │    └─Conv1d: 5-110      100
│    │    │    │    └─GELU: 5-111        --
│    │    │    │    └─Flatten: 5-112     --
│    │    │    │    └─Linear: 5-113      543
│    │    │    │    └─Reshape: 5-114     --
├─Identity: 1-2                          --
├─ModuleList: 1-3                        --
│    └─GATConv: 2-2                      --
│    │    └─Linear: 3-20                 36
│    │    └─Dropout: 3-21                --
│    │    └─Dropout: 3-22                --
│    │    └─LeakyReLU: 3-23              --
│    │    └─Linear: 3-24                 36
│    └─GATConv: 2-3                      --
│    │    └─Linear: 3-25                 144
│    │    └─Dropout: 3-26                --
│    │    └─Dropout: 3-27                --
│    │    └─LeakyReLU: 3-28              --
│    │    └─Identity: 3-29               --
│    └─GATConv: 2-4                      --
│    │    └─Linear: 3-30                 144
│    │    └─Dropout: 3-31                --
│    │    └─Dropout: 3-32                --
│    │    └─LeakyReLU: 3-33              --
│    │    └─Identity: 3-34               --
│    └─GATConv: 2-5                      --
│    │    └─Linear: 3-35                 144
│    │    └─Dropout: 3-36                --
│    │    └─Dropout: 3-37                --
│    │    └─LeakyReLU: 3-38              --
│    │    └─Identity: 3-39               --
│    └─GATConv: 2-6                      --
│    │    └─Linear: 3-40                 144
│    │    └─Dropout: 3-41                --
│    │    └─Dropout: 3-42                --
│    │    └─LeakyReLU: 3-43              --
│    │    └─Identity: 3-44               --
│    └─GATConv: 2-7                      --
│    │    └─Linear: 3-45                 144
│    │    └─Dropout: 3-46                --
│    │    └─Dropout: 3-47                --
│    │    └─LeakyReLU: 3-48              --
│    │    └─Identity: 3-49               --
│    └─GATConv: 2-8                      --
│    │    └─Linear: 3-50                 144
│    │    └─Dropout: 3-51                --
│    │    └─Dropout: 3-52                --
│    │    └─LeakyReLU: 3-53              --
│    │    └─Identity: 3-54               --
│    └─GATConv: 2-9                      --
│    │    └─Linear: 3-55                 144
│    │    └─Dropout: 3-56                --
│    │    └─Dropout: 3-57                --
│    │    └─LeakyReLU: 3-58              --
│    │    └─Identity: 3-59               --
├─NodeWeightPredictor: 1-4               --
│    └─Sequential: 2-10                  --
│    │    └─Linear: 3-60                 1,664
│    │    └─GELU: 3-61                   --
│    │    └─Linear: 3-62                 128
=================================================================
Total params: 17,267
Trainable params: 17,267
Non-trainable params: 0
=======================================================================================================================================================

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
preprocess metrics for each instance type: 100%|████████████████████████████████████████| 19/19 [00:06<00:00,  2.95it/s]
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Missing logger folder: /SSF/output/run_GAT_node_classification.py.2022-09-03T03:02:35.245139/lightning_logs

  | Name    | Type | Params
---------------------------------
0 | _module | GAT  | 17.6 K
---------------------------------
17.6 K    Trainable params
0         Non-trainable params
17.6 K    Total params
0.070     Total estimated model params size (MB)
2022-09-03 03:02:52.634 | INFO     | DejaVu.models.interface.callbacks:on_validation_epoch_end:52 - epoch=0     val_loss=1.0618     A@1=0.00 % A@2=0.00 % A@3=0.00 % A@5=0.00 % MAR=47.71
2022-09-03 03:02:54.752 | INFO     | DejaVu.models.interface.callbacks:on_train_epoch_end:41 - epoch=0     loss=1.2861    A@1=7.27 % A@2=14.55% A@3=16.36% A@5=16.36% MAR=36.62
2022-09-03 03:02:59.639 | INFO     | DejaVu.models.interface.callbacks:on_train_epoch_end:41 - epoch=4     loss=0.5538    A@1=16.36% A@2=21.82% A@3=47.27% A@5=49.09% MAR=8.98
2022-09-03 03:03:04.952 | INFO     | DejaVu.models.interface.callbacks:on_train_epoch_end:41 - epoch=8     loss=0.1951    A@1=78.18% A@2=80.00% A@3=83.64% A@5=96.36% MAR=2.18
2022-09-03 03:03:06.374 | INFO     | DejaVu.models.interface.callbacks:on_validation_epoch_end:52 - epoch=9     val_loss=0.4763     A@1=29.41% A@2=47.06% A@3=70.59% A@5=82.35% MAR=3.82
Metric val_loss improved. New best score: 0.476
2022-09-03 03:03:07.945 | INFO     | DejaVu.models.interface.callbacks:on_train_epoch_end:41 - epoch=10    loss=0.1300    A@1=76.36% A@2=83.64% A@3=90.91% A@5=96.36% MAR=1.56
2022-09-03 03:03:14.117 | INFO     | DejaVu.models.interface.callbacks:on_train_epoch_end:41 - epoch=15    loss=0.0551    A@1=89.09% A@2=100.00% A@3=100.00% A@5=100.00% MAR=1.11
2022-09-03 03:03:19.016 | INFO     | DejaVu.models.interface.callbacks:on_validation_epoch_end:52 - epoch=19    val_loss=0.2789     A@1=64.71% A@2=88.24% A@3=94.12% A@5=94.12% MAR=2.12
Metric val_loss improved by 0.197 >= min_delta = 0.0. New best score: 0.279
2022-09-03 03:03:19.274 | INFO     | utils.callbacks:on_fit_end:106 - Average epoch time: 1.33
2022-09-03 03:03:19.275 | INFO     | DejaVu.workflow:_train_exp_CFL:99 - trainer.checkpoint_callback.best_model_path='/SSF/output/run_GAT_node_classification.py.2022-09-03T03:02:35.245139/lightning_logs/version_0/checkpoints/epoch=19-A@1=0.647059-val_loss=0.278938-MAR=2.117647.ckpt'
2022-09-03 03:03:19.848 | INFO     | DejaVu.workflow:_train_exp_CFL:100 - {'command_line': 'python exp/run_GAT_node_classification.py -H=4 -L=8 -fe=GRU -bal=True --data_dir=./data/A1 --max_epoch=20', 'time': 'Sat Sep  3 03:03:19 2022', 'git_root': '/workspace', 'git_url': 'https://github.com/NetManAIOps/DejaVu/tree/00d36dd07eed266840840769ecbc4abf0322319a', 'git_has_uncommitted_changes': False}
2022-09-03 03:03:19.863 | WARNING  | utils.load_model:best_checkpoint:35 - ckpt_path=PosixPath('/SSF/output/run_GAT_node_classification.py.2022-09-03T03:02:35.245139/lightning_logs/version_0/checkpoints/last.ckpt') not match
Restoring states from the checkpoint path at /SSF/output/run_GAT_node_classification.py.2022-09-03T03:02:35.245139/lightning_logs/version_0/checkpoints/epoch=19-A@1=0.647059-val_loss=0.278938-MAR=2.117647.ckpt
Loaded model weights from checkpoint at /SSF/output/run_GAT_node_classification.py.2022-09-03T03:02:35.245139/lightning_logs/version_0/checkpoints/epoch=19-A@1=0.647059-val_loss=0.278938-MAR=2.117647.ckpt
2022-09-03 03:03:21.327 | INFO     | DejaVu.models.interface.callbacks:on_test_epoch_end:107 -
A@1=53.12% A@2=90.62% A@3=100.00% A@5=100.00% MAR=1.56
|id  |     |FR |AR |recurring|timestamp                |root cause                                                            |rank-1              |rank-2              |rank-3              |
|65  |✅    |  1|  1|True     |2020-05-30T04:13:00+08:00|docker_002 CPU                                                        |docker_002 CPU      |docker_002          |db_003 Session      |
|59  |✅    |  1|  1|True     |2020-05-29T03:41:00+08:00|docker_001 CPU                                                        |docker_001 CPU      |docker_008          |docker_007          |
|18  |✅    |  1|  1|False    |2020-05-23T00:05:00+08:00|docker_004 CPU                                                        |docker_004 CPU      |docker_004          |db_009              |
|50  |✅    |  1|  1|True     |2020-05-27T05:09:00+08:00|docker_001 CPU                                                        |docker_001 CPU      |os_020 Network      |docker_001          |
|44  |✅    |  1|  1|True     |2020-05-27T01:23:00+08:00|docker_006 CPU                                                        |docker_006 CPU      |docker_006          |docker_002 CPU      |
|41  |✅    |  1|  1|True     |2020-05-26T05:15:00+08:00|docker_002 CPU                                                        |docker_002 CPU      |docker_002          |os_021              |
|8   |✅    |  1|  1|True     |2020-04-11T04:40:00+08:00|docker_008 CPU                                                        |docker_008 CPU      |docker_008          |db_007 Session      |
|52  |✅    |  1|  1|True     |2020-05-28T00:47:00+08:00|docker_001 CPU                                                        |docker_001 CPU      |docker_001          |os_021              |
|24  |❌    |  2|  2|True     |2020-05-23T05:20:00+08:00|docker_005                                                            |docker_005 CPU      |docker_005          |docker_003          |
|31  |✅    |  1|  1|True     |2020-05-24T04:47:00+08:00|docker_004                                                            |docker_004          |os_021 Network      |docker_004 CPU      |
|28  |❌    |  3|  3|True     |2020-05-24T02:47:00+08:00|docker_002                                                            |db_007 Session      |docker_002 CPU      |docker_002          |
|77  |❌    |  2|  2|True     |2020-05-31T05:48:00+08:00|docker_003                                                            |docker_003 CPU      |docker_003          |db_007 Session      |
|48  |✅    |  1|  1|True     |2020-05-27T03:23:00+08:00|docker_001                                                            |docker_001          |os_022 Network      |os_021              |
|49  |❌    |  2|  2|True     |2020-05-27T04:39:00+08:00|docker_007                                                            |docker_007 CPU      |docker_007          |db_007 Session      |
|16  |❌    |  3|  3|True     |2020-05-22T05:18:00+08:00|docker_007                                                            |docker_003 CPU      |docker_007 CPU      |docker_007          |
|67  |❌    |  2|  2|True     |2020-05-30T05:43:00+08:00|docker_002                                                            |os_022 Network      |docker_002          |db_007 Session      |
|76  |❌    |  2|  2|True     |2020-05-31T04:47:00+08:00|docker_006                                                            |docker_006 CPU      |docker_006          |docker_004 CPU      |
|62  |❌    |  2|  2|True     |2020-05-30T00:43:00+08:00|docker_005                                                            |docker_005 CPU      |docker_005          |docker_004 CPU      |
|60  |❌    |  2|  2|True     |2020-05-29T05:11:00+08:00|docker_006                                                            |docker_006 CPU      |docker_006          |docker_004 CPU      |
|69  |❌    |  2|  2|True     |2020-05-31T00:47:00+08:00|docker_001                                                            |os_022 Network      |docker_001          |db_007 Session      |
|19  |❌    |  3|  3|False    |2020-05-23T00:40:00+08:00|db_003 Session                                                        |db_003 Load         |db_003              |db_003 Session      |
|4   |❌    |  2|  2|True     |2020-04-11T02:15:00+08:00|db_007 Session                                                        |db_007 Load         |db_007 Session      |db_007              |
|57  |❌    |  2|  2|False    |2020-05-29T02:11:00+08:00|db_003 Session                                                        |os_021 Network      |db_003 Session      |db_007 Session      |
|34  |✅    |  1|  1|True     |2020-05-25T04:47:00+08:00|db_003 State                                                          |db_003 State        |db_007 Session      |db_003 Session      |
|10  |✅    |  1|  1|True     |2020-04-11T05:45:00+08:00|db_003 State                                                          |db_003 State        |db_007 Session      |os_017              |
|40  |✅    |  1|  1|True     |2020-05-26T04:15:00+08:00|os_020 Network                                                        |os_020 Network      |os_021 Network      |docker_004          |
|14  |✅    |  1|  1|True     |2020-05-22T01:48:00+08:00|os_018 Network                                                        |os_018 Network      |docker_002          |os_018              |
|33  |❌    |  2|  2|False    |2020-05-25T03:47:00+08:00|os_017 Network                                                        |os_019 Network      |os_017 Network      |docker_008 CPU      |
|66  |❌    |  2|  2|True     |2020-05-30T05:13:00+08:00|os_018 Network                                                        |os_022 Network      |os_018 Network      |docker_002          |
|43  |✅    |  1|  1|False    |2020-05-27T00:53:00+08:00|os_017 Network                                                        |os_017 Network      |docker_001          |docker_002          |
|73  |✅    |  1|  1|False    |2020-05-31T03:17:00+08:00|os_017 Network                                                        |os_017 Network      |docker_005          |docker_001          |
|72  |✅    |  1|  1|True     |2020-05-31T02:47:00+08:00|os_021 Network                                                        |os_021 Network      |os_021              |os_022              |

────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       Test metric             DataLoader 0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
           A@1                    0.53125
           A@2                    0.90625
           A@3                      1.0
           A@5                      1.0
           MAR                    1.5625
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
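
The summary metrics can be re-derived from the AR column of the per-failure table above. The sketch below assumes that A@k is the fraction of test failures whose root cause is ranked within the top k, and that MAR is the mean of those ranks; this reading is consistent with the numbers in the log but is not taken from the DejaVu code, and the helper names are hypothetical.

```python
from typing import List

# Root-cause ranks (AR column) of the 32 test failures, in table order.
ranks: List[int] = [
    1, 1, 1, 1, 1, 1, 1, 1,              # Docker CPU
    2, 1, 3, 2, 1, 2, 3, 2, 2, 2, 2, 2,  # Docker
    3, 2, 2,                             # DB Session
    1, 1,                                # DB State
    1, 1, 2, 2, 1, 1, 1,                 # OS Network
]


def top_k_accuracy(ranks: List[int], k: int) -> float:
    """Fraction of failures whose root cause is ranked at position k or better."""
    return sum(r <= k for r in ranks) / len(ranks)


def mean_rank(ranks: List[int]) -> float:
    """Average rank of the true root cause (lower is better)."""
    return sum(ranks) / len(ranks)


for k in (1, 2, 3, 5):
    print(f"A@{k}={top_k_accuracy(ranks, k)}")
print(f"MAR={mean_rank(ranks)}")
```

With the ranks above this gives A@1=0.53125, A@2=0.90625, A@3=1.0, A@5=1.0 and MAR=1.5625, the same values as the test metric table.
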
2022-09-03 03:03:21.358 | INFO     | DejaVu.workflow:<lambda>:27 - Time Report:
|path                                                                                                   |%total     |%parent    |count   |total      |mean(±std)              |min-max              |
|/train_exp_CFL                                                                                         |    100.00%|    100.00%|       1|    46.105s|    46.105(±     0.000)s|    46.105~    46.105|
|/train_exp_CFL/DejaVuDataset.__getitem__                                                               |      8.96%|      8.96%|    1183|     4.132s|     0.003(±     0.011)s|     0.000~     0.134|
|/train_exp_CFL/DejaVuDataset.__getitem__/MetricPreprocessor.__call__                                   |      0.59%|      6.54%|      78|     0.270s|     0.003(±     0.003)s|     0.002~     0.021|
|/train_exp_CFL/DejaVuDataset.__getitem__/_get_global_id_getter                                         |      0.00%|      0.00%|       1|     0.000s|     0.000(±     0.000)s|     0.000~     0.000|
|/train_exp_CFL/DejaVuDataset.__init__                                                                  |      0.01%|      0.01%|       6|     0.003s|     0.001(±     0.001)s|     0.000~     0.003|
|/train_exp_CFL/DejaVuModelInterface.get_collate_fn.<locals>.collate_fn                                 |      0.37%|      0.37%|      84|     0.171s|     0.002(±     0.001)s|     0.001~     0.009|
|/train_exp_CFL/DejaVuModelInterface.test_step                                                          |      0.21%|      0.21%|       1|     0.098s|     0.098(±     0.000)s|     0.098~     0.098|
|/train_exp_CFL/DejaVuModelInterface.training_step                                                      |     18.07%|     18.07%|      80|     8.332s|     0.104(±     0.022)s|     0.061~     0.165|
|/train_exp_CFL/DejaVuModelInterface.validation_step                                                    |      0.63%|      0.63%|       3|     0.290s|     0.097(±     0.012)s|     0.080~     0.105|
|/train_exp_CFL/Epoch Time                                                                              |     57.58%|     57.58%|      20|    26.546s|     1.327(±     0.241)s|     0.978~     2.106|
|/train_exp_CFL/FDG.load                                                                                |      7.28%|      7.28%|       1|     3.357s|     3.357(±     0.000)s|     3.357~     3.357|
|/train_exp_CFL/GAT.__init__                                                                            |      0.08%|      0.08%|       1|     0.036s|     0.036(±     0.000)s|     0.036~     0.036|
|/train_exp_CFL/MetricPreprocessor.extract_features                                                     |     14.21%|     14.21%|       1|     6.553s|     6.553(±     0.000)s|     6.553~     6.553|
|/train_exp_CFL/MetricPreprocessor.extract_features/instance type iter                                  |     13.94%|     98.10%|      19|     6.429s|     0.338(±     0.213)s|     0.105~     0.888|
|/train_exp_CFL/MetricPreprocessor.extract_features/instance type iter/fill na                          |      2.28%|     16.38%|      19|     1.053s|     0.055(±     0.069)s|     0.005~     0.272|
|/train_exp_CFL/MetricPreprocessor.extract_features/instance type iter/fill na/metric iter              |      0.63%|     27.67%|     710|     0.291s|     0.000(±     0.000)s|     0.000~     0.005|
|/train_exp_CFL/MetricPreprocessor.extract_features/instance type iter/get idx dict                     |      0.00%|      0.01%|      19|     0.001s|     0.000(±     0.000)s|     0.000~     0.000|
|/train_exp_CFL/MetricPreprocessor.extract_features/instance type iter/get values from df               |     11.14%|     79.90%|      19|     5.136s|     0.270(±     0.134)s|     0.099~     0.564|
|/train_exp_CFL/MetricPreprocessor.extract_features/instance type iter/get values from df/fill into feat|      6.20%|     55.68%|      19|     2.860s|     0.151(±     0.121)s|     0.000~     0.377|
|/train_exp_CFL/MetricPreprocessor.extract_features/instance type iter/get values from df/index         |      4.81%|     43.17%|      19|     2.217s|     0.117(±     0.025)s|     0.090~     0.181|
|/train_exp_CFL/MetricPreprocessor.extract_features/ts select                                           |      0.15%|      1.04%|       1|     0.068s|     0.068(±     0.000)s|     0.068~     0.068|
|/train_exp_CFL/_get_global_id_resolver                                                                 |      0.00%|      0.00%|       1|     0.000s|     0.000(±     0.000)s|     0.000~     0.000|

2022-09-03 03:03:22.513 | INFO     | DejaVu.workflow:<lambda>:124 - command output one-line summary: 53.12,90.62,100.00,100.00,1.56,46.105078504000005,,,/SSF/output/run_GAT_node_classification.py.2022-09-03T03:02:35.245139,,,,python exp/run_GAT_node_classification.py -H=4 -L=8 -fe=GRU -bal=True --data_dir=./data/A1 --max_epoch=20,https://github.com/NetManAIOps/DejaVu/tree/00d36dd07eed266840840769ecbc4abf0322319a
train finished. saved to /SSF/output/run_GAT_node_classification.py.2022-09-03T03:02:35.245139
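
For reference, the summary metrics in the table above (A@1, A@2, A@3, A@5, MAR) can be recomputed from per-failure ranks. The following is a minimal sketch, not the DejaVu implementation. It assumes each test failure has a single ground-truth root cause with a 1-based rank, that A@k is the fraction of failures ranked within the top k, and that MAR is the mean of those ranks. The `ranks` list is hypothetical: it is not taken from the log, but is one distribution (32 failures: 17 at rank 1, 12 at rank 2, 3 at rank 3) that is consistent with the reported values. The function names are illustrative only.

```python
from typing import Sequence


def top_k_accuracy(ranks: Sequence[int], k: int) -> float:
    """Fraction of failures whose ground truth is ranked within the top k (1-based)."""
    return sum(r <= k for r in ranks) / len(ranks)


def mean_average_rank(ranks: Sequence[int]) -> float:
    """Mean rank of the ground truth, assuming one ground truth per failure."""
    return sum(ranks) / len(ranks)


# Hypothetical ranks (assumption, not read from the log): 32 test failures,
# chosen so the results match the reported summary.
ranks = [1] * 17 + [2] * 12 + [3] * 3

for k in (1, 2, 3, 5):
    print(f"A@{k} = {top_k_accuracy(ranks, k)}")
print(f"MAR = {mean_average_rank(ranks)}")
```

Under these assumptions the sketch gives A@1 = 0.53125, A@2 = 0.90625, A@3 = A@5 = 1.0 and MAR = 1.5625, matching the first five fields of the one-line summary (53.12, 90.62, 100.00, 100.00, 1.56).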

lizeyan commented 2 years ago

File structure:

[image: screenshot of the file structure]