Clone the source from https://github.com/LogAnalysisTeam/ml4logs and activate your virtual environment (conda, venv). Then either install the package as usual:

```
python setup.py install
```

or in development mode:

```
python setup.py develop
```
Various pipelines are run using batch scripts in `scripts/`. We suggest running the scripts via the `Makefile`:

```
make COMMAND_NAME
```

The scripts support the SLURM cluster batch scheduler. Set the `ML4LOGS_SHELL` environment variable to `sbatch` if you perform the experiments on the cluster. See the RCI Quick Start for full details on how to set up the development environment.
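For example, on a SLURM cluster the setup could look like the following sketch (assuming `sbatch` is available on your `PATH`):

```shell
# Submit experiment batches through the SLURM scheduler instead of plain bash
export ML4LOGS_SHELL=sbatch
echo "batches will be submitted via: $ML4LOGS_SHELL"
```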
If an `init_environment.sh` script exists in the project root directory, it is sourced (via the bash `source` command) prior to running any batch script in `scripts/`. Use it to set up your virtual environment, scheduler modules, etc.
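As an illustration, a minimal `init_environment.sh` might look like the sketch below; the module and virtual environment names are placeholders for whatever your cluster provides, not part of the project:

```shell
#!/usr/bin/env bash
# Hypothetical example -- adjust module and environment names to your cluster.
# module load Python/3.8                 # cluster-provided Python module
# source ~/venvs/ml4logs/bin/activate    # your virtual environment
export ML4LOGS_SHELL=sbatch              # submit batches via SLURM
```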
```
make hdfs1_100k_data
make hdfs1_100k_preprocess
make hdfs1_100k_train_test
```

```
make hdfs1_data
make hdfs1_preprocess
make hdfs1_train_test
```
The following table (generated using a script) shows the current log anomaly detection (LAD) method leaderboard for the HDFS1 dataset. The methods are sorted by decreasing F1 score.
| Method | Preprocess | Precision | Recall | F1 | MCC |
|---|---|---|---|---|---|
| PCA | Drain3 | 0.849 | 0.809 | 0.828 | 0.824 |
| Isolation Forest (`sklearn`) | Drain3 | 0.808 | 0.800 | 0.804 | 0.798 |
| Local Outlier Factor (`sklearn`) | Drain3 | 0.429 | 0.928 | 0.587 | 0.616 |
| Isolation Forest (`sklearn`) | fastText block-max | 0.989 | 0.364 | 0.532 | 0.594 |
| PCA | fastText block-max | 0.380 | 0.384 | 0.382 | 0.363 |
| Local Outlier Factor (`sklearn`) | fastText block-max | 0.258 | 0.014 | 0.027 | 0.055 |
| Method | Preprocess | Precision | Recall | F1 | MCC |
|---|---|---|---|---|---|
| Decision Tree | Drain3 | 0.997 | 0.999 | 0.998 | 0.998 |
| Logistic Regression | Drain3 | 0.980 | 0.995 | 0.988 | 0.987 |
| LSTM M2O | fastText | 0.992 | 0.471 | 0.639 | 0.678 |
| Decision Tree | fastText block-max | 0.614 | 0.634 | 0.624 | 0.612 |
| Logistic Regression | fastText block-max | 0.911 | 0.420 | 0.575 | 0.612 |
| Linear SVC | fastText block-max | 0.948 | 0.387 | 0.550 | 0.599 |
| Linear SVC | Drain3 | 1.000 | 0.230 | 0.375 | 0.475 |
| LSTM M2M | fastText | 0.874 | 0.111 | 0.197 | 0.309 |
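The reported F1 score is the harmonic mean of precision and recall, so the table rows can be sanity-checked directly (small last-digit differences can occur because the table entries are themselves rounded):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Isolation Forest + Drain3 row from the unsupervised table
print(round(f1_score(0.808, 0.800), 3))  # -> 0.804
# Decision Tree + Drain3 row from the supervised table
print(round(f1_score(0.997, 0.999), 3))  # -> 0.998
```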
Notes:

- Pipelines are configured via files in the `configs/` directory.
- `data`
- `drain_preprocess`
- `fasttext_preprocess`
- `drain_loglizer` - trains and tests the models specified by loglizer on the Drain-parsed dataset.
- `fasttext_loglizer` - trains and tests the loglizer-specified models on aggregated fastText embeddings.
- `fasttext_seq2seq`
TODO put result tables here
- `N` - number of log lines
- `B` - number of blocks (e.g. `blk_` in HDFS)
- `E` - number of event ids (e.g. extracted by Drain)
- `F` - embedding dimension (e.g. fastText)
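Given that legend, the expected array shapes can be checked with a quick sketch on synthetic stand-in arrays (the real `.npy` files live under `data/interim/{DATASET_NAME}/` and are produced by the pipelines):

```python
import numpy as np

# Synthetic stand-ins with the documented shapes
N, B, F = 100, 10, 100
fasttext = np.random.rand(N, F).astype(np.float32)  # fasttext.npy: (N, F)
timedeltas = np.random.rand(N)                      # timedeltas.npy: (N,)
labels = np.random.randint(0, 2, size=B)            # labels.npy: (B,), 1 = anomaly

# fasttext-timedeltas.npy concatenates the embedding with the timedelta column
fasttext_td = np.hstack([fasttext, timedeltas[:, None]])
print(fasttext_td.shape)  # (N, F + 1) -> (100, 101)
```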
```
data
├── interim
│   └── {DATASET_NAME}
│       ├── blocks.npy              (N,)      Block ids
│       ├── fasttext-timedeltas.npy (N, F+1)  fastText embeddings with timedeltas
│       ├── fasttext.npy            (N, F)    fastText embeddings
│       ├── ibm_drain-eventids.npy  (N,)      Event ids
│       ├── ibm_drain-templates.csv (E,)      Event ids, their templates and occurrences
│       ├── labels.npy              (B,)      Labels (1 stands for anomaly, 0 for normal)
│       ├── logs.txt                          Raw logs
│       └── timedeltas.npy          (N,)      Timedeltas
├── processed
│   └── {DATASET_NAME}
│       ├── fasttext-average.npz    (B, F+1)  fastText embeddings with timedeltas aggregated by blocks
│       └── ibm_drain.npz           (B, E)    Count vectors
└── raw
    └── {DATASET_NAME}
        ├── {ARCHIVE}.tar.gz
        └── Dataset specific files
```
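The processed `.npz` files can be inspected with `numpy.load`. The sketch below simulates a `(B, E)` count-vector file in memory; the array key `x` is an assumption for illustration, so inspect `npz.files` for the actual names in the real files:

```python
import io
import numpy as np

# Simulate a processed file: a B x E count matrix (blocks x event ids)
B, E = 5, 8
counts = np.random.randint(0, 4, size=(B, E))
buf = io.BytesIO()
np.savez(buf, x=counts)   # real key names may differ; check npz.files
buf.seek(0)

npz = np.load(buf)
print(npz.files)          # lists the stored array names
print(npz["x"].shape)     # (B, E) count vectors per block
```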
[1] M. Souček, "Log Anomaly Detection", master thesis, Czech Technical University in Prague, 2020.