LogAnalysisTeam / ml4logs

Machine Learning methods for log file processing
MIT License

Installation

  1. Clone the source: https://github.com/LogAnalysisTeam/ml4logs

  2. Activate your virtual environment (conda, venv).

  3. Either install the package as usual:

python setup.py install

or in development mode:

python setup.py develop
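
Note that recent setuptools versions deprecate `python setup.py install`/`develop`; the equivalent pip commands should work as well:

pip install .

or, for development mode:

pip install -e .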

Usage

Various pipelines are run as batch scripts in scripts/. We suggest running them via the Makefile:

make COMMAND_NAME

The scripts support the SLURM cluster batch scheduler. Set the ML4LOGS_SHELL environment variable to sbatch if you run the experiments on a cluster. See RCI Quick Start for full details on how to set up the development environment.
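
For example, to submit a pipeline through SLURM rather than running it locally (setting the variable for a single invocation):

ML4LOGS_SHELL=sbatch make COMMAND_NAME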

If an init_environment.sh script exists in the project root directory, it is sourced (via the bash source command) prior to running any batch in scripts/. Use it to set up your virtual environment, scheduler modules, etc.
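
A minimal sketch of such a script; the module name and virtualenv path are illustrative placeholders, not part of the repository:

```bash
# init_environment.sh -- sourced before every batch in scripts/
module load Python/3.8                # placeholder: load cluster modules, if any
source ~/venvs/ml4logs/bin/activate   # placeholder: activate your virtual environment
```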

Predefined benchmark pipelines include:

- Run Benchmark on HDFS1 (100k lines)
- Run Benchmark on HDFS1

Results

The following tables (generated by a script) show the current log anomaly detection (LAD) method leaderboard for the HDFS1 dataset. Within each table, the methods are sorted by decreasing F1 score.
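
For reference, F1 is the harmonic mean of precision (P) and recall (R), F1 = 2PR / (P + R), and MCC denotes the Matthews correlation coefficient. A quick sanity check of one reported row (Isolation Forest + Drain3 in the first table):

```python
# Recompute F1 from the reported precision and recall.
p, r = 0.808, 0.800
f1 = 2 * p * r / (p + r)
print(round(f1, 3))  # 0.804, matching the table
```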

Unsupervised/Semi-Supervised Methods

| Method | Preprocess | Precision | Recall | F1 | MCC |
|---|---|---|---|---|---|
| PCA | Drain3 | 0.849 | 0.809 | 0.828 | 0.824 |
| Isolation Forest (sklearn) | Drain3 | 0.808 | 0.800 | 0.804 | 0.798 |
| Local Outlier Factor (sklearn) | Drain3 | 0.429 | 0.928 | 0.587 | 0.616 |
| Isolation Forest (sklearn) | fastText block-max | 0.989 | 0.364 | 0.532 | 0.594 |
| PCA | fastText block-max | 0.380 | 0.384 | 0.382 | 0.363 |
| Local Outlier Factor (sklearn) | fastText block-max | 0.258 | 0.014 | 0.027 | 0.055 |

Supervised Methods

| Method | Preprocess | Precision | Recall | F1 | MCC |
|---|---|---|---|---|---|
| Decision Tree | Drain3 | 0.997 | 0.999 | 0.998 | 0.998 |
| Logistic Regression | Drain3 | 0.980 | 0.995 | 0.988 | 0.987 |
| LSTM M2O | fastText | 0.992 | 0.471 | 0.639 | 0.678 |
| Decision Tree | fastText block-max | 0.614 | 0.634 | 0.624 | 0.612 |
| Logistic Regression | fastText block-max | 0.911 | 0.420 | 0.575 | 0.612 |
| Linear SVC | fastText block-max | 0.948 | 0.387 | 0.550 | 0.599 |
| Linear SVC | Drain3 | 1.000 | 0.230 | 0.375 | 0.475 |
| LSTM M2M | fastText | 0.874 | 0.111 | 0.197 | 0.309 |


Scripts and Configuration Files

data

drain_preprocess

fasttext_preprocess

drain_loglizer

Trains and tests the models provided by loglizer on the Drain-parsed dataset. These are the methods listed with the Drain3 preprocess in the results tables above.

fasttext_loglizer

Trains and tests the same loglizer models on block-aggregated fastText embeddings (the fastText block-max rows in the tables above).

fasttext_seq2seq
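
A hypothetical end-to-end run, assuming the script names above double as Makefile targets (a guess based on the make COMMAND_NAME convention; check the Makefile for the actual target names):

make data
make drain_preprocess
make drain_loglizer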

Results

TODO put result tables here

Data Files Description

Block-Level Labeled Datasets (e.g., HDFS)

N - Number of log lines
B - Number of blocks (e.g. blk_ in HDFS)
E - Number of event ids (e.g. extracted by drain)
F - Embedding dimension (e.g. fasttext)
data
├── interim
│   └── {DATASET_NAME}
│       ├── blocks.npy                  (N, )       Block ids
│       ├── fasttext-timedeltas.npy     (N, F + 1)  Fasttext embeddings with timedeltas
│       ├── fasttext.npy                (N, F)      Fasttext embeddings
│       ├── ibm_drain-eventids.npy      (N, )       Event ids
│       ├── ibm_drain-templates.csv     (E, )       Event ids, their templates and occurrences
│       ├── labels.npy                  (B, )       Labels (1 stands for anomaly, 0 for normal)
│       ├── logs.txt                                Raw logs
│       └── timedeltas.npy              (N, )       Timedeltas
├── processed
│   └── {DATASET_NAME}
│       ├── fasttext-average.npz        (B, F + 1)  Fasttext embeddings with timedeltas aggregated by blocks
│       └── ibm_drain.npz               (B, E)      Count vectors
└── raw
    └── {DATASET_NAME}
        ├── {ARCHIVE}.tar.gz
        └── Dataset specific files
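
A short sketch of how these files can be inspected with NumPy; paths assume DATASET_NAME=HDFS1, and the key names inside the .npz archives are not documented here, so the code only lists them:

```python
import numpy as np

# Per-line and per-block arrays from the interim directory.
# Object-dtype arrays (e.g. string block ids) may need allow_pickle=True.
blocks = np.load("data/interim/HDFS1/blocks.npy", allow_pickle=True)  # (N,) block ids
labels = np.load("data/interim/HDFS1/labels.npy")                     # (B,) 1 = anomaly, 0 = normal
print(blocks.shape, labels.shape)

# .npz files are archives that may hold several arrays; list the keys before indexing.
archive = np.load("data/processed/HDFS1/ibm_drain.npz")
print(archive.files)  # array names, e.g. for the (B, E) count-vector matrix
```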
