KnowledgeDiscovery / rca_baselines

Code for the paper "LEMMA-RCA: A Large Multi-modal Multi-domain Dataset for Root Cause Analysis".
https://lemma-rca.github.io/

LEMMA-RCA

Root cause analysis (RCA) is the task of identifying the underlying causes of system faults and failures by analyzing system monitoring data. LEMMA-RCA is a collection of multi-modal datasets covering various real system faults, intended to facilitate future research in RCA. It is also a multi-domain dataset, encompassing real-world applications such as microservice systems and water treatment/distribution systems. The datasets are released under the CC BY-NC 4.0 license and hosted on Hugging Face; the code is available on GitHub.


Real System Faults

Each dataset contains various system faults simulated from real-world scenarios. For details, please check our website.

Multiple Domains and Dataset Download

LEMMA-RCA covers two domains, and we provide both the raw data and the preprocessed data. The dataset is released on Hugging Face, and detailed data statistics can be found on the LEMMA-RCA webpage.

Unified Evaluation

LEMMA-RCA datasets are evaluated with eight causal learning baselines in four settings: online and offline, each with single-modal or multi-modal data.

Guideline for Evaluation

Example: Using FastPC to evaluate performance on Case 20211203 in Product Review

Step 1: Download Case 20211203 from the preprocessed data on Hugging Face.

You need to download both the log and metric data if you want to test FastPC on multi-modal data.

Notice: If you want to use metric data only, you can skip Steps 2 to 5 and go directly to Step 6 to detect the root cause with metric data.

Step 2: Use the code in the IT folder to preprocess the log data.

cd ./IT/data_preprocessing

Step 3: Extract useful log information (such as pod/node names and log messages) from the original Elasticsearch logs (JSON format).

python json2message.py

Notice: You may need to change some of the arguments:

    --path, the input directory of the json format log data
    --output_dir, the output directory of all log messages
    --output_dir2, the output directory of pod-level log messages for each pod
    --output_dir3, the output directory of node-level log messages for each node
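The extraction in Step 3 can be sketched roughly as follows. This is a minimal illustration, not the code of json2message.py: it assumes each Elasticsearch hit stores its fields under `_source`, and the field names `pod_name`, `node_name`, `message` and `@timestamp` are hypothetical placeholders that you would replace with the keys in your export. The pod and node names in the sample are made up.

```python
import json
from collections import defaultdict

# Hypothetical field names; the actual keys in the LEMMA-RCA Elasticsearch
# export may differ, so adjust them before use.
POD_KEY = "pod_name"
NODE_KEY = "node_name"
MSG_KEY = "message"
TS_KEY = "@timestamp"


def load_hits(path):
    """Load records from a JSON file holding either a plain list of hits or an
    Elasticsearch search response ({"hits": {"hits": [...]}})."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    hits = data["hits"]["hits"] if isinstance(data, dict) else data
    # Each hit keeps its document under "_source"; fall back to the hit itself.
    return [hit.get("_source", hit) for hit in hits]


def split_log_messages(records):
    """Collect (timestamp, message) pairs overall, per pod, and per node."""
    all_msgs = []
    per_pod = defaultdict(list)
    per_node = defaultdict(list)
    for rec in records:
        msg = rec.get(MSG_KEY)
        if not msg:
            continue  # skip entries that carry no log text
        item = (rec.get(TS_KEY), msg.strip())
        all_msgs.append(item)
        if rec.get(POD_KEY):
            per_pod[rec[POD_KEY]].append(item)
        if rec.get(NODE_KEY):
            per_node[rec[NODE_KEY]].append(item)
    return all_msgs, dict(per_pod), dict(per_node)


if __name__ == "__main__":
    # Made-up sample records in the assumed layout.
    sample = [
        {"@timestamp": "2021-12-03T10:00:00Z", "pod_name": "mongodb-v1-0",
         "node_name": "node-1", "message": "connection refused\n"},
        {"@timestamp": "2021-12-03T10:00:05Z", "pod_name": "frontend-0",
         "node_name": "node-1", "message": "upstream timeout"},
    ]
    all_msgs, per_pod, per_node = split_log_messages(sample)
    print(len(all_msgs), sorted(per_pod), sorted(per_node))
```

Each per-pod and per-node list could then be written to its own file, mirroring the `--output_dir2` and `--output_dir3` outputs described above.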

Step 4: Use Drain to parse both node-level and pod-level log messages.

python drain3_parse.py ./output/log_prep_node/  -o "./drain3_result/node"

python drain3_parse.py ./output/log_prep_pod/   -o "./drain3_result/pod"

Arguments:
    --input_dir, default="./output/log_prep_node/" or "./output/log_prep_pod/"
    --output_dir, default="./drain3_result/node"   or "./drain3_result/pod"
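Drain groups log lines that share a fixed structure into templates and replaces variable tokens with a wildcard. drain3_parse.py presumably relies on the drain3 package, judging by its name. The toy version below only illustrates the idea and is deliberately simplified: it buckets messages by token count and merges them by token-level similarity, skipping Drain's fixed-depth prefix tree and configurable masking. The class name `SimpleDrain` and the `sim_threshold` parameter are hypothetical and not part of this repository.

```python
import re


class SimpleDrain:
    """Toy Drain-style template miner (a simplification, not drain3)."""

    def __init__(self, sim_threshold=0.5):
        self.sim_threshold = sim_threshold
        self.buckets = {}    # token count -> list of cluster ids
        self.templates = []  # cluster id -> list of template tokens

    @staticmethod
    def _tokenize(message):
        # Mask any token containing a digit (IDs, IPs, counters) up front.
        return ["<*>" if re.search(r"\d", tok) else tok for tok in message.split()]

    @staticmethod
    def _similarity(template, tokens):
        # Fraction of positions whose tokens are identical.
        same = sum(1 for t, w in zip(template, tokens) if t == w)
        return same / len(tokens) if tokens else 1.0

    def add(self, message):
        """Assign a message to a template; return (cluster_id, template)."""
        tokens = self._tokenize(message)
        bucket = self.buckets.setdefault(len(tokens), [])
        best_id, best_sim = None, -1.0
        for cid in bucket:
            sim = self._similarity(self.templates[cid], tokens)
            if sim > best_sim:
                best_id, best_sim = cid, sim
        if best_id is not None and best_sim >= self.sim_threshold:
            # Merge: positions that disagree become wildcards.
            self.templates[best_id] = [
                t if t == w else "<*>"
                for t, w in zip(self.templates[best_id], tokens)
            ]
        else:
            best_id = len(self.templates)
            self.templates.append(tokens)
            bucket.append(best_id)
        return best_id, " ".join(self.templates[best_id])


if __name__ == "__main__":
    miner = SimpleDrain()
    for line in ["Connection from 10.0.0.1 closed",
                 "Connection from host-a closed",
                 "Disk full"]:
        print(miner.add(line))
```

The first two lines end up in one cluster with the template `Connection from <*> closed`, while `Disk full` starts a new cluster. For real runs, the Drain implementation used by the script is the one to rely on.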

Step 5: Log feature extraction

python log_frequency_extraction.py --log_dir ./input_path/  --output_dir ./output_path

python log_golden_frequency.py --root_path ./input_path/  --output_dir ./output_path --save_dir ./output_path
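Step 5 turns parsed logs into numeric features. One common way to do this, sketched here under assumptions rather than taken from log_frequency_extraction.py, is to count how often each log template occurs in fixed time windows, giving one time series per template that can sit alongside the metric data. The function name, the `(unix_timestamp, template_id)` input format and the 60-second default window are all hypothetical.

```python
from collections import Counter, defaultdict


def template_frequency_matrix(events, window_seconds=60):
    """Count template occurrences in fixed, evenly spaced time windows.

    events: iterable of (unix_timestamp, template_id) pairs for one pod or node.
    window_seconds: window length as an int.
    Returns (window_starts, template_ids, matrix) with
    matrix[i][j] = occurrences of template_ids[j] in the window that starts
    at window_starts[i]. Windows with no events in between are zero rows.
    """
    counts = defaultdict(Counter)  # window start -> Counter(template_id)
    for ts, tid in events:
        start = int(ts // window_seconds) * window_seconds
        counts[start][tid] += 1
    if not counts:
        return [], [], []
    template_ids = sorted({tid for c in counts.values() for tid in c})
    window_starts = list(range(min(counts), max(counts) + 1, window_seconds))
    # .get avoids inserting empty windows into `counts` while reading.
    matrix = [[counts.get(w, Counter())[tid] for tid in template_ids]
              for w in window_starts]
    return window_starts, template_ids, matrix


if __name__ == "__main__":
    events = [(0, 1), (10, 1), (30, 2), (125, 1)]
    print(template_frequency_matrix(events))
```

For the sample events this yields windows starting at 0, 60 and 120 seconds, with the empty middle window kept as a zero row so the series stays evenly spaced like the metric data.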

Step 6: Evaluate the performance of FastPC on Case 20211203 with metric data only:

If you encounter the error "name 'LIBSPOT' is not defined", double-check that you are running the code from within the FastPC directory. We observe this error when running 'python FastPC/test_FastPC_pod_log.py' from the './rca_baselines/Baseline/offline/' directory.
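Because the LIBSPOT error is tied to the working directory, one way to avoid it when scripting several runs is to launch the test script with its working directory set to the FastPC folder. The helper below is a hypothetical convenience wrapper, not part of the repository; the folder path in the usage lines is an assumption about where the repository was cloned, and script arguments are left for you to supply because they are not documented here.

```python
import subprocess
import sys
from pathlib import Path


def run_in_fastpc_dir(script, fastpc_dir, args=(), dry_run=False):
    """Run `script` with `fastpc_dir` as the working directory.

    The README reports "name 'LIBSPOT' is not defined" when the FastPC test
    scripts are launched from Baseline/offline/, so the subprocess is started
    inside the FastPC folder instead. With dry_run=True nothing is executed.
    Returns the command list and the resolved working directory.
    """
    cwd = Path(fastpc_dir).resolve()
    cmd = [sys.executable, script, *args]
    if not dry_run:
        if not (cwd / script).is_file():
            raise FileNotFoundError(f"{script} not found in {cwd}")
        subprocess.run(cmd, cwd=cwd, check=True)  # raises on non-zero exit
    return cmd, cwd


if __name__ == "__main__":
    # Hypothetical path; adjust to your checkout of rca_baselines.
    cmd, cwd = run_in_fastpc_dir(
        "test_FastPC_pod_log.py", "rca_baselines/Baseline/offline/FastPC",
        dry_run=True)
    print("would run", cmd, "in", cwd)
```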

Step 7: Check the results

The results are stored in the following CSV file:

./Baseline/offline/output/Pod_level_combine_ranking.csv

The root cause of Case 20211203 (MongoDB-v1) is documented in the readme.pptx file in the downloaded preprocessed data folder.
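To check the output against the known root cause programmatically, you can read the ranking CSV and see where the true root cause lands. This sketch assumes the entity names sit in the first column and that rows are already sorted from most to least likely root cause; neither assumption is documented here, so inspect the file first. Matching is a case-insensitive substring test, so a label such as MongoDB-v1 matches any entity whose name contains it. Hit@k and reciprocal rank are generic ranking measures shown for illustration, not necessarily the metrics reported in the paper, and the example ranking is made up.

```python
import csv


def load_ranking(csv_path, name_column=0, has_header=True):
    """Return entity names from a ranking CSV, assuming rows are in rank order."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = [row for row in csv.reader(f) if row]  # drop blank lines
    if has_header:
        rows = rows[1:]
    return [row[name_column] for row in rows]


def _is_root_cause(name, root_causes):
    # Case-insensitive substring match, e.g. "MongoDB-v1" in "mongodb-v1-0".
    lowered = name.lower()
    return any(rc.lower() in lowered for rc in root_causes)


def hit_at_k(ranking, root_causes, k):
    """1.0 if a true root cause appears in the top-k entries, else 0.0."""
    return float(any(_is_root_cause(n, root_causes) for n in ranking[:k]))


def reciprocal_rank(ranking, root_causes):
    """1 / rank of the first true root cause, or 0.0 if none is ranked."""
    for rank, name in enumerate(ranking, start=1):
        if _is_root_cause(name, root_causes):
            return 1.0 / rank
    return 0.0


if __name__ == "__main__":
    # With real output:
    # ranking = load_ranking("./Baseline/offline/output/Pod_level_combine_ranking.csv")
    ranking = ["frontend-0", "mongodb-v1-0", "cart-1"]  # made-up example ranking
    print(hit_at_k(ranking, {"MongoDB-v1"}, 1), reciprocal_rank(ranking, {"MongoDB-v1"}))
```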

Citation

If you use LEMMA-RCA in your work, please cite our paper:

Lecheng Zheng, Zhengzhang Chen, Dongjie Wang, Chengyuan Deng, Reon Matsuoka, and Haifeng Chen: LEMMA-RCA: A Large Multi-modal Multi-domain Dataset for Root Cause Analysis. CoRR abs/2406.05375 (2024)


License

Creative Commons Attribution-NonCommercial (CC BY-NC) 4.0 International License

You cannot use the code or data for commercial purposes.