Root cause analysis (RCA) is the task of identifying the underlying causes of system faults or failures by analyzing system monitoring data. LEMMA-RCA is a collection of multi-modal datasets with various real system faults, intended to facilitate future research in RCA. It is also multi-domain, encompassing real-world applications such as microservice and water treatment/distribution systems. The datasets are released under the CC BY-NC 4.0 license and hosted on Hugging Face; the code is available on GitHub.
Each dataset contains various system faults simulated from real-world scenarios. For details, please check our website.
LEMMA-RCA covers two domains, and we provide both the raw and the preprocessed data. The datasets are released on Hugging Face, and detailed data statistics can be found on the LEMMA-RCA webpage.
cd ./IT/data_preprocessing
If you only want to test the performance of the baseline methods, you can download the preprocessed data directly.
LEMMA-RCA datasets are evaluated with eight causal learning baselines in four settings: online/offline with single/multiple modality data.
Example: Using FastPC to evaluate the performance on case 20211203 in Product Review
You need to download both log and metric data if you would like to test the performance of FastPC on multi-modal data.
cd ./IT/data_preprocessing
python json2message.py
Note: some of the arguments may need to be changed:
--path, the input directory of the json format log data
--output_dir, the output directory of all log messages
--output_dir2, the output directory of pod-level log messages for each pod
--output_dir3, the output directory of node-level log messages for each node
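The call with all four arguments set explicitly can be assembled as in the sketch below. The flag names come from the list above. The directory values and the helper name `build_json2message_cmd` are placeholders for illustration. Mapping `--output_dir2` and `--output_dir3` to the `log_prep_pod` and `log_prep_node` folders is an assumption based on the paths used in the Drain3 step.

```python
import shlex

def build_json2message_cmd(path, output_dir, output_dir2, output_dir3):
    """Return the argv list for json2message.py with all four directories set."""
    return [
        "python", "json2message.py",
        "--path", path,                # input directory of JSON-format logs
        "--output_dir", output_dir,    # all log messages
        "--output_dir2", output_dir2,  # pod-level log messages
        "--output_dir3", output_dir3,  # node-level log messages
    ]

cmd = build_json2message_cmd(
    path="./raw_logs/",                     # hypothetical input directory
    output_dir="./output/log_all/",         # hypothetical output directory
    output_dir2="./output/log_prep_pod/",   # assumed to match the Drain3 pod input
    output_dir3="./output/log_prep_node/",  # assumed to match the Drain3 node input
)
print(shlex.join(cmd))
# To execute instead of printing, run from ./IT/data_preprocessing:
#   import subprocess; subprocess.run(cmd, check=True)
```

Printing the command first makes it easy to check the paths before anything is written to disk.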
python drain3_parse.py ./output/log_prep_node/ -o "./drain3_result/node"
python drain3_parse.py ./output/log_prep_pod/ -o "./drain3_result/pod"
--input_dir, default="./output/log_prep_node/" or "./output/log_prep_pod/"
--output_dir, default="./drain3_result/node" or "./drain3_result/pod"
python log_frequency_extraction.py --log_dir ./input_path/ --output_dir ./output_path
python log_golden_frequency.py --root_path ./input_path/ --output_dir ./output_path --save_dir ./output_path
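For intuition, frequency extraction turns parsed log events into a time series that can be analyzed alongside metrics. The sketch below is not the repository's implementation. It assumes each parsed event is a (Unix timestamp, template id) pair and counts events per template in fixed windows. The function name `template_frequencies`, the 60-second window, and the sample events are all hypothetical.

```python
from collections import Counter, defaultdict

def template_frequencies(events, window_seconds=60):
    """Count occurrences of each template id per fixed-size time window.

    events: iterable of (unix_timestamp, template_id) pairs.
    Returns {window_start: Counter({template_id: count})}.
    """
    buckets = defaultdict(Counter)
    for timestamp, template_id in events:
        # Align the timestamp to the start of its window.
        window_start = int(timestamp) // window_seconds * window_seconds
        buckets[window_start][template_id] += 1
    return dict(buckets)

# Made-up events, for illustration only.
sample_events = [(0, "T1"), (10, "T1"), (59, "T2"), (60, "T1"), (125, "T2")]
freq = template_frequencies(sample_events)
for start in sorted(freq):
    print(start, dict(freq[start]))
```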
python test_FastPC_pod_metric.py --dataset 20211203 --path_dir CHANGE_PATH_TO_DATASET_DIRECTORY --output_dir CHANGE_PATH_TO_OUTPUT_DIRECTORY
You may also test the performance of FastPC with log data only, or with both modalities, using the following commands:
python test_FastPC_pod_log.py ## for log data only
python test_FastPC_pod_combine.py ## for both metric and log data
The results are stored in a CSV file, as follows:
./Baseline/offline/output/Pod_level_combine_ranking.csv
The root cause for 20211203 (MongoDB-v1) can be found in the readme.pptx file in the folder of downloaded preprocessed data.
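To compare the ranking against the documented root cause, the result file can be previewed with the standard library. The column layout of the CSV is not described here, so this sketch makes no assumption about it and only prints the header and the first few rows. `preview_csv` is a hypothetical helper, not part of the repository.

```python
import csv
from itertools import islice
from pathlib import Path

def preview_csv(csv_path, n_rows=5):
    """Return (header, first n_rows data rows) of a CSV file as lists of strings."""
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])  # empty list if the file has no rows
        rows = list(islice(reader, n_rows))
    return header, rows

ranking_path = Path("./Baseline/offline/output/Pod_level_combine_ranking.csv")
if ranking_path.exists():
    header, rows = preview_csv(ranking_path)
    print(header)
    for row in rows:
        print(row)
else:
    print(f"{ranking_path} not found; run the evaluation script first")
```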
If you use LEMMA-RCA in your work, please cite our paper:
Lecheng Zheng, Zhengzhang Chen, Dongjie Wang, Chengyuan Deng, Reon Matsuoka, and Haifeng Chen: LEMMA-RCA: A Large Multi-modal Multi-domain Dataset for Root Cause Analysis. CoRR abs/2406.05375 (2024)
[1] Dongjie Wang, Zhengzhang Chen, Yanjie Fu, Yanchi Liu, Haifeng Chen: Incremental Causal Graph Learning for Online Root Cause Analysis. KDD 2023: 2269-2278.
[2] Dongjie Wang, Zhengzhang Chen, Jingchao Ni, Liang Tong, Zheng Wang, Yanjie Fu, Haifeng Chen: Interdependent Causal Networks for Root Cause Localization. KDD 2023: 5051-5060.
Creative Commons Attribution-NonCommercial (CC BY-NC) 4.0 International License
You may not use the code or data for commercial purposes.