LogIntelligence / LogADEmpirical

Log-based Anomaly Detection with Deep Learning: How Far Are We? (ICSE 2022, Technical Track)

Some questions about loading the datasets, especially HDFS. #6

Open yzbrlan opened 2 years ago

yzbrlan commented 2 years ago

I have some questions about loading the datasets, especially HDFS.

RQ1: What is the parameter "history_size" used for? What's the value of "history_size"?

The explanation of "history_size" in the code says it is used to split sequences for DeepLog, LogAnomaly & LogRobust. I find that "history_size" is used in the sliding_window() function, as shown in the screenshot below. In this function, the sequence is split with a fixed window whose size is "history_size".

[screenshot: the sliding_window() function]

My question is: why do you use "history_size" to fix the length of the data sequences, even for the session-windowed HDFS data? As a result, every sequence in the final training dataset has length "history_size". [screenshots]

Is there a mistake in my understanding? Could you help clarify this?
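To make the concern concrete, here is a rough sketch of what I understand the splitting to do (my own simplification; only the parameter name history_size comes from the repo, the rest is illustrative):

```python
def split_with_history(session_seq, history_size):
    """Turn one session (e.g., one HDFS block's event-id sequence) into
    fixed-length samples of length `history_size`, sliding by one event."""
    samples = []
    for i in range(len(session_seq) - history_size):
        window = session_seq[i:i + history_size]    # always length history_size
        next_event = session_seq[i + history_size]  # target for forecasting-style models
        samples.append((window, next_event))
    return samples

# Example: a session of 7 events with history_size = 4 yields 3 samples,
# each of length 4, regardless of the original session length.
print(split_with_history(list("abcdefg"), 4))
```

If this matches the repo's behavior, every training sample has length "history_size" even for HDFS session data, which is what RQ1 is asking about.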

RQ2: Is there any other way to load the datasets? Perhaps an alternative would resolve RQ1?

vanhoanglepsa commented 2 years ago

Hi,

R1: To perform online detection at the log-entry level, for each log sequence to be detected we apply a sliding window of size "history_size" with a step size of 1. "window_size" is used to group logs into fixed windows (applied to BGL, Spirit, and Thunderbird).
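As a rough illustration of that entry-level detection (not the repo's actual code; `model.predict_next` is a hypothetical API for the example):

```python
def is_sequence_anomalous(log_seq, model, history_size):
    """Slide a window of size `history_size` with step 1 over one log sequence;
    the whole sequence is flagged if any single entry is judged anomalous."""
    for i in range(len(log_seq) - history_size):
        history = log_seq[i:i + history_size]
        actual_next = log_seq[i + history_size]
        # hypothetical call: top-k candidates for the next log event
        candidates = model.predict_next(history)
        if actual_next not in candidates:
            return True   # anomaly detected at this log entry
    return False
```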

R2: There are three main ways to generate log sequences (i.e., to load a dataset): fixed, sliding, and session windows; please refer to these papers ([1], [2]) for more details. Each dataset is loaded differently because of its labeling process. For example, on HDFS we can only apply session windows; for the others, we can use fixed or sliding windows.
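For illustration, a minimal sketch of the three grouping strategies (my own simplification: the fixed/sliding windows here are count-based, whereas the papers usually define them over time, and the DataFrame column names are assumed):

```python
import pandas as pd

def session_windows(df: pd.DataFrame) -> dict:
    """HDFS: group events by the block id parsed from each message (assumed 'BlockId' column)."""
    return df.groupby("BlockId")["EventId"].apply(list).to_dict()

def fixed_windows(events: list, window_size: int) -> list:
    """BGL/Spirit/Thunderbird: non-overlapping chunks of `window_size` events."""
    return [events[i:i + window_size] for i in range(0, len(events), window_size)]

def sliding_windows(events: list, window_size: int, step: int) -> list:
    """Overlapping windows of `window_size`, advanced by `step` events each time."""
    return [events[i:i + window_size]
            for i in range(0, len(events) - window_size + 1, step)]
```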

benpaobamingliang1 commented 1 year ago

> (quotes @yzbrlan's original question above)

Hello, do you know how to run this? Why does it fail when I try to run it?