d0ng1ee / logdeep

log anomaly detection toolkit including DeepLog
MIT License

About Sampling (or Feature Extraction) #5

Open rhanqtl opened 4 years ago

rhanqtl commented 4 years ago

Hi!

I think Section 3B of this paper (Chinese translation here) may help people understand these sampling methods.

B. Feature Extraction

The main purpose of this step is to extract valuable features from log events that can be fed into anomaly detection models. The input of feature extraction is the log events generated in the log parsing step, and the output is an event count matrix. To extract features, we first need to separate the log data into groups, where each group represents a log sequence. To do so, windowing is applied to divide the log dataset into finite chunks [5]. As illustrated in Figure 1, we use three different types of windows: fixed windows, sliding windows, and session windows.

Fixed window: Both fixed windows and sliding windows are based on the timestamp, which records the occurrence time of each log. Each fixed window has a size, which is its time span or duration. As shown in Figure 1, the window size is Δt, a constant value such as one hour or one day. Thus, the number of fixed windows depends on the predefined window size. Logs that occur in the same window are regarded as one log sequence.
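For concreteness, here is a minimal sketch of fixed-window grouping; the `(timestamp, event_id)` log representation and the one-hour default `window_size` are illustrative assumptions, not part of logdeep's API:

```python
from collections import defaultdict

def fixed_windows(logs, window_size=3600.0):
    """Group (timestamp, event_id) pairs into non-overlapping windows of
    `window_size` seconds; each window becomes one log sequence."""
    if not logs:
        return []
    start = min(t for t, _ in logs)
    buckets = defaultdict(list)
    for t, event in logs:
        buckets[int((t - start) // window_size)].append(event)
    # Return sequences ordered by window index.
    return [buckets[i] for i in sorted(buckets)]

# Example: three events, one-hour windows.
print(fixed_windows([(0, "E1"), (10, "E3"), (4000, "E2")]))
# -> [['E1', 'E3'], ['E2']]
```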

Sliding window: Unlike fixed windows, sliding windows have two attributes: window size and step size, e.g., hourly windows sliding every five minutes. In general, the step size is smaller than the window size, which causes different windows to overlap. Figure 1 shows that the window size is ΔT, while the step size is the forwarding distance. The number of sliding windows, which is often larger than the number of fixed windows, depends mainly on both the window size and the step size. Logs that occur in the same sliding window are also grouped as a log sequence, though a log may appear in multiple sliding windows due to the overlap.
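A sketch of sliding-window grouping under the same assumed `(timestamp, event_id)` representation; `window_size` and `step_size` are illustrative parameters (hourly windows sliding every five minutes, as in the example above):

```python
def sliding_windows(logs, window_size=3600.0, step_size=300.0):
    """Group (timestamp, event_id) pairs into overlapping windows: each window
    spans `window_size` seconds and starts `step_size` seconds after the
    previous one, so one log may fall into several windows."""
    if not logs:
        return []
    logs = sorted(logs)
    start, end = logs[0][0], logs[-1][0]
    sequences = []
    win_start = start
    while win_start <= end:
        win_end = win_start + window_size
        sequences.append([e for t, e in logs if win_start <= t < win_end])
        win_start += step_size
    return sequences

# Hourly windows sliding every five minutes (300 s).
seqs = sliding_windows([(0, "E1"), (200, "E2"), (4000, "E3")])
print(len(seqs), seqs[0])
# -> 14 ['E1', 'E2']
```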

Session window: In contrast to the two windowing types above, session windows are based on identifiers instead of timestamps. Identifiers are used to mark different execution paths in some log data. For instance, HDFS logs with block_id record the allocation, writing, replication, and deletion of a certain block. Thus, we can group logs according to the identifiers, where each session window has a unique identifier.
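A sketch of session windows for HDFS-style logs, grouping events by the block_id found in the raw message; the regex and the `(message, event_id)` input format are assumptions for illustration only:

```python
import re
from collections import defaultdict

# block_id pattern as it appears in HDFS logs, e.g. "blk_-1608999687919862906".
BLOCK_ID = re.compile(r"(blk_-?\d+)")

def session_windows(parsed_logs):
    """Group (message, event_id) pairs into sessions keyed by block_id."""
    sessions = defaultdict(list)
    for message, event in parsed_logs:
        match = BLOCK_ID.search(message)
        if match:  # skip lines that carry no identifier
            sessions[match.group(1)].append(event)
    return sessions

logs = [("Receiving block blk_-1608999687919862906 src: ...", "E5"),
        ("PacketResponder 1 for block blk_-1608999687919862906 terminating", "E9")]
print(dict(session_windows(logs)))
# -> {'blk_-1608999687919862906': ['E5', 'E9']}
```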

After constructing the log sequences with windowing techniques, an event count matrix X is generated. In each log sequence, we count the number of occurrences of each log event to form an event count vector. For example, if the event count vector is [0, 0, 2, 3, 0, 1, 0], it means that event 3 occurred twice and event 4 occurred three times in this log sequence. Finally, the event count vectors are assembled into an event count matrix X, where entry Xi,j records how many times event j occurred in the i-th log sequence.
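A sketch of turning the grouped sequences into an event count matrix, reproducing the [0, 0, 2, 3, 0, 1, 0] example above; the vocabulary size of 7 events and integer event ids are assumptions taken from that example:

```python
import numpy as np

def event_count_matrix(sequences, num_events):
    """Build X where X[i, j] counts how often event j occurs in sequence i.
    Events are assumed to be integer ids in [1, num_events]."""
    X = np.zeros((len(sequences), num_events), dtype=int)
    for i, seq in enumerate(sequences):
        for event in seq:
            X[i, event - 1] += 1
    return X

# One sequence in which event 3 occurs twice, event 4 three times, event 6 once.
print(event_count_matrix([[3, 4, 3, 4, 4, 6]], num_events=7))
# -> [[0 0 2 3 0 1 0]]
```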


d0ng1ee commented 4 years ago

This paper is very helpful for those who are new to log anomaly detection :+1: